Difference between revisions of "Talk:Lemon8-XML Roadmap"

Revision as of 23:19, 15 October 2009

Priorities (descending)

  • Add L8X's citation parsing, lookup/correction and editing functionality to OJS
  • Add L8X's metadata extraction to OJS
  • Add document parsing/editing capability to OJS
  • Add document conversion capability to OJS

Proposed Features / Integration Points (by descending priority)

  1. automatic citation lookup and editing in submission process ("citation box")
  2. addition of citation data in XML export (e.g. for PubMed, Synergies, and CrossRef)
  3. allow readers to view citations in multiple citation formats (including Zotero integration)
  4. generation of COinS (Context Object in Span) from citations
  5. add citation editing and lookup to editorial process
  6. automatic citation data extraction from ODT in submission process
  7. automatic metadata extraction from ODT in submission process
  8. add section parser / editor to editorial process (generate and edit full semantic XML structure) in OJS
  9. add automatic document conversion to submission process (*.*)->(*.odt) to allow automatic extraction for more formats
  10. use Harvester to retrieve additional citation meta-data that will be attached to the meta-data we already retrieve (i.e. every single harvester record may contain or point to additional citation records)
  11. provide citation support in reading tools (context sensors that use citation data to provide additional information in RT sidebar)

Proposed Architecture (citation integration only)

Citation Backend Services Library

  • Move L8X citation parser components to pkp/classes/citation/CitationParser*.inc.php
  • Move L8X citation lookup components to pkp/classes/citation/CitationLookup*.inc.php
  • Specific implementations extend a base object that enforces the API contract (template pattern); interfaces are not available in PHP4
  • Make sure that the API can be used by all PKP applications
  • Use migrated code in L8X standalone
  • Make sure that the components can be integrated/extended for metadata/section parsers/editors later
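
A minimal sketch of the "base object enforces the API contract" idea, for discussion only. The class names CitationParserBase and RegexCitationParser, the method parseInternal() and the regular expression are assumptions made for illustration; they are not existing L8X or PKP code. The sketch avoids the interface and abstract keywords and uses no visibility modifiers, with the intention of staying within PHP4 syntax.

```php
<?php
// Hypothetical sketch: none of these names exist in L8X or PKP code.

class CitationParserBase {
    // Template method: fixes the order of steps for every parser.
    // Returns an array of parsed fields or null if nothing was recognized.
    function parse($rawCitation) {
        // Normalize whitespace before handing over to the implementation.
        $rawCitation = trim(preg_replace('/\s+/', ' ', $rawCitation));
        if ($rawCitation === '') return null;

        $fields = $this->parseInternal($rawCitation);
        if (!is_array($fields) || empty($fields)) return null;

        // Drop empty values so that all parsers return a uniform result.
        return array_filter($fields, 'strlen');
    }

    // "Abstract" by convention only: subclasses must override this.
    function parseInternal($rawCitation) {
        trigger_error('parseInternal() must be overridden', E_USER_ERROR);
    }
}

class RegexCitationParser extends CitationParserBase {
    function parseInternal($rawCitation) {
        // Matches e.g. "Smith, J. (2009). Some title."
        if (!preg_match('/^(.+?)\s*\((\d{4})\)\.\s*(.+?)\.?$/', $rawCitation, $m)) {
            return array();
        }
        return array('author' => $m[1], 'year' => $m[2], 'title' => $m[3]);
    }
}

$parser = new RegexCitationParser();
$result = $parser->parse("  Smith, J.   (2009). Citation parsing in practice. ");
```

Callers only ever use parse(), so input normalization and result clean-up stay in one place while each concrete parser supplies just the recognition step.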

Citation DAO Library

  • We can use the usual PKP DAO pattern for all citation data persistence requirements

Citation GUI Pages

  • We might need an extra step in the submission process and an extra page in the editorial process for citation editing/lookup
  • Pages have to be application specific so we cannot usually share them between applications.
  • We'll however try to move as much as possible into the GUI components library for re-use, so that each page consists only of a very high-level outline that pulls in shared components
  • Apart from citation editing I don't think we'll invent new pages, it's more about integrating new components into existing pages.

Citation GUI Components Library

L8X editing capability is a lot more demanding than anything I know so far in OCS/OJS.

My bet is that 90% of the migration effort will go into the GUI migration (MJ, can you comment, please?). We have to port from scriptaculous to jQuery and from CakePHP's MVC-implementation to PKP's (including smarty). To achieve re-use between PKP applications and between pages we'll have to "componentize" the GUI more than it currently is in L8X.

These are my ideas for the GUI architecture:

  • Create L8X-specific GUI components and template fragments in WAL (e.g. citation editor component, re-use in all PKP applications)
  • Create an L8X citation renderer template library
    • One smarty template per citation style, including COinS
    • Migrate COinS plugin to use COinS template fragment
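
The COinS fragment is planned as a Smarty template. The markup such a template would have to emit can be sketched in plain PHP as follows. The function name renderCoins() and the choice of field keys are assumptions for illustration; the class Z3988, the ctx_ver value and the rft.-prefixed keys follow the COinS/OpenURL KEV convention.

```php
<?php
// Hypothetical helper: renderCoins() is not an existing PKP or L8X function.
// $fields maps OpenURL journal keys (without the "rft." prefix) to values.
function renderCoins($fields) {
    $pairs = array(
        'ctx_ver=Z39.88-2004',
        'rft_val_fmt=' . rawurlencode('info:ofi/fmt:kev:mtx:journal'),
    );
    foreach ($fields as $key => $value) {
        // KEV format: "rft."-prefixed key, URL-encoded value.
        $pairs[] = 'rft.' . $key . '=' . rawurlencode($value);
    }
    // The query string lives in an HTML attribute, so "&" must be escaped.
    $title = htmlspecialchars(implode('&', $pairs));
    return '<span class="Z3988" title="' . $title . '"></span>';
}

$coins = renderCoins(array(
    'atitle' => 'Citation parsing in practice',
    'aulast' => 'Smith',
    'date'   => '2009',
));
```

A book citation would need a different rft_val_fmt value and different keys, which is one reason to keep COinS generation in a single shared fragment.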

AJAX Request Architecture

  • We should probably think about an AJAX-specific, high-performance MVC controller architecture. This means implementing shortcuts in the request processing for AJAX requests wherever possible (performance bottleneck!)
  • Both the AJAX handler and the page handler will be based on the same base classes but will extend them differently
  • Make sure that AJAX requests cannot bypass security (maintain the single point of entry and the common security infrastructure for all types of request)

Installation/Infrastructure Requirements and Compatibility

  • No new initial installation requirements
  • Maintain PHP4 compatibility for initial installation
  • New installation requirements (additional software, PHP>4) only for optional plug-ins
  • Thorough documentation of additional installation / runtime environment requirements
  • Make sure that L8X functionality will be easily portable to other PKP products (OMP, OCS, Harvester)
  • Make sure that L8X standalone will continue working/improving by cleanly backporting/integrating migrated code to L8X (DRY!)
  • Use standard UI technology to make sure that backport of new OMP GUI will be easier
  • Comments have to follow Doxygen syntax
  • All contributions will be fully unit-test covered
  • All workflows will be fully web-test covered

GUI specification for Feature #1 (citation support in submission process)

Citation Extraction/Insertion

  1. must-have: copy & paste
    We can use the existing text field in the submission process for "bulk citation insert":
    • enter citations (text-only) -> disable TinyMCE-plugin for citation field
    • "parse" button will split up citations (one per line?) and send them to the configured parser services
    • new: parsing should be non-blocking if possible - alternatively: a progress bar should appear
    • citation editor appears as soon as citations have been recognized
  2. optional: automatic extraction
    Use document parser to extract citations:
    • recognize .odt file type and try to extract citations
    • show citations in citation editor if citations have been found
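
The "parse" button's first step, splitting the pasted text into individual citations, could look roughly like this. It assumes the "one citation per line" rule, which is still an open question above, and the function name splitRawCitations() is hypothetical.

```php
<?php
// Hypothetical helper: assumes one citation per line in the pasted text.
function splitRawCitations($text) {
    // Accept Windows, old Mac and Unix line endings.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $citations = array();
    foreach ($lines as $line) {
        $line = trim($line);
        // Blank lines are separators, not citations.
        if ($line !== '') $citations[] = $line;
    }
    return $citations;
}

$citations = splitRawCitations("Smith, J. (2009). First.\r\n\r\n  Doe, A. (2008). Second.  \n");
```

Each resulting string would then be handed to the configured parser services one at a time, which also makes per-citation progress reporting straightforward.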

Citation Parsing/Editing/Lookup

I think the current citation editor GUI is already very good. It has the following functionality:

  • open/close citation details (current bug: opening details for one citation should close all other citations)
  • save citation details
  • parse/lookup citation
  • text field for editing the unparsed text
  • moving citations up and down
  • new: allow users to move a citation anywhere in the editor
  • remove citation
  • add citation

Plugin Configuration

  • enable/disable L8X citation parsing
  • enable/disable automatic citation extraction
  • select/configure parsing services
  • select/configure lookup services

Implementation

  1. citation insertion
    • use existing text-area for input in submission step 2 (metadata)
    • "parse"-button triggers AJAX request that will insert the citation editor on the same page
    • open: non-blocking AJAX-request / progress-bar
  2. citation extraction
    • use full-page-request on file upload
    • if citations have been found then display a check-box (default: on) to enable citation extraction
    • citation editor will automatically appear in step two (metadata) with the extracted citations
  3. citation editor
    • port existing GUI to jQuery
    • "edit" triggers an AJAX request that inserts the citation field editor
    • "edit" closes other open field editor (if any)
    • "edit" for an open citation closes it
    • implement dirty-pattern to avoid losing user-data on editor close
    • "save details" and "save citation text" will become one single button ("save")
    • "save" triggers an AJAX request that persists citation text and citation fields to the database
    • "parse citation" and "lookup citation" will become one single button ("lookup")
    • "lookup" triggers an AJAX request for parsing and lookup that inserts lookup data into fields and provides the user with feedback for the parsing/lookup score
    • unparsed citation is implemented as text area
    • citation fields are implemented as input fields
    • "move up" and "move down" trigger AJAX requests that update the GUI accordingly (this is different from current implementation which triggers a full-page request that is not really usable)
    • "insert before" is a drop-down field that shows all citations by number, it has an entry "at the end..."
    • "remove citation" triggers an alert "do you really want to remove ...citation title...?" - if confirmed, an AJAX request will be triggered that persists the removal
    • "add citation" triggers an AJAX request that inserts a new citation into the GUI and opens the citation editor with empty fields
    • make sure that GUI conforms to PKP's standard design re-using existing CSS wherever possible
  4. configuration
    • check-box in setup - step4: enable/disable L8X citation parsing (if jQuery support is enabled then this will trigger the other options to appear)
    • check-box in setup - step4: enable/disable automatic citation extraction (available only if L8X parsing is enabled)
    • select/configure parsing/lookup services: use the existing GUI elements from L8X (no AJAX required, if jQuery support is enabled then dependent sub-options will only appear when the main service is enabled)
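
The server side of the "insert before" drop-down can be sketched as a small reordering helper. The function name moveCitationBefore() and the 0-based positions are assumptions; persisting the new order through the DAO is left out.

```php
<?php
// Hypothetical helper: moves the citation at position $from so that it ends
// up directly before the citation currently at position $before (0-based).
// Passing count($citations) as $before corresponds to "at the end...".
function moveCitationBefore($citations, $from, $before) {
    $moved = array_splice($citations, $from, 1);
    // Removing the element shifts all later positions down by one.
    if ($before > $from) $before--;
    array_splice($citations, $before, 0, $moved);
    return $citations;
}

$order = moveCitationBefore(array('a', 'b', 'c', 'd'), 0, 2);
```

The same helper covers "move up" and "move down" as special cases, so the AJAX handler would only need one reordering operation.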

Back-End Class Design

The Role of Plug-ins

What's our general approach to plug-ins?

  • Plug-ins use application specific hooks and therefore cannot be shared between PKP applications.
  • We'll offload as much functionality as possible to WAL and use plug-ins as thin wrappers around it where necessary.
  • We should only use plug-ins when they are really necessary (improve the user experience and/or code maintainability/testability)

When do we need plug-ins?

  • non-standard installation requirements that need to be isolated
  • complex configuration or user-interface requirements that clutter the core interface for first-time users and should be kept out of the way
  • performance implications (when switching on a functionality causes an unavoidable performance penalty)
  • isolation of application-specific citation adapter code in one place to keep the core code clean -> improved code modularization and maintainability

Where to place citation plug-ins?

  • We'll have to create a new citation plugin category if we need additional hooks or have many plug-ins.
  • Otherwise we prefer to use existing categories.

Citation Service

  • A Citation represents a raw or parsed citation. The Citation can be in one of four states: raw, parsed, revised, confirmed (=looked-up). It implements the value object pattern. This class will be part of WAL.
  • A CitationDAO class will interface with the database to persist the Citation class. It implements the DAO pattern. This class will be part of WAL.
  • A CitationManager helper class will provide a simple interface (parse()/lookup()) to citation services. It implements the service façade pattern. This class will be part of WAL.
  • We'll implement several CitationParser and CitationLookup strategies. Citation parsers and lookup services isolate additional installation or configuration requirements. They implement the strategy pattern so that the CitationManager can use them transparently. These classes will be part of WAL.
  • Both parsers and lookup services will be injected into the CitationManager by way of core configuration or plug-ins (see the discussion of the role of plug-ins above). Plug-ins are part of the individual applications (OJS, OCS, OMP, Harvester).
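
How the pieces listed above might fit together, as a rough sketch. Apart from the class names Citation and CitationManager and the parse()/lookup() methods, everything here is an assumption: the state constants, the addParser()/addLookup() injection methods and the two toy strategies are invented for illustration. The sketch relies on PHP 5+ object handles; a PHP4 version would have to pass the citation by reference.

```php
<?php
// Hypothetical sketch, not existing WAL code.
define('CITATION_RAW', 1);
define('CITATION_PARSED', 2);
define('CITATION_REVISED', 3);   // set after manual editing (not shown here)
define('CITATION_CONFIRMED', 4); // = looked-up

// Value object holding the raw text, the parsed fields and the state.
class Citation {
    var $rawText = '';
    var $fields = array();
    var $state = CITATION_RAW;
}

// Toy parser strategy: only recognizes a year in parentheses.
class YearOnlyParser {
    function parse($rawText) {
        return preg_match('/\((\d{4})\)/', $rawText, $m)
            ? array('year' => $m[1]) : array();
    }
}

// Toy lookup strategy: pretends an external service confirmed the data.
class StaticLookup {
    function lookup($fields) {
        return isset($fields['year'])
            ? array('title' => 'Known title') : array();
    }
}

// Façade: callers only see parse() and lookup().
class CitationManager {
    var $parsers = array();
    var $lookups = array();

    function addParser($parser) { $this->parsers[] = $parser; }
    function addLookup($lookup) { $this->lookups[] = $lookup; }

    // Tries the injected parsers in order; the first result wins.
    function parse($citation) {
        foreach ($this->parsers as $parser) {
            $fields = $parser->parse($citation->rawText);
            if (!empty($fields)) {
                $citation->fields = $fields;
                $citation->state = CITATION_PARSED;
                return true;
            }
        }
        return false;
    }

    // Only parsed citations can be looked up.
    function lookup($citation) {
        if ($citation->state < CITATION_PARSED) return false;
        foreach ($this->lookups as $lookup) {
            $fields = $lookup->lookup($citation->fields);
            if (!empty($fields)) {
                $citation->fields = array_merge($citation->fields, $fields);
                $citation->state = CITATION_CONFIRMED;
                return true;
            }
        }
        return false;
    }
}

$manager = new CitationManager();
$manager->addParser(new YearOnlyParser());
$manager->addLookup(new StaticLookup());

$citation = new Citation();
$citation->rawText = 'Smith, J. (2009). Some title.';
$manager->parse($citation);
$manager->lookup($citation);
```

A plug-in or the core configuration would be the place that calls addParser() and addLookup(), so the manager itself never needs to know which services are installed.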

Citation/Metadata Entities (OO analysis)

  • Citations have semantic overlap with Submissions (PKP), Articles (OJS), Papers (OCS), Monographs (OMP) and Record/Schema (Harvester).
  • The overlap consists in all of these entities having a similar set of bibliographic meta-data (e.g. author, title, publication year)
  • Article, Paper and Monograph already inherit from a common base class (Submission). Unfortunately the semantic concept of a submission is quite different from that of bibliographic meta-data (=Citation).
  • Apart from that, most bibliographical meta-data accessors (e.g. author, title) are not shared among the different submission types but rather re-implemented for each of them, following only a common nomenclature (getPaperTitle() vs. getArticleTitle(), etc.).
  • It would be a major re-factoring to extract common meta-data from all cited entities and gather them in a common Metadata class that all named entities, including Citation, could use or inherit from.
  • All this makes it difficult to encapsulate Metadata in a shared class and re-use this class in the named entities, including Citation.
  • Luckily none of our real-world use cases (see Features above) forces us to share/convert data between existing entities and the Citation entity. So in fact implementing the Citation class apart from the other entities is more a theoretical than a practical problem. However, implementing Citations separately from the other named entities considerably reduces the system's future flexibility and potential for re-use, and it clearly breaches accepted OO best practices.
  • Should the need arise to share bibliographic meta-data between entities in the future, we would have to use the Proxy or Adapter patterns to do so. Conversion services/strategies could be implemented that extract/inject meta-data from/to all named entities. This is a little awkward but IMO the best option in practice. As we have no use case for this, it's all just thinking about the risks we assume in the worst case.
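
To make the Adapter fallback concrete, here is a rough sketch. The Article and Paper classes are minimal stand-ins that only carry the getArticleTitle()/getPaperTitle() accessors mentioned above; the adapter class names and the formatTitle() function are hypothetical. It uses PHP 5+ constructors for brevity.

```php
<?php
// Stand-ins for the real entities; only the accessors named above are modelled.
class Article { function getArticleTitle() { return 'An article'; } }
class Paper   { function getPaperTitle()   { return 'A paper'; } }

// Hypothetical adapters: each one exposes the same getTitle() accessor
// on top of an entity-specific one.
class ArticleMetadataAdapter {
    var $article;
    function __construct($article) { $this->article = $article; }
    function getTitle() { return $this->article->getArticleTitle(); }
}

class PaperMetadataAdapter {
    var $paper;
    function __construct($paper) { $this->paper = $paper; }
    function getTitle() { return $this->paper->getPaperTitle(); }
}

// Client code depends on getTitle() only, not on the entity type.
function formatTitle($provider) {
    return 'Title: ' . $provider->getTitle();
}

$fromArticle = formatTitle(new ArticleMetadataAdapter(new Article()));
$fromPaper   = formatTitle(new PaperMetadataAdapter(new Paper()));
```

The cost is one adapter per entity type, which is the awkwardness mentioned above; the benefit is that the existing entity classes stay untouched.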

Implementation of the Citation/Metadata entities

We'll implement the Citation class with the above analysis in mind.

We imagine a ResourceMetadataProvider interface which we won't really implement (to maintain PHP4 compatibility) but enforce by convention:

interface ResourceMetadataProvider {
    function getAuthor();
    function getLocalizedAuthor();
    function getTitle();
    ...
}

A common abstract base class to "emulate" the interface is not an option in this case as we need to reserve inheritance for more central concerns of the classes (like Submission). So we just have to enforce the interface by convention.

Citation, Article, Monograph and Paper all "implement" ResourceMetadataProvider:

$articleAuthor = $article->getAuthor();
$monographAuthor = $monograph->getAuthor();
$paperAuthor = $paper->getAuthor();
$citationAuthor = $citation->getAuthor();
...

This keeps our code simple, concise and intuitive. And we'll be able to define a class (PHP5):

class ResourceMetadata implements ResourceMetadataProvider {
    ...
}

This way we can do something like (PHP5):

$metadata =& (ResourceMetadataHolder)$article;
...

In PHP4 we can implement type casting by way of constructors:

function ResourceMetadata($resourceMetadataProvider) {
    $this->_author = $resourceMetadataProvider->getAuthor();
    $this->_localizedAuthor =
            $resourceMetadataProvider->getLocalizedAuthor();
    ...
}

$metadata = new ResourceMetadata($article);
$schemas =& $metadata->getSupportedSchemas();
if (in_array(SCHEMA_OAI, $schemas)) {
  $oai =& $metadata->getRecord(SCHEMA_OAI);
}

etc., etc.

This way we get a semantically correct but still rather flexible and intuitive link between Article, Monograph, Citation and Paper on one side and Schema/Record on the other.

Metadata schemes

One important question to answer will be which metadata schemes should be supported and how we represent them in the ResourceMetadataProvider interface. In other words: What is the minimal set of attributes/operations that the ResourceMetadataProvider should prescribe?

MJ, 2009-10-14: "In L8X, we use a sort of normalized mapping of the basic OpenURL 0.1 KEV [Key/Encoded-Value] format, but there is some crosswalk between the OpenURL 1.0 book/journal KEV formats as well as a little bit of DC [Dublin Core]. OJS and OCS also have their own variations on DC for article/paper metadata (can't say about OMP) - so we need to think about how this will be best represented in the metadata model."

My proposal is to implement the superset of all standards that we want to support.

Next Steps

  • implement citation service back-end
  • specify AJAX request architecture
  • specify GUI fragments/AJAX components
  • get specification approval from Alec, Brian, ...

Further Ideas (Attic)

  • Don't kill L8X as a standalone application, integrate it with PKP WAL
  • Package/brand/SEOize parser/lookup library separately for re-use in other document based OSS applications (ECM)
  • Let users configure "content types" (document types) to improve parsing and reduce manual work for batches of similar documents
    1. extract styles from sample document
    2. extract sections from sample document
    3. let user attribute semantic information to styles and sections (e.g. first section = always contains author information)
    4. parse document (metadata, citations, structure) batch based on these specific user definitions
  • Integrate more citation lookup services: CrossRef, OAIster, CiteSeer, Amazon, PubMed, ISBNdb, WorldCat, LibraryThing, OpenLibrary, SRU/SRW, Z39.50, generic OAI-DC; parse: Freecite, Parscit, Paracite, Regexp from L8X
  • Additional XML schemas for export
  • Additional file conversion based on plugins: XSLT, ICE, GD, ImageMagick, etc.
  • Improve support for metadata schemas
  • Use machine-learning approaches (data mining technology) to improve parser robustness (citations, document structure, metadata)
  • Introduce a "batch processing mode" for citation parsing/lookup
    • keep the application responsive while citation parsing is going on in the background
    • do citation parsing/lookup during off-hours (e.g. every night)