- 1 Priorities (descending)
- 2 Proposed Features / Integration Points (by descending priority)
- 3 Proposed Architecture (citation integration only)
- 4 Installation/Infrastructure Requirements and Compatibility
- 5 GUI specification for Feature #1 (citation support in submission process)
- 6 Back-End Class Design
- 7 Next Steps
- 8 Further Ideas (Attic)
Priorities (descending)
- Add L8X's citation parsing, lookup/correction and editing functionality to OJS
- Add L8X's metadata extraction to OJS
- Add document parsing/editing capability to OJS
- Add document conversion capability to OJS
Proposed Features / Integration Points (by descending priority)
- automatic citation lookup and editing in submission process ("citation box")
- addition of citation data in XML export (e.g. for PubMed, Synergies, and CrossRef)
- allow readers to view citations in multiple citation formats (including Zotero integration)
- generation of COinS (Context Object in Span) from citations
- add citation editing and lookup to editorial process
- automatic citation data extraction from ODT in submission process
- automatic metadata extraction from ODT in submission process
- add section parser / editor to editorial process (generate and edit full semantic XML structure) in OJS
- add automatic document conversion to submission process (*.*)->(*.odt) to allow automatic extraction for more formats
- provide citation support in reading tools (context sensors that use citation data to provide additional information in RT sidebar)
Proposed Architecture (citation integration only)
Citation Backend Services Library
- Move L8X citation parser components to pkp/classes/citation/CitationParser*.inc.php
- Move L8X citation lookup components to pkp/classes/citation/CitationLookup*.inc.php
- Specific implementations extend a base object that enforces the API contract (template pattern); interfaces are not available in PHP4
- Make sure that the API can be used by all PKP applications
- Use migrated code in L8X standalone
- Make sure that the components can be integrated/extended for metadata/section parsers/editors later
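To illustrate the template pattern mentioned above, here is a minimal sketch. It is written in Python purely for brevity (the real classes would be PHP4-compatible), and all names in it (`CitationParser`, `_parse_text`, `YearOnlyParser`) are hypothetical, not existing L8X or PKP code. The base class fixes the public API and the shared checks; subclasses only fill in one hook.

```python
import re


class CitationParser:
    """Hypothetical base class: fixes the public API, subclasses fill in one hook."""

    def parse(self, raw_citation):
        # Template method: shared validation happens here, once, for all parsers.
        text = raw_citation.strip()
        if not text:
            raise ValueError("empty citation")
        fields = self._parse_text(text)
        if not isinstance(fields, dict):
            raise TypeError("_parse_text() must return a dict of fields")
        return fields

    def _parse_text(self, text):
        # Without interfaces, the contract is enforced by failing loudly here.
        raise NotImplementedError


class YearOnlyParser(CitationParser):
    """Toy implementation that only extracts a four-digit year."""

    def _parse_text(self, text):
        match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", text)
        return {"year": match.group(1)} if match else {}
```

A caller only ever uses `parse()`, so every implementation behaves the same way with respect to empty input and return type.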
Citation DAO Library
- We can use the usual PKP DAO pattern for all citation data persistence requirements
Citation GUI Pages
- We might need an extra step in the submission process and an extra page in the editorial process for citation editing/lookup
- Pages have to be application specific so we cannot usually share them between applications.
- We'll nevertheless try to move as much as possible into the GUI components library for re-use, so that each page only consists of a very high-level outline that assembles components from that library
- Apart from citation editing I don't think we'll invent new pages, it's more about integrating new components into existing pages.
Citation GUI Components Library
L8X's editing capability is a lot more demanding than anything I have seen so far in OCS/OJS.
My bet is that 90% of the migration effort will go into the GUI migration (MJ, can you comment, please?). We have to port from script.aculo.us to jQuery and from CakePHP's MVC implementation to PKP's (including Smarty). To achieve re-use between PKP applications and between pages, we'll have to "componentize" the GUI more than is currently the case in L8X.
These are my ideas for the GUI architecture:
- Create L8X-specific GUI components and template fragments in WAL (e.g. citation editor component, re-use in all PKP applications)
- Create an L8X citation renderer template library
- One Smarty template per citation style, including COinS
- Migrate COinS plugin to use COinS template fragment
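As an illustration of what the COinS template fragment has to produce, here is a small sketch in Python (the real fragment would be a Smarty template). The function name `render_coins` and the input field names are assumptions made for this example; the `Z3988` class and the `ctx_ver`, `rft_val_fmt` and `rft.*` keys come from the COinS / OpenURL KEV convention for journal articles.

```python
from html import escape
from urllib.parse import urlencode


def render_coins(fields):
    """Render a journal-article citation as a COinS span.

    `fields` maps illustrative names (title, author, journal, year) to strings.
    """
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    key_map = [
        ("title", "rft.atitle"),
        ("author", "rft.au"),
        ("journal", "rft.jtitle"),
        ("year", "rft.date"),
    ]
    for name, key in key_map:
        if fields.get(name):
            pairs.append((key, fields[name]))
    # URL-encode the key/value pairs, then HTML-escape the result for the attribute.
    title = escape(urlencode(pairs), quote=True)
    return '<span class="Z3988" title="%s"></span>' % title
```

Missing fields are simply left out of the span, so the same fragment can serve raw and fully looked-up citations.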
AJAX Request Architecture
- We should probably think about an AJAX-specific, high-performance MVC controller architecture. This means implementing shortcuts in the request processing for AJAX requests wherever possible (performance bottleneck!)
- Both the AJAX handler and the page handler will be based on the same base classes but will extend them differently
- Make sure that AJAX requests cannot bypass security (maintain the single point of entry and the common security infrastructure for all types of request)
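One possible shape for this, sketched in Python for illustration only: a shared base class owns the single entry point and the security check, and the two handler types differ only in how they render. All class and method names here are hypothetical and do not refer to existing PKP handler code.

```python
class BaseHandler:
    """Shared entry point: every request type passes the same security check."""

    def handle(self, request):
        if not self.authorize(request):
            return {"status": 403, "body": ""}
        return {"status": 200, "body": self.render(self.list_citations(request))}

    def authorize(self, request):
        # Placeholder check; the real one would use the common security infrastructure.
        return bool(request.get("user"))

    def list_citations(self, request):
        # Placeholder for the actual operation shared by both handler types.
        return "<ul><li>citation 1</li></ul>"

    def render(self, fragment):
        raise NotImplementedError


class PageHandler(BaseHandler):
    def render(self, fragment):
        # Full page request: wrap the fragment in the complete page layout.
        return "<html><body>" + fragment + "</body></html>"


class AjaxHandler(BaseHandler):
    def render(self, fragment):
        # AJAX shortcut: skip the page layout and return the fragment only.
        return fragment
```

Because `authorize()` lives in `handle()`, the AJAX shortcut cannot skip it; only the rendering step is cheaper.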
Installation/Infrastructure Requirements and Compatibility
- No new initial installation requirements
- Maintain PHP4 compatibility for initial installation
- New installation requirements (additional software, PHP>4) only for optional plug-ins
- Thorough documentation of additional installation / runtime environment requirements
- Make sure that L8X functionality will be easily portable to other PKP products (OMP, OCS, Harvester)
- Make sure that L8X standalone will continue working/improving by cleanly backporting/integrating migrated code to L8X (DRY!)
- Use standard UI technology to make sure that backport of new OMP GUI will be easier
- Comments have to follow Doxygen syntax
- See e.g. http://pkp.sfu.ca/cvs/cvsweb.cgi/ojs2/classes/article/Article.inc.php?rev=1.48;content-type=text%2Fplain for a standard code header
- Functions should at least include a general description as well as @param and @return tags as necessary.
- All contributions will be fully unit-test covered
- All workflows will be fully web-test covered
GUI specification for Feature #1 (citation support in submission process)
- must-have: copy & paste
We can use the existing text field in the submission process for "bulk citation insert":
- enter citations (text-only) -> disable TinyMCE-plugin for citation field
- "parse" button will split up citations (one per line?) and send them to the configured parser services
- new: parsing should be non-blocking if possible - alternatively: a progress bar should appear
- citation editor appears as soon as citations have been recognized
- optional: automatic extraction
Use document parser to extract citations:
- recognize .odt file type and try to extract citations
- show citations in citation editor if citations have been found
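The "one citation per line" splitting behind the "parse" button could look roughly like the following sketch (Python for illustration; the function name `split_citations` is made up, and whether one line per citation is the right rule is still the open question noted above).

```python
def split_citations(bulk_text):
    """Split pasted text into citations, assuming one citation per line."""
    citations = []
    for line in bulk_text.splitlines():
        # Collapse runs of whitespace and drop blank lines between citations.
        cleaned = " ".join(line.split())
        if cleaned:
            citations.append(cleaned)
    return citations
```

Each resulting string would then be sent to the configured parser services individually, which also makes it easy to report progress per citation.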
I think the current citation editor GUI is already very good. It has the following functionality:
- open/close citation details (current bug: opening details for one citation should close all other citations)
- save citation details
- parse/lookup citation
- text field for editing the unparsed text
- moving citations up and down
- new: allow users to move a citation anywhere in the editor
- remove citation
- add citation
- enable/disable L8X citation parsing
- enable/disable automatic citation extraction
- select/configure parsing services
- select/configure lookup services
- citation insertion
- use existing text-area for input in submission step 2 (metadata)
- "parse"-button triggers AJAX request that will insert the citation editor on the same page
- open: non-blocking AJAX-request / progress-bar
- citation extraction
- use full-page-request on file upload
- if citations have been found then display a check-box (default: on) to enable citation extraction
- citation editor will automatically appear in step two (metadata) with the extracted citations
- citation editor
- port existing GUI to jQuery
- "edit" triggers an AJAX request that inserts the citation field editor
- "edit" closes other open field editor (if any)
- "edit" for an open citation closes it
- implement dirty-pattern to avoid losing user-data on editor close
- "save details" and "save citation text" will become one single button ("save")
- "save" triggers an AJAX request that persists citation text and citation fields to the database
- "parse citation" and "lookup citation" will become one single button ("lookup")
- "lookup" triggers an AJAX request for parsing and lookup that inserts lookup data into fields and provides the user with feedback for the parsing/lookup score
- unparsed citation is implemented as text area
- citation fields are implemented as input fields
- "move up" and "move down" trigger AJAX requests that update the GUI accordingly (this is different from current implementation which triggers a full-page request that is not really usable)
- "insert before" is a drop-down field that shows all citations by number, it has an entry "at the end..."
- "remove citation" triggers an alert "do you really want to remove ...citation title...?" - if confirmed, an AJAX request will be triggered that persists the removal
- "add citation" triggers an AJAX request that inserts a new citation into the GUI and opens the citation editor with empty fields
- make sure that GUI conforms to PKP's standard design re-using existing CSS wherever possible
- check-box in setup - step4: enable/disable L8X citation parsing (if jQuery support is enabled then this will trigger the other options to appear)
- check-box in setup - step4: enable/disable automatic citation extraction (available only if L8X parsing is enabled)
- select/configure parsing/lookup services: use the existing GUI elements from L8X (no AJAX required, if jQuery support is enabled then dependent sub-options will only appear when the main service is enabled)
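The "insert before" drop-down described above boils down to a list move with an off-by-one trap. Here is a sketch of the server-side reordering in Python (illustration only; `move_citation` is a hypothetical name, positions are zero-based, and `None` stands for the "at the end..." entry).

```python
def move_citation(citations, index, before=None):
    """Return a new list with citations[index] moved in front of position `before`.

    Both positions refer to the list as the user currently sees it.
    `before=None` corresponds to the "at the end..." entry.
    """
    if not 0 <= index < len(citations):
        raise IndexError("no citation at position %d" % index)
    items = list(citations)
    moved = items.pop(index)
    if before is None:
        items.append(moved)
        return items
    if not 0 <= before < len(citations):
        raise IndexError("no citation at position %d" % before)
    # Removing the moved citation shifts every later position left by one.
    target = before - 1 if before > index else before
    items.insert(target, moved)
    return items
```

For example, `move_citation(["a", "b", "c", "d"], 0, 2)` places "a" directly in front of "c". The "move up" and "move down" buttons are special cases of the same operation.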
Back-End Class Design
The Role of Plug-ins
- Plug-ins use application specific hooks and therefore cannot be shared between PKP applications.
- We'll offload as much functionality as possible to WAL and use plug-ins as thin wrappers around it.
- Plug-ins are important where we want to isolate additional installation requirements.
- We'll create a new citation plugin category that contains all lookup and parsing services
- A Citation represents a raw or parsed citation. The Citation can be in one of four states: raw, parsed, revised, confirmed (=looked-up). It implements the value object pattern. This class will be part of WAL.
- A CitationDAO class will interface with the database to persist the Citation class. It implements the DAO pattern. This class will be part of WAL.
- A CitationManager helper class will provide a simple interface (parse()/lookup()) to citation services. It implements the service façade pattern. This class will be part of WAL.
- We'll implement several CitationParser and CitationLookup strategies. Citation parsers and lookup services isolate additional installation or configuration requirements. They implement the strategy pattern so that the CitationManager can use them transparently. These classes will be part of WAL.
- Both parsers and lookup services will be injected into the CitationManager by way of plug-ins. Plug-ins act as minimal adapters between the applications and the citation services. Plug-ins provide the configuration GUI for parser and lookup services. Plug-ins are part of the individual applications (OJS, OCS, OMP, Harvester).
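How these classes might fit together, as a rough Python sketch (the real classes would be PHP and live in WAL). The method names `add_parser()` and `add_lookup()` are assumptions standing in for whatever injection mechanism the plug-ins end up using, and strategies are modelled as plain callables to keep the example short. The "revised" state would be set by the citation editor and is not shown.

```python
RAW, PARSED, REVISED, CONFIRMED = "raw", "parsed", "revised", "confirmed"


class Citation:
    """Value object: raw text, parsed fields and the current state."""

    def __init__(self, raw_text):
        self.raw_text = raw_text
        self.fields = {}
        self.state = RAW


class CitationManager:
    """Facade: callers see parse()/lookup(), not the individual services."""

    def __init__(self):
        self._parsers = []
        self._lookups = []

    def add_parser(self, parser):
        # Called by a plug-in; `parser` takes raw text and returns a dict.
        self._parsers.append(parser)

    def add_lookup(self, lookup):
        # Called by a plug-in; `lookup` takes fields and returns a dict or None.
        self._lookups.append(lookup)

    def parse(self, citation):
        for parser in self._parsers:
            citation.fields.update(parser(citation.raw_text))
        citation.state = PARSED
        return citation

    def lookup(self, citation):
        # The first service that finds the citation confirms it.
        for lookup in self._lookups:
            result = lookup(citation.fields)
            if result:
                citation.fields.update(result)
                citation.state = CONFIRMED
                break
        return citation
```

The manager never needs to know which services are installed, which is what lets the plug-ins isolate the extra installation requirements.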
Citation Entity (OO analysis)
- Citations have semantic overlap with Submissions (PKP), Articles (OJS), Papers (OCS), Monographs (OMP) and Record/Schema (Harvester).
- The overlap consists in all of these entities having a similar set of bibliographic meta-data (e.g. author, title, publication year)
- Article, Paper and Monograph already inherit from a common base class (Submission). Unfortunately the semantic concept of a submission is quite different from that of bibliographic meta-data (=Citation).
- Apart from that, most bibliographic meta-data accessors (e.g. author, title) are not shared among the different submission types but re-implemented for each of them, following only a common nomenclature (getPaperTitle() vs. getArticleTitle(), etc.).
- It would be a major re-factoring to extract common meta-data from all cited entities and gather them in a common Metadata class that all named entities, including Citation, could use or inherit from.
- All this makes it difficult to encapsulate Metadata in a shared class and re-use this class in the named entities, including Citation.
- Luckily, none of our real-world use cases (see Features above) forces us to share or convert data between the existing entities and the Citation entity, so implementing the Citation class apart from the other entities is more a theoretical than a practical problem. It does come at a cost, however: a separate implementation considerably reduces the system's future flexibility and potential for re-use, and it clearly breaches accepted OO best practices.
- Should the need arise to share bibliographic meta-data between entities in the future, we would have to use the Proxy or Adapter pattern to do so. Conversion services/strategies could be implemented that extract/inject meta-data from/to all named entities. This is a little awkward but IMO the best option in practice. As we have no use case for this, it is only an assessment of the risk we accept in the worst case.
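To make the Adapter fallback concrete, here is a small Python sketch (illustration only). `Article` and `Paper` are stand-ins for the OJS and OCS entities with their differently named accessors; the adapter classes and `to_citation_fields()` are hypothetical.

```python
class Article:
    """Stand-in for the OJS entity with its own accessor name."""

    def __init__(self, title):
        self._title = title

    def get_article_title(self):
        return self._title


class Paper:
    """Stand-in for the OCS entity with a differently named accessor."""

    def __init__(self, title):
        self._title = title

    def get_paper_title(self):
        return self._title


class ArticleAdapter:
    def __init__(self, article):
        self._article = article

    def get_title(self):
        return self._article.get_article_title()


class PaperAdapter:
    def __init__(self, paper):
        self._paper = paper

    def get_title(self):
        return self._paper.get_paper_title()


def to_citation_fields(source):
    # Works with any adapter exposing the common accessor names.
    return {"title": source.get_title()}
```

One adapter per entity type is the awkward part; the benefit is that neither the existing entities nor the Citation class has to change.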
Next Steps
- implement citation service back-end
- co-ordinate with OMP-development
- specify AJAX request architecture
- specify GUI fragments/AJAX components
- get specification approval from Alec, Brian, ...
- start coding
@Alec: As the co-ordination with OMP development will take a little more time, I'll start by implementing the back-end classes. Fortunately their design is very straightforward and easily encapsulated, so I don't think there is much risk of implementing the wrong thing.
Further Ideas (Attic)
- Don't kill L8X as a standalone application, integrate it with PKP WAL
- Package/brand/SEOize parser/lookup library separately for re-use in other document based OSS applications (ECM)
- Let users configure "content types" (document types) to improve parsing and reduce manual work for batches of similar documents
- extract styles from sample document
- extract sections from sample document
- let user attribute semantic information to styles and sections (e.g. first section = always contains author information)
- parse document (metadata, citations, structure) batch based on these specific user definitions
- Integrate more citation lookup services: CrossRef, OAIster, CiteSeer, Amazon, PubMed, ISBNdb, WorldCat, LibraryThing, OpenLibrary, SRU/SRW, Z39.50, generic OAI-DC; and more parsers: FreeCite, ParsCit, ParaCite, regular expressions from L8X
- Additional XML schemas for export
- Additional file conversion based on plugins: XSLT, ICE, GD, ImageMagick, etc.
- Improve support for metadata schemas
- Use machine-learning approaches (data mining technology) to improve parser robustness (citations, document structure, metadata)
- Introduce a "batch processing mode" for citation parsing/lookup
- keep the application responsive while citation parsing is going on in the background
- do citation parsing/lookup during off-hours (e.g. every night)