PKP Hannover Sprint Notes Released: Metadata

By PKP Hannover Sprint Working Group "Metadata" / PKP Communications

The sprint notes from the PKP Hannover Sprint, hosted by the Leibniz Information Centre for Science and Technology in September 2023, are now available.


[Image: three participants of the Metadata working group at the PKP Hannover Sprint 2023.]

Sprints involve PKP community members joining diverse groups to work on PKP software and support. The Leibniz Information Centre for Science and Technology (TIB) hosted six working groups at the PKP Hannover Sprint in September. This is a summary of the Metadata working group.

Group members 

  • Mike Nason, PKP & UNB
  • Xenia van Edig, TIB
  • Erik Hanson, PKP
  • Patrick Vale, Crossref
  • Stephan Henn, UCL Cologne
  • Marco Tullney, TIB
  • Oliver Colberg, UCL Cologne

Background

The group identified that metadata pushed out of OJS is incorrect, incomplete, or otherwise compromised in a variety of ways.

Summary of Issues | Brainstorming

Approaches to sorting out or preventing false/erroneous metadata from being pushed to Crossref. This is already on the CRAFT-OA to-do list. https://github.com/withanage/publicationValidator

Maybe worth looking at:

Shi, J., Nason, M., Tullney, M., & Alperin, J. P. (2023, August 17). Identifying Metadata Quality Issues Across Cultures. https://doi.org/10.31235/osf.io/6fykh 

  • Language detection test (a rough sketch follows this list)
    • Facebook, for example, had an API for detecting language in forms.
      • “This work is supposed to be metadata in English, but it seems like this is French”.
      • What kind of metadata would benefit from a test before submission?
      • What subgroup of these might have a method for automatic checking?
        • “Usually this sort of metadata looks like x… but in this case, you have entered y”.
      • At the moment, the Crossref plugin is only looking for missing metadata.
      • In some cases, metadata is inconsistent in the same issue.
        • Capitalization in metadata fields such as author name, affiliation, and title, is probably more difficult to check but still problematic.
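
To make the language check above concrete, here is a rough sketch in Python using the langdetect package to compare the detected language of each metadata value against the locale it was entered under. The record layout and field names are assumptions for illustration; this is not an existing OJS plugin or API.

    # Sketch: flag metadata values whose detected language does not match the
    # locale they were entered under. Requires the langdetect package
    # (pip install langdetect); the record layout here is hypothetical.
    from langdetect import detect

    def check_field_languages(metadata_by_locale):
        """metadata_by_locale: {"en": {"title": "...", "abstract": "..."}, ...}"""
        warnings = []
        for locale, fields in metadata_by_locale.items():
            expected = locale.split("_")[0]             # "en_US" -> "en"
            for field, value in fields.items():
                if not value or len(value.split()) < 3:
                    continue                            # too short to detect reliably
                try:
                    detected = detect(value)
                except Exception:                       # detection can fail on odd input
                    continue
                if detected != expected:
                    warnings.append(f"{field} ({locale}): expected {expected}, "
                                    f"but the text looks like {detected}")
        return warnings

    print(check_field_languages(
        {"en": {"title": "Ceci est un titre rédigé en français", "abstract": ""}}))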

What sorts of metadata would benefit from metadata checks?

  • Valid XML coming out of the system.
  • Attached galleys
  • Language (checking that metadata exists for all languages, and that the language of the text matches the locale of the field).
  • Affiliation/ROR ID
    • Does affiliation exist in general?
    • Adding departmental affiliation is possible via the institution_department in the Crossref API. 
  • Encoding issues (e.g., from pasting out of Word or PDF files).
    • These exist everywhere, including in references…
  • Single-number page ranges?
  • Publication dates far in the future or implausibly far in the past (a sketch of such checks follows this list).
  • Information entered differently from the other articles (capitalization of author names, affiliations, titles, etc.) in otherwise homogeneous metadata.
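
Several of the items above (page ranges, implausible publication dates, all-caps fields) lend themselves to simple rule-based checks. The sketch below illustrates what such rules might look like; the article structure and the thresholds are assumptions for the example, not an existing OJS API.

    # Sketch of rule-based sanity checks for a few of the items listed above.
    # The article dict and the thresholds are hypothetical.
    import re
    from datetime import date

    def sanity_check(article):
        issues = []
        pages = article.get("pages", "")
        if pages and not re.fullmatch(r"\d+(-\d+)?", pages):
            issues.append(f"Unusual page range: {pages!r}")
        pub = article.get("publication_date")           # a datetime.date or None
        if pub and (pub.year > date.today().year + 1 or pub.year < 1800):
            issues.append(f"Implausible publication date: {pub}")
        title = article.get("title", "")
        if title and title.isupper():
            issues.append("Title is in ALL CAPS")
        return issues

    print(sanity_check({"pages": "12--19",
                        "publication_date": date(2199, 1, 1),
                        "title": "AN EXAMPLE TITLE"}))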

When might we check?

  • Submission stage
    • Is there a way to tie this into the citation plugin? Rendering the citation at submission would show people what it will look like.
  • When scheduled before publication.
  • On issue publication?
  • Any time you push the work to the next stage of the workflow?
    • If the user sees this too many times, they might ignore it. 

Maybe there should be a report as part of the publication workflow: a pre-flight check showing all the metadata for the works about to be published. Ideally, this workflow would expose a hook that plugins (for Plan S or Crossref, for example) could tie into.
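
One way to let downstream services hook into such a report is a simple check registry: the core pre-flight collects warnings from every registered check, and a Plan S or Crossref plugin contributes its own checks. The sketch below illustrates that pattern with hypothetical names; it is not OJS's actual hook system.

    # Sketch of a check registry that downstream plugins could contribute to.
    # Function names and the article structure are hypothetical.
    PREFLIGHT_CHECKS = []

    def register_check(func):
        """Decorator a core module or a plugin would use to add a check."""
        PREFLIGHT_CHECKS.append(func)
        return func

    @register_check
    def has_title(article):
        return [] if article.get("title") else ["Missing title"]

    @register_check
    def crossref_abstract(article):              # hypothetical downstream rule
        return [] if article.get("abstract") else ["Crossref deposit: abstract missing"]

    def preflight_report(articles):
        return {a["id"]: [msg for check in PREFLIGHT_CHECKS for msg in check(a)]
                for a in articles}

    print(preflight_report([{"id": 1, "title": "A title", "abstract": ""}]))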

When are things supposed to happen?

  • As a principle, we aim for a tool that enables checks (and confirmation by an author/editor) of metadata that is often wrong, at different stages of the submission and publishing process.
  • Can we have a pre-flight running at different stages, for example, whenever someone hands something over to another person/stage?
    • Seeing this too often could annoy the user, but submission and pre-publication are essential. 
  • Maybe the stages are so different that we cannot use the same processes/names (a hypothetical stage-to-check mapping follows this list), e.g.
    • During submission: extend the submission overview in OJS 3.4/3.5.
    • During review/editing: new pre-flight.
    • Proof sign-off: make sure all the necessary information is available.
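
To make the "different checks at different stages" idea concrete, a configuration could simply map each stage to the set of checks run there. The stage and check names below are hypothetical placeholders that follow the list above.

    # Hypothetical mapping of workflow stages to the checks run at each stage.
    STAGE_CHECKS = {
        "submission":     ["has_title", "author_affiliations", "field_language"],
        "review_editing": ["has_title", "author_affiliations", "field_language",
                           "page_range", "publication_date"],
        "proof_signoff":  ["all"],   # everything must pass before publication
    }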

Users do not understand the “preferred name” field. We need a better explanation, pre-population, or removal of that field. (And if at any point we see the light and adopt a single string for names, it could be avoided.) 

Goals

Given that we couldn't possibly solve this problem or build these solutions within the given time frame, we are scoping the work for future development.

  • Proposing a pre-flight check function or plugin for OJS. 
  • Discovering how many of these metadata issues are solvable.
    • How might we check?
    • What is possible?
  • Creating an issue description to take forward to PKP for plugin-related work. 

Results: Pre-flight Check Scoping Proposal

We know that OJS journals are pushing, in some cases, pretty poor metadata downstream. This conversation is often about “completeness”. It is almost too easy to publish scholarly works with less than the bare minimum of article metadata. This is a bit of a double-edged sword. Obviously, you don’t want to get in the way of users publishing how they want, but this also presumes that editors and authors understand what may be missing. 

We’re proposing a pre-flight check function that would enable an overview of submitted and/or pre-publication work to allow for user review. We think this makes sense in two stages of the workflow:

  1. Submissions
  2. Issue Publication

Submissions

In OJS 3.4, we provide a summary for submissions. It provides users with an overview of the content being submitted to OJS. This is close to what we intend to propose at the submission level. We also know that the PKP team is working on a publication validator as part of the CRAFT-OA project, which can look for or report empty fields. (https://github.com/withanage/publicationValidator)

A submission pre-flight would check against the following (a rough sketch of these checks follows the list):

  • Are affiliations provided for all authors?
  • Are author names “complete” for all authors?
  • Have references or abstracts been populated if required?
  • Is an ORCID provided, if required?
  • Is there a title? 
  • Is any of the metadata in ALL CAPS?
  • and, ideally…
    • Is metadata in a given field written in the same language as expected within the form?
    • Is an email address valid?
    • Is a ROR ID provided alongside affiliations?
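
To show how these submission-level checks might be expressed in code, the sketch below implements a handful of them. The submission structure is an assumption for illustration; the ORCID and ROR patterns follow the published formats of those identifiers, and none of this is taken from an existing OJS plugin.

    # Sketch of a few of the submission pre-flight checks listed above.
    # The submission dict is hypothetical; identifier patterns follow the
    # public ORCID and ROR formats.
    import re

    ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
    ROR_RE = re.compile(r"^https://ror\.org/0[a-hj-km-np-tv-z0-9]{6}\d{2}$")
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def submission_preflight(submission, orcid_required=False):
        warnings = []
        title = submission.get("title", "")
        if not title:
            warnings.append("Missing title")
        elif title.isupper():
            warnings.append("Title is in ALL CAPS")
        for author in submission.get("authors", []):
            given = author.get("given_name", "")
            family = author.get("family_name", "")
            name = f"{given} {family}".strip() or "this author"
            if not (given and family):
                warnings.append(f"Incomplete name for {name}")
            if not author.get("affiliation"):
                warnings.append(f"No affiliation for {name}")
            elif not author.get("ror_id"):
                warnings.append(f"Affiliation without a ROR ID for {name}")
            elif not ROR_RE.match(author["ror_id"]):
                warnings.append(f"ROR ID does not look valid for {name}")
            if orcid_required and not ORCID_RE.match(author.get("orcid", "")):
                warnings.append(f"Missing or malformed ORCID for {name}")
            if author.get("email") and not EMAIL_RE.match(author["email"]):
                warnings.append(f"Email address looks invalid for {name}")
        return warnings

    print(submission_preflight({
        "title": "A COMPLETELY CAPITALIZED TITLE",
        "authors": [{"given_name": "Ada", "family_name": "",
                     "affiliation": "TIB", "email": "ada@example"}],
    }, orcid_required=True))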

In general, we’re not proposing that users be forced to correct these issues but, instead, they be made aware of anything missing they may not have considered. 

In the case of a submission pre-flight check, we're already on our way here.

[Screenshot: the deposit/submission review in OJS 3, 2023.]

Issue Publication

The act of publishing an issue in OJS is actually pretty unceremonious. You click publish. The issue is live. Content is distributed worldwide; metadata is sent to ORCID, Crossref, and DataCite, and via OAI to any downstream locations, etc. There's an assumption that editors review their content in this space with due diligence before clicking the “publish” button. Or, maybe more importantly, an assumption that editors understand and take seriously the metadata Rube Goldberg machine they're setting in motion when they publish their issues. However, it is not easy to review all relevant metadata in OJS.

Problems with checking include: 

  • taking many clicks to review all submission metadata for every article in an issue.
  • no easy way to see if collected metadata is consistent across an issue (a sketch of such a consistency report follows this list)
    • for example, two or three authors haven’t included affiliations
    • some articles have incomplete pagination
    • some articles have references where others do not
    • some articles have multilingual metadata while others do not, and reviewing this is laborious or obtuse
  • it’s easier to spot a mistake than to know what might have been omitted. 
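
As a rough sketch of the consistency problem: for each optional field, list which articles in an issue provide it and which do not, so the gap is visible at a glance. The article structure is an assumption for illustration, and the message wording anticipates the proposed behaviour described further below.

    # Sketch of an issue-level consistency report for optional metadata fields.
    # The article dicts and field names are hypothetical.
    def consistency_report(issue_articles, fields=("affiliation", "references", "pages")):
        report = []
        for field in fields:
            have = [a["id"] for a in issue_articles if a.get(field)]
            lack = [a["id"] for a in issue_articles if not a.get(field)]
            if have and lack:
                report.append(
                    f"Articles {have} provide '{field}', but articles {lack} do not.")
        return report

    print(consistency_report([
        {"id": 101, "affiliation": "TIB", "references": ["..."], "pages": "1-10"},
        {"id": 102, "affiliation": "",    "references": [],      "pages": "11-20"},
    ]))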

A publication pre-flight check would allow editors to review the work they are about to publish. It could take inspiration from the submission summary in 3.4. It would review the following elements and identify some or all of the following for completeness and, eventually, accuracy:

  • Issue-level metadata
  • Article metadata
    • Everything from the submission pre-flight check
    • Related works identifiers
      • DOIs for deposited datasets?
      • DOIs for posted preprints? 
      • DOIs or records specific to peer review?
    • Pagination
    • Galleys
    • Sections
    • Article DOI Preview
      • More useful for, say, copy editors or layout editors than an opportunity to complain about the human-readability of a suffix.
  • CHECK FOR ALL CAPS
  • and, ideally…
    • Is metadata in a given field written in the same language as expected within the form?
    • or other programmatic/automatic format sanity checks 

Proposed Behaviour

The emphasis here isn’t just about what metadata is missing but which metadata has been recorded inconsistently. “We noticed that for these articles, you have [content x], but for these articles, you have [content y] or no content at all.”

We believe this would address general issues around completeness. Ideally, it would be possible to allow users to edit metadata from this space. And users could filter out any metadata fields they’re not specifically interested in checking.

In the future, additional functionality could be added based on downstream requirements specified by relevant organizations. For example: 

  • Crossref
  • DOAJ
  • Plan S
  • Coalition Publica

Certain intended downstream services may have specific requirements for metadata that aren’t required for publication but are required for those journals to fulfill their responsibilities or agreements. These could be plugins created or maintained by those partner organizations. 

Next Steps 

  • Implement. 
  • This would go to GitHub under “Discussions” rather than as an issue, at least initially; the discussion could be converted to an issue after the dev team has had a chance to talk about it. Discussions are a better place for this sort of proposal.

The group also identified other relevant ideas for future work outside of the Sprint.