The Road to Preprints (Part 3): Metadata Matters
In the third and final post of our “Road to Preprints” series, we’re chatting with PKP Associate Director of Research Juan Pablo Alperin to learn more about the Preprint Uptake and Use Project, a joint research initiative between ASAPBio and the ScholCommLab that turned disappointing data into a metadata mission.
In 2019, ScholCommLab visiting scholars Mario Malički and Janina Sarol (under the supervision of PKP’s Juan Pablo Alperin) began analyzing preprint metadata to “better understand the status of preprint adoption and impact in specific research communities.” Mario and Janina looked at several preprint servers, including SHARE, OSF, bioRxiv, and arXiv. Their hope was to use data from these sources to answer questions such as “who publishes preprints?” and “how many preprints are published?” but the metadata they were mining turned out to be too unreliable to support those answers. Incomplete, incorrect, and inconsistent metadata (e.g., author, subject, date) was so pervasive that they couldn’t simply exclude problematic entries from their analysis.
Despite these challenges, the team persevered with their analysis, coming up with suggestions along the way for preprint systems to improve their metadata. To learn more, including what this research meant – and will mean – for Open Preprint Systems (OPS), we asked Juan to share more about their unexpected findings.
1. To start, can you tell us more about the ScholCommLab? What is its relationship with PKP? How does their work influence ours?
The Scholarly Communications Lab (ScholCommLab for short) is the research group that I co-direct with Dr. Stefanie Haustein at the University of Ottawa. The lab is made up of research associates, postdocs, and graduate and undergraduate research assistants, and it regularly hosts visiting scholars from around the world. Together, we study how knowledge is produced, disseminated, and used.
While all of the work done at the Lab is relevant for PKP and could be done under the auspices of PKP, not all of it has direct application to the core of PKP’s activities. That’s why a few years ago I felt it would be best to give my research its own home. I like to think of the Lab as a sister project of PKP. We share interests and an ethos, collaborate when possible, and always support each other.
2. How did Mario and Janina’s analysis impact and/or inform the design and development of the OPS platform?
When ASAPBio approached me about working on preprints, I knew it would be a perfect complement to the development of OPS. Mario and Janina’s work was originally intended to help us understand preprint uptake and use, but it became clear early on in their work that there were underlying issues with the metadata being collected by existing servers. They are continuing to work on understanding preprint uptake (stay tuned for a preprint of our own!), but the analysis we published in a series of blog posts and the latest recommendations were shared early on with the OPS developers to make sure that OPS captures and shares metadata in ways that can be useful downstream.
3. Knowing what you know now about preprint metadata, what can and should we be doing differently with preprint systems? How does OPS, as a system, support metadata best practice?
One of the great things about OPS being built using the shared PKP library is that it benefits from the 20 years of learning that PKP has put into OJS. This means that OPS is (in my biased opinion) already miles ahead of other preprint platforms when it comes to collecting and handling metadata. In this sense, my experience with OJS informed our recommendations for many of the fields. On the flip side, Mario and Janina learnt a lot about how versions should be tracked, and these lessons have now gone to the PKP team, who have started implementing them.
4. Do you have any advice for new preprint servers? What should they be paying attention to early on – and teaching their authors – before the submissions start rolling in?
The biggest lesson we learned from this project is that the metadata being collected cannot be an afterthought. In the rush to launch preprint servers and build the community, some providers have traded off high-quality, complete metadata for simplicity of design (both of the software and of the interfaces). However, I believe that OPS demonstrates that it is possible to keep interfaces simple for users while ensuring that the data collected is of high quality. For us to get the most benefit from preprints, they need to be fully integrated into the scholarly record. For that to happen, the data needs to be complete and accurate and, where possible, integrated with other databases and the rest of the literature (e.g., using unique identifiers and links to related resources).
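As a rough illustration of what checking for "complete and accurate" metadata might look like at submission time, here is a minimal Python sketch. It is not how OPS or any particular server validates records: the field names (title, authors, subject, date_posted, doi), the required-field list, and the loose DOI pattern are all assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical required fields; real preprint servers define their own schemas.
REQUIRED_FIELDS = ("title", "authors", "subject", "date_posted", "doi")

# Loose DOI shape check (assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def find_metadata_problems(record: dict) -> list[str]:
    """Return human-readable problems found in one preprint record."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing {field}")

    # Accuracy: the identifier should at least look like a DOI.
    doi = record.get("doi")
    if doi and not DOI_PATTERN.match(doi):
        problems.append("malformed doi")

    # Consistency: dates are expected in ISO 8601 form (YYYY-MM-DD).
    posted = record.get("date_posted")
    if posted:
        try:
            date.fromisoformat(posted)
        except (TypeError, ValueError):
            problems.append("invalid date_posted")

    return problems
```

A server could run a check like this before accepting a submission and prompt the author to fix whatever it reports, rather than leaving the cleanup to downstream researchers.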
5. Last, but not least, what’s next for the ScholCommLab and their work on preprints?
We are still working—albeit slowly—on a publication that analyzes the data we were able to collect and clean from the various services. We hope to be able to “disambiguate” authors (figure out which names correspond to the same individuals) so that we can produce co-authorship networks and better understand the communities that are contributing to preprints. Preprints present an incredible opportunity for the research community and at the ScholCommLab we will continue to do our best to shed light on how this practice is evolving and suggest ways that software providers and researchers can make the most of them.
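To give a sense of what author disambiguation and co-authorship counting involve, here is a deliberately naive Python sketch. It is not the ScholCommLab's method: the matching rule (accent-folded surname plus first initial) and the function names author_key and coauthor_edges are assumptions chosen for illustration.

```python
import unicodedata
from collections import Counter
from itertools import combinations


def author_key(name: str) -> str:
    """Build a crude matching key: accent-folded surname plus first initial."""
    # Decompose accented letters and drop the combining marks,
    # so "Malički" and "Malicki" compare equal.
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    parts = folded.lower().replace(".", " ").split()
    if not parts:
        return ""
    # Assumes "Given [Middle] Family" name order, which is not always true.
    return f"{parts[-1]}_{parts[0][0]}"


def coauthor_edges(preprints: list[list[str]]) -> Counter:
    """Count how often each pair of author keys shares a preprint."""
    edges = Counter()
    for authors in preprints:
        # De-duplicate and sort so each pair is counted once, in a stable order.
        keys = sorted({key for key in map(author_key, authors) if key})
        edges.update(combinations(keys, 2))
    return edges
```

A key this simple will merge different people who share a surname and initial, and it will miss authors who change names, which is part of why disambiguation works better when records carry unique identifiers.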