COS goes FOSS
The sorry state of scientific publishing and how we could move to an open and resilient infrastructure
29 January 2026
Gáspár Jékely
Centre for Organismal Studies, Heidelberg University
The crisis of publishing: symptoms
- reproducibility crisis
- only a small fraction of primary data available
- even smaller fraction of code
- open access, where it exists, is very expensive and maintains the profits of legacy publishers
- antiquated, dysfunctional system that rewards prestige/hype over quality/integrity
- scholarly workflows rely on closed-source software not built for science (MS Office, Adobe, Prism etc.)
- sharing, integration, automation and collaboration are difficult (who can use Git?)
- final product of years of research: a pdf file (1990s tech) behind a paywall
- data, code and text not searchable, reusable, discoverable
Most source data collected by scientists are not available
Code is very often not shared or not shared stably
- a study assessed the effectiveness of Science's code-sharing policy
- random sample of 204 Science papers
- authors provided the underlying artifacts (data/code) for 44%
- the findings could be reproduced for only 26%
“The data files remains our property and are not deposited for free access.”
“When you approach a PI for the source codes and raw data, you better explain who you are, whom you work for, why you need the data and what you are going to do with it.”
“I have to say that this is a very unusual request without any explanation! Please ask your supervisor to send me an email with a detailed, and I mean detailed, explanation.”
“We do not typically share our internal data or code with people outside our collaboration.”
Flipped protein structures due to a buggy program
The structures of MsbA (purple) and Sav1866 (green) overlap little (left) until MsbA is inverted (right).
- a buggy, unpublished in-house program swapped two data columns, inverting the handedness of the electron-density map
- program was inherited from another lab
- mistake repeated in several papers
- led to five retractions (three in Science)
Gene name errors are widespread in the scientific literature
Most scientists use software developed for accounting
- Excel silently converts gene symbols such as SEPT1 or MARCH1 into dates
- to escape this, the symbol MARCH1 has now become MARCHF1
- SEPT1 has become SEPTIN1, and so on (a defensive sketch follows below)
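A minimal defensive sketch (hypothetical data; pandas shown as one common choice): reading gene tables with every column typed as a plain string prevents any spreadsheet-style reinterpretation of symbols as dates.

```python
import pandas as pd
from io import StringIO

# Hypothetical input that a spreadsheet would mangle: opened in Excel,
# "SEPT1" silently becomes the date "1-Sep".
csv = StringIO("gene,expression\nSEPT1,4.2\nMARCH1,1.7\n")

# dtype=str keeps every value a plain string, so no symbol is ever
# reinterpreted as a date or number.
genes = pd.read_csv(csv, dtype=str)
print(genes)
```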
Reporting and citation bias
- The cumulative impact of reporting and citation biases on the evidence base for antidepressants
- 50% of randomized controlled trials have never been published
- trials with statistically significant findings are more likely to be published
- citation bias -> studies with positive results receive more citations than negative studies
Majority of high-impact cancer studies fail to replicate
- The Reproducibility Project: Cancer Biology (RP:CB)
- replications could not be completed for 30 of 53 papers published by Science, Nature, and Cell from 2010 to 2012
- credibility of preclinical cancer biology?
- need for authors to share more details of their experiments
- vague protocols and uncooperative authors
- one-third of contacted authors declined or did not respond
An epidemic of retractions
- steep increase in retractions
- monitoring retractions: http://retractionwatch.com
- the majority of all retractions are due to misconduct
Perverse incentives, publish or perish
(Björn Brembs)
- chasing ‘stories’ and IF instead of integrity, hypothesis-testing, rigour, openness
- under the spell of glamour journals
- “If I get this result, this will be a Nature paper!”
- reporting bias (only positive results are reported)
- low statistical power (at typical power levels, a result at p = 0.05 replicates only ~50% of the time; see the sketch after this list)
- in worst cases data are fabricated
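A toy simulation of that 50% figure (all numbers hypothetical; scipy's standard two-sample t-test used for illustration): when studies run at roughly 50% power, a just-significant finding replicates in an identical follow-up study only about half of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n, effect, sims = 20, 0.65, 20_000   # n per group; d = 0.65 gives ~50% power
hits = replications = 0
for _ in range(sims):
    a, b = rng.normal(effect, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < 0.05:            # original study "works"
        hits += 1
        a2, b2 = rng.normal(effect, 1, n), rng.normal(0, 1, n)
        replications += stats.ttest_ind(a2, b2).pvalue < 0.05  # exact repeat
print(f"replication rate among significant results: {replications / hits:.2f}")
```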
Impact factor - not a metric of quality
- IF = citations in a given year to the articles a journal published in the previous two years (the numerator), divided by the number of citable articles it published in those two years (the denominator); written out after this list
- calculated by Clarivate (formerly Thomson Reuters)
- originally created to help librarians, not as a measure of quality
- yet, emerged as a pervasive metric of quality
- in some cases not calculated but negotiated (the denominator, e.g. Curr Biol)
- removing editorials/News-and-Views articles from the denominator (so called “front-matter”) can dramatically alter the resulting IF
- not reproducible, not open (calculated from proprietary data)
- a composite of multiple, highly diverse article types
- comparison of journals not mathematically sound
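Written out (the standard two-year definition described in the first bullet above):

```latex
\mathrm{JIF}_{y} =
  \frac{\text{citations received in year } y \text{ by items published in years } y-1 \text{ and } y-2}
       {\text{citable items published in years } y-1 \text{ and } y-2}
```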
IF - statistically flawed
- highly skewed distributions
- distorted by outliers (see Nature)
- comparing journal IFs means comparing the means of two populations
- that is only valid if the distributions are normal!
- simple ranking by the mean is incorrect
- the median would be better, or a rank-based test (e.g. Kruskal–Wallis; see the sketch after this list)
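A quick illustration with synthetic, heavily skewed (log-normal) citation counts: the means are dragged by the tail, the medians are stable, and scipy's Kruskal–Wallis test compares the two journals without any normality assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-article citation counts; real citation
# distributions are similarly heavy-tailed.
journal_a = rng.lognormal(mean=1.0, sigma=1.2, size=2000)
journal_b = rng.lognormal(mean=1.1, sigma=1.2, size=2000)

print("means:  ", journal_a.mean(), journal_b.mean())   # outlier-driven
print("medians:", np.median(journal_a), np.median(journal_b))

# A rank-based test makes no assumption about the distribution:
h, p = stats.kruskal(journal_a, journal_b)
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.3g}")
```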
IF - strongly biased by outliers
- fitting an exponential function to the citation distribution instead of taking the raw mean
- a journal impact factor can then be calculated from the parameters of the fit (sketch after this list)
- Science JIF = 25.3 instead of the reported 34
- Nature JIF = 26.8 instead of the reported 37
- a few highly cited papers have a substantial effect on the mean, but much less on the fitted exponential
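A sketch of the idea (synthetic data, not the published analysis): fit an exponential to the citation histogram and read off the mean it implies; a handful of mega-cited outliers inflates the raw mean but barely moves the fit.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
# Hypothetical journal: mostly modest citation counts plus a few
# extremely cited outliers.
citations = np.concatenate([
    rng.exponential(scale=8.0, size=5000),
    rng.uniform(500, 2000, size=10),          # the outliers
])

def expo(x, a, b):
    # simple exponential model of the citation histogram
    return a * np.exp(-b * x)

counts, edges = np.histogram(citations, bins=200)
centres = (edges[:-1] + edges[1:]) / 2
(a, b), _ = curve_fit(expo, centres, counts, p0=(counts.max(), 0.1))

print("raw mean (JIF-style):", citations.mean())   # inflated by outliers
print("mean implied by fit :", 1 / b)              # robust to the outliers
```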
JIF does not correlate with quality metrics (e.g. statistical power)
- no association between statistical power and journal IF
But: JIF correlates with retractions
- ‘journal rank’ is a strong predictor of the rate of retractions
- (also of Excel errors :))
Current system is hugely wasteful
Robert Maxwell in 1985. Photograph: Terry O’Neill/Hulton/Getty
- worldwide sales > USD 19 billion
- dominated by five large publishing houses: Elsevier, Wiley-Blackwell, Taylor & Francis, Springer Nature and SAGE
- Elsevier has a profit margin around 40 % (higher than Microsoft, Google and Coca Cola)
- about USD 6 billion per year goes to profits = 2 CERNs/year
- APCs can be as high as $12,000
Kleptistan (Binjistan) - I
- an oligopoly of legacy publishers
- Elsevierstan
- Wileystan
- Taylorfrancistan
- Springerstan
- …
Kleptistan (Binjistan) - II
- workflow monopoly
- tools to cover the entire academic workflow (e.g. Elsevier)
- high risk of vendor lock-in
- totalizing, homogenising workflows, extractive of research communities
Kleptistan (Binjistan) - III
or
How to milk the same cow multiple times?
- scientists provide the content for free
- scientists peer review for free
- scientists buy back the over-priced product via APCs, subscriptions or ‘transformative’ deals
- the publisher (now a ‘data analytics company’) sells entire workflows to scientists
- the publisher tracks scientists on its platforms
- and sells the data to their employers (e.g. for quality assessment) or to third parties
Unacceptable practices of data tracking by publishers
‘Data gathering is an essential process, and most companies use it for their success.’
- tracking site visits via authentication systems
- detailed real-time data on the information behaviour of individuals and institutions
- page visits, accesses, clicks, downloads, etc.
- assembly of granular profiles of academic behaviour
- without user consent
- selling the data: RELX, the parent company of Elsevier, deploys Pure at universities around the world
- to provide ‘insights’ into the entire research cycle
- RELX now also sells data to ICE…
The problem is the system
- journal publishing system is fundamentally broken
- a legacy system that prevents science from meeting its true potential for society
- about 40,000 journals
- public trust problem
- science publishing must be built anew
- illusion of truth and finality
- artificial scarcity
- narrow formats
- incomplete information
- prestige and journal-rank fallacies
What would a better system look like?
- data, code and text are shared, indexed, archived and discoverable
- analyses and workflows are also shared
- reagents (e.g. plasmids), strains (e.g. mutants) shared
- open-source software
- maximise reproducibility
- text + data + code = publication
- publications openly accessible, not behind a paywall
- affordable publishing (not hijacked by corporate for-profit publishers)
- preprints = publication, followed by post-publication peer review
One example - the UniProt database
- a comprehensive resource for protein sequence and annotation data
- entries uniquely identified by a stable URL
- rich metadata that is both human-readable and machine-readable (see the fetch sketch after this list)
- shared vocabularies and ontologies
- interlinking with more than 150 different databases
- © 2002 – 2026 UniProt consortium
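A minimal fetch sketch (accession P02649, human ApoE, chosen arbitrarily; field names as the current REST API returns them, to the best of my reading of its documentation):

```python
import json
import urllib.request

# Every UniProt entry lives at a stable URL and is also served as
# machine-readable JSON.
url = "https://rest.uniprot.org/uniprotkb/P02649.json"
with urllib.request.urlopen(url) as response:
    entry = json.load(response)

print(entry["primaryAccession"])
print(entry["proteinDescription"]["recommendedName"]["fullName"]["value"])
```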
A history of Open Access — The Budapest Open Access Initiative
- https://www.budapestopenaccessinitiative.org/
February 14, 2002
Budapest, Hungary
- removing barriers to literature
- free and unrestricted online availability = open access
- the costs of providing OA to literature are far lower than the costs of traditional publishing (printed press)
- opportunity to save money and expand the scope of dissemination
- recommendations: self-archiving (I.) and a new generation of open-access journals (II.)
The launch of PLoS
Paywalls and the story of Aaron Swartz
- in 2011, the 24-year-old internet hacktivist Aaron Swartz was arrested at MIT
- he downloaded several million articles from an online archive (JSTOR)
- legal troubles
- Swartz committed suicide in 2013
- the internet was created so that scientists could communicate their research results with each other
- billions of videos of cats for free, research results behind paywalls
- (GPT-4o showed an 82% recognition rate for paywalled content)
The politicians weigh in - Plan S
- Plan S was initiated in 2018 as a political solution to the OA problem
- funders mandate immediate open access
- proposal to cap APC (article processing charge)
- publishers whine that this would hurt their profit
- scrap the cap
- let the ‘market’ solve it (it didn’t)
- ‘prestige’ journals can charge as much as they like (Nature-$12,690; Cell-$11,400)
OA has been hijacked by publishers
- Diamond: OA journal without an APC
- Green: not openly accessible from the publisher website but a free copy is accessible via a repository
- Gold: OA journal with APC (profits can remain high!)
- Hybrid: some papers OA others not (profits can remain high!)
- Bronze: free to read, no identifiable licence
- ‘transformative agreements’
Towards some solutions…
What should we do now?
What should scientists and institutions do?
for the realist:
Safest bet: buy RELX stocks
(RELX: parent company of Elsevier)
We have the solution, but not the balls to implement it
- under ‘closed’ models, institutions spend a lot of money on publishing
- transitioning those funds to community-led diamond OA could fully fund a global shift to OA
- strengthen scholarly infrastructure for code, data, interoperability etc.
- huge potential for cost savings (Schimmer et al. 2015)
- publishing in the hands of public institutions
- USD 6 billion/year is a lot of money for that
- would also solve the problem of predatory publishers
What should scientists and institutions do?
for the idealist:
Taking back control
- Public institutions (universities, libraries, funders etc.) should take back control of the digital scholarly infrastructure
- create conditions of open competition for the private sector (not an oligopoly of a few publishers)
- control data, text, code, citation metrics, scholarly workflows, databases, standards etc.
- cancel all subscriptions and use money to fund databases, libraries, publishing etc.
- support initiatives like OpenAIRE
- build community and the commons -> publishing as community and care
New approaches to research assessment
- the San Francisco Declaration on Research Assessment (DORA)
- eliminate the use of journal-based metrics (IF) in funding, appointment, and promotion decisions
- assess research on its own merits rather than based on the journal
- capitalize on the opportunities provided by online publication (e.g. relax limits on the number of words, figures, and references)
Chasing False Metrics — the Prestige Game
Harold Varmus
“We need to get away from false metrics and return to the task of looking at our colleagues’ work closely.”
- we believe the most important work is published in so-called ‘high-impact’ journals
- ceding judgments to journal editors
- we have to end the current situation in which the fate of researchers and their trainees depends on publishing in certain journals
eLife
- funded by HHMI, the Wellcome Trust, the Max Planck Society and the Knut and Alice Wallenberg Foundation
- https://elifesciences.org/
- The eLife process has five steps:
- Submission or transfer of a preprint from bioRxiv
- Peer review (eLife editors - who are all active researchers - discuss new submissions and decide which will be peer reviewed)
- Publication of Reviewed Preprint
- Publication of revised version
- Publication of Version of Record
- papers published together with eLife Assessment
- eLife has no IF! (good!!)
Sharing code in an ideal world - federated GitLab servers
- institutions should host their own GitLab servers for code
- (GitLab is a database-backed web application running git)
- (git is a distributed version control system)
- servers should be federated
- European (-> world-wide) network of research/education institutions and libraries
- code shared upon publication in a permanent repo with DOI
Code with a persistent DOI
- Permanent repository for data, text and code (Zenodo; see the sketch after this list)
- integration with GitHub
- version control
- Safe — your research is stored safely for the future in CERN’s Data Centre for as long as CERN exists
- citeable
- usage statistics
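A hedged sketch against Zenodo's public REST API (the search term is arbitrary; response fields as I understand the API): every deposited record comes back with its citable DOI.

```python
import json
import urllib.request

# Query Zenodo's public records endpoint; each hit carries a DOI.
url = "https://zenodo.org/api/records?q=connectome&size=3"
with urllib.request.urlopen(url) as response:
    records = json.load(response)

for hit in records["hits"]["hits"]:
    print(hit["doi"], "-", hit["metadata"]["title"])
```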
What is the solution? - The Fediverse for Science
- a federated infrastructure
- run by public institutions (universities, libraries etc.)
- for communication (microblogging = Mastodon; see the WebFinger sketch after this list)
- for code (GitLab), data (e.g. OMERO), text (preprint servers) etc.
- taking back control of scholarly infrastructure
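One concrete open standard underneath such a federation: WebFinger (RFC 7033), which is how one Fediverse server locates an account hosted on another (the account queried below is Mastodon's own flagship account).

```python
import json
import urllib.request

# Ask mastodon.social where the account @Mastodon lives; any federated
# server can be queried the same way.
url = ("https://mastodon.social/.well-known/webfinger"
       "?resource=acct:Mastodon@mastodon.social")
with urllib.request.urlopen(url) as response:
    print(json.dumps(json.load(response), indent=2))
```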
Towards a new, federated scholarly infrastructure
- plan for a federated scholarly information network
- a system that cannot be taken over by corporations
- designed redundantly
- open standards
- “a decentralized, resilient, evolvable network that is interconnected by open standards and open-source norms under the governance of the scholarly community”
One example for publishing - Open Research Europe (ORE)
![]()
- open access publishing venue for EC-funded researchers
- no author or reader fees
- Diamond (but authors need to be EC funded)
- maintained by the European Commission
- Wellcome Open Research (https://wellcomeopenresearch.org/) is similar, maintained by the Wellcome Trust
- but too centralised and no community behind it
- open up access methods, results, publications, data, software, materials, tools and peer reviews
- standard tender process held regularly
- no lock-in with a single publisher
- regular procurement processes, no monopoly, fair prices
A European Infrastructure for Open Science
https://open-science-cloud.ec.europa.eu/
- Available Services:
- File Sync & Share
- Interactive Notebooks
- Large File Transfer
- Virtual Machines
- Cloud Container Platform
- Bulk Data Transfer
…still early days
While we wait…
- individual labs can change behaviour
- my lab has switched completely to preprints and OA-only, not-for-profit journals
- raise your voice in hiring/promotion committees for DORA principles
Further reading
Samuel A. Moore: Publishing Beyond the Market
US agricultural library forced to cancel journal subscriptions (except Elsevier etc.): https://www.science.org/content/article/doge-order-leads-journal-cancellations-u-s-agricultural-library
Dutch parliament calls for moving away from US cloud software: https://www.reuters.com/world/europe/dutch-parliament-calls-end-reliance-us-software-2025-03-18/