Wide Open Source Science Hackathon Keynote Speech

This blog post is a slightly adapted transcript of a keynote speech I gave at the Wide Open Source Science hackathon opening session on 26 October 2018. The hackathon was co-organised by IT Center for Science CSC, National Library of Finland, Helsinki Think Company, and University of Helsinki, my employer. The views expressed in the speech are my personal opinions.

Openness as the foundation of responsible science

Examples of successful applications and challenges of open science

You are all familiar with the thought experiment “If a tree falls in the forest and no one is around to hear, does it make a sound?”. 

Let’s try an updated version: “If a research result is disseminated only through a paywalled journal, with a subscription fee so high that even an Ivy League university struggles to pay it, with a long embargo on parallel archiving, and with the data as a non-machine-readable supplement, does that research make a sound?”.

Sure it does. It makes a dull thump when people hit that paywall.


It didn’t use to be that way. Scientific journals are the original open science. The first ones, the French Journal des sçavans and the English Philosophical Transactions of the Royal Society, were founded in 1665. Before journals, scientific discoveries were revealed in correspondence with colleagues or at society meetings.

To a large extent the research community is still living in the long 17th century, acting as if the internet and computational methods were just a minor improvement. Journals are mostly online, sure, but they are artificially keeping up the ink-and-paper-era scarcity through toll access, limiting the number of accepted articles, demanding a print-era layout, and blocking text mining with restrictive copyright and non-existent or bad APIs.

Just like in the 17th century, open science today is about updating practices to match existing technologies. Because in order to solve the grand challenges facing humanity, the grandest of which is climate change, we need to get from thump to bang.


And we need to get there, like, yesterday.

What could the 21st century solutions look like then?

Like arXiv, for example. It is an open online repository for preprints, boasting open access to 1,455,301 papers in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Annual fees from member institutions range from 1,000 to 4,000 USD. That’s the typical price range for a single article processing charge (APC) in a Gold Open Access journal.

ArXiv doesn’t provide peer review, which is often named as the most important service journals provide. But on closer inspection, it isn’t actually the journals that do the peer review; it’s researchers from whatever field of research is in question. That’s where the term ‘peer’ comes from. What the journals do is coordinate the reviewing process. Which is fine, and can be a useful service. But is that really a service that deserves a price tag in the thousands, considering that the journals get the labor of both authors and reviewers free of charge? From the research community’s point of view it’s a little like giving someone a gift and having them sell it back to you.


Overlay journals are one attempt to break up the marriage between publishers and peer review. They don’t produce content, but instead select texts that are already openly available and evaluate them.

Such journals are still marginal, but they have some notable champions, like Fields medalist Timothy Gowers, who launched an arXiv-based overlay journal called Discrete Analysis in 2016.

Picture source: http://blogs.lse.ac.uk/impactofsocialsciences/2016/03/21/five-minutes-with-timothy-gowers-on-the-launch-of-discreet-analysis/

Discrete Analysis wasn’t Gowers’ first venture into the fringes of academic practice. In 2009 he decided to experiment with crowdsourcing a solution to a mathematical problem. The initial blog post, titled ‘Is massively collaborative mathematics possible?’, grew into what is now known as the first installment of the Polymath Project, a community of mathematicians and a particular method of doing mathematical research online, in an openly collaborative way. The method has since been applied to other disciplines as well, the Finnish-origin NMR Lipids Project being among the most successful examples.

Both open collaboration and overlay journals have proven to be cost-effective ways to accelerate scientific discovery and increase community engagement in disseminating research results, but I still bet that most of you hadn’t heard of them before (I’d love to be wrong!).

Getting open science from the fringe to the mainstream isn’t blocked by a lack of technology or innovation, as I attempted to highlight with the previous examples (and there are many more), nor by a lack of money, since many of the innovations cost a fraction of what the current system consumes. Rather, the research community is pestered by vicious circles rooted deep in the ways academic merit and social capital are accumulated.

One way to demonstrate the dilemmas we are facing is to look at the issue of research data citation. It can be seen as a litmus test for the research data ecosystem: if all of the components exist and are where they need to be, citing data is fairly straightforward and its practices are easily embedded in research workflows. If not, citation can be very hard to kick-start, because all of the components lean on each other.

Creative Commons License Data citation ecosystem by Heidi Laine is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This is what a data citation ecosystem should ideally look like, as visualized in the research data citation roadmap by the Finnish Committee for Research Data:

  • policy makers and funders provide the resources, oiling the cogs if you will,
  • institutions create enforceable data policies that guide and direct researchers into managing their data in a responsible and sustainable way, meaning producing metadata and storing the data in a trustworthy repository,
  • repositories provide the data with landing pages, which act as a facade for the data, holding the information necessary for reusing and understanding it, as well as with an identifier, which acts as a durable and reliable link to the data,
  • publishers make sure that researchers use these identifiers when they publish based on data, be it their own or someone else’s, so that anyone interested in verifying research results finds the underlying evidence. 

Because of the machine-readable identifiers, data usage can be traced and tracked, researchers get credit for their data outputs, willingly nurture their data and share it whenever they can, and data comes to be considered a first-class research object, worthy of recognition when making funding decisions and determining who gets tenure.
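To make the idea of machine-readable identifiers concrete, here is a minimal sketch of how metadata attached to a persistent identifier can be turned into a human-readable data citation. The field names loosely follow the DataCite metadata schema, and the example record (including the DOI) is invented for illustration:

```python
# Sketch: turning machine-readable dataset metadata into a citation string.
# Field names loosely follow the DataCite schema; the record is hypothetical.

def format_data_citation(metadata):
    """Build a simple 'Creator (Year): Title. Publisher. Identifier' citation."""
    creators = "; ".join(metadata["creators"])
    return (f"{creators} ({metadata['publicationYear']}): "
            f"{metadata['title']}. {metadata['publisher']}. "
            f"https://doi.org/{metadata['doi']}")

example_record = {
    "creators": ["Laine, H."],
    "publicationYear": 2018,
    "title": "Example survey dataset",
    "publisher": "Example Data Repository",
    "doi": "10.1234/example.5678",  # hypothetical DOI
}

print(format_data_citation(example_record))
```

Because the identifier resolves to the dataset’s landing page, the same record can be cited by humans and tracked by machines, which is what makes credit for data outputs traceable at all.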

Sounds like a wonderful dream doesn’t it? At least to a data policy geek like myself it does. Now let’s see where this house of cards crumbles.

A University of Helsinki survey from 2016 revealed that 56% of respondents didn’t store their data in a dedicated digital data repository. Another study conducted among University of Helsinki researchers revealed that creating metadata, a prerequisite for data citation and a foundation of responsible data management, was considered burdensome and thus often neglected. And there is no reason to believe that these challenges are specific to the University of Helsinki, the leading research university in Finland and in many rankings among the top 100 in the world.

According to the Figshare 2018 State of Open Data report, fewer than ten percent of respondents are in the habit of creating data management plans for their research. Only about one in five respondents implement their data management plans more often than not, if they have happened to make one in the first place. 60% of respondents hadn’t heard of the FAIR principles: findability, accessibility, interoperability and reusability. Even though the Figshare survey showed positive attitudes towards data sharing on the rise, that is cold comfort if researchers don’t know or care about the basics of data management.

Creative Commons License Data citation & FAIR data beneficial cycle by Heidi Laine is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

At the moment there just aren’t enough incentives for researchers to manage their data in a way that would support data citation, which would support incentivising data management, which would support data citation, which would support incentivising… You get the picture.

It is no great revelation that incentivizing researchers is one of the central challenges of open science. Lack of incentives is an oft-repeated bottleneck. But usually the issue is framed as incentivizing openness and sharing, which in my view is putting the cart before the horse. Instead the focus should be on incentivizing data stewardship and other sharing-inducing practices, such as cross-border collaboration.

In my dream, when we accomplish that, the openness, and the bang, will follow.
