Published at MetaROR
September 9, 2025
Establishing trust in automated reasoning
1. Centre de Biophysique Moléculaire (UPR4301 CNRS).
Originally published on January 3, 2025 at:
Editors
Ludo Waltman
Adrian Barnett
Editorial assessment
by Adrian Barnett
This summary article does not present new data or experiments but instead takes a broad look at automated reasoning and software. Reviewer #1 thought the article needed much more detail, including citations, examples, screenshots and figures. They were concerned about strong generalisations that lacked evidence and have indicated the places where they wanted these details. Reviewer #2 considers the differences between reviewability and the practicalities of reviewing everything, and how being easily able to build on other software acts as a kind of reproducibility. In my own editorial review, I generally enjoyed reading the paper and it prompted some interesting thoughts on trade-offs with standardisation and the level of detail shown to users of statistical code.
Recommendations from the editor
As a statistician, I am in strong agreement on the widespread inappropriate use of statistical inference (page 2) and the importance of software. I also strongly agree that “independent critical inspection [is] particularly challenging” (page 3). I also strongly agree that “The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission”, as this is everyone’s problem and nobody’s problem.
I also agree that automation has encouraged standardisation and I have personally supported standardisation because some practices are so bad that many authors need to be “standardised”. However, I’ve also felt frustration at the sometimes fussy requirements when uploading R packages to CRAN (https://cran.r-project.org/). Similarly, some blanket changes from CRAN seem pedantic. There’s likely a balance between reducing poor practice and becoming too prescriptive.
In terms of transparency (section 2.4), I did think about the “verbose = TRUE” option that I sometimes see in R. I tend to turn this on, as it’s good to see more of the workings, but perhaps the default is off? I did look at some packages using the Google search: “verbose site:cran.r-project.org/web/packages”. I was also reminded of the difference between Bayesian and frequentist statistical modelling. Frequentist modelling often uses maximum likelihood, which usually runs quickly to create the parameter estimates. In contrast, Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually a sign of a badly specified model, whereas maximum likelihood simply finds a peak. Frustratingly, I often get more pushback from reviewers when using Bayesian methods, whereas in my opinion it should be the other way around, as the Bayesian estimates have shown far more of the inner workings.
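The verbose-flag convention discussed here can be sketched in Python (the function, its workings, and its output format are invented for illustration; many R packages follow the same pattern with a `verbose` argument defaulting to `FALSE`):

```python
import math

def fit_mean(values, verbose=False):
    """Estimate a mean and its standard error, optionally showing workings."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    se = math.sqrt(variance / n)
    if verbose:  # off by default, so the inner workings stay hidden
        print(f"n = {n}, sample variance = {variance:.4f}, SE = {se:.4f}")
    return mean, se

estimate, stderr = fit_mean([1.0, 2.0, 3.0, 4.0], verbose=True)
```

The design trade-off is the one raised above: a quiet default keeps output tidy, but hides exactly the intermediate quantities a reviewer might want to check.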
Some reflection on the growing use of AI to write software may be worthwhile. Presumably this could be more standardised, but there are other concerns. Using automation to check code could also be worthwhile.
For section 3, I thought that more sharing of code would mean “more eyeballs”, but the sharing needs to be done in a FAIR way.
I wondered if highly used software should get more scrutiny. Peer review is a scarce resource, so it is likely better directed towards high-use software. Andrew Gelman recently put forward a similar argument for checking published papers when they reach 250 citations: https://statmodeling.stat.columbia.edu/2025/02/26/pp/.
I agreed with the need for effort (page 19) and wondered if this paper could call for more effort.
Minor comments:
- typo “asses” on page 7.
- “supercomputers are rare”: should this be “relatively rare”, or am I speaking from a privileged university where I’ve always had access to supercomputers?
- I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/).
- Can badges on GitHub showing downloads and maturity help (page 7)? Although far from all software is on GitHub.
Peer review 1
Anonymous User
Thank you for submitting this paper. I think the paper requires substantial, major revisions to be published. Throughout the paper I noted many instances where references or examples would help make the intent clear. I also think the message of the paper would benefit from several figures to demonstrate workflows or ideas. The figures presented are essentially tables, and I think the message could be made clearer for the reader if they were presented as flow charts or at least with clear numbering to hook the ideas to the reader – e.g., Figures 1 & 2 would benefit from having numbers on the key ideas.
The paper is lacking many instances of citation, and at times reads as though it is an essay delivering an opinion. I’m not sure if this is the type of article that the journal would like, but two examples of sentences missing citations are:
- “Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.” (Introduction, page 2)
- “A large number of examples cited in this context involves faulty software or inappropriate use of software” (Introduction, page 3)
Two examples of sentences missing examples are:
- “Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete” (Mature vs. experimental software, page 7). Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”.
- “There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification.”
Overall the paper introduces many new concepts, and I think it would greatly benefit from being made shorter and more concise, with some key figures added for the reader to refer back to in order to understand these new ideas. The paper is well written, and it is clear the author is a great writer who has put a lot of thought into the ideas. However, it is my opinion that because these ideas are so big and require so much unpacking, they are also harder to understand. The reader would benefit from having more guidance to come back to in order to understand these ideas.
I hope this review is helpful to the author.
Review comments
Introduction
Highlight [page 2]: Ever since the beginnings of organized science in the 17th century, researchers are expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.
- and Note [page 2]: Both of these statements feel like they should have some peer review or reference on them, I believe. What were the beginnings of organised science in the 1600s? Why since the 1950s? Why not sooner? What happened then?
Highlight [page 2]: Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.
- and Note [page 2]: I would expect at least a couple of citations here, e.g., Stodden et al., https://www.pnas.org/doi/abs/10.1073/pnas.1708290115
Highlight [page 2]: In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation – and Note [page 2]: Again, it does feel like it would be great to acknowledge research in this space.
Highlight [page 2]: But then, scientists mostly abandoned doubting.
- and Note [page 2]: This feels like an essay; can you show me the evidence that supports a claim like this?
Highlight [page 2]: Automation bias
- and Note [page 2]: What is automation bias?
Highlight [page 3]: A large number of examples cited in this context involves faulty software or inappropriate use of software
- and Note [page 3]: Can you provide some examples of the examples cited that you are referring to here?
Highlight [page 3]: A particularly frequent issue is the inappropriate use of statistical inference techniques.
- and Note [page 3]: Please provide citations to these frequent issues.
Highlight [page 3]: The Open Science movement has made a first step towards dealing with automated reasoning in insisting on the necessity to publish scientific software, and ideally making the full development process transparent by the adoption of Open Source practices – and Note [page 3]: Could you provide an example of one of these Open Science movements?
Highlight [page 3]: Almost no scientific software is subjected to independent review today.
- and Note [page 3]: How can you justify this claim?
Highlight [page 3]: In fact, we do not even have established processes for performing such reviews
- and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, and rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical-software-review-book, and also maintains a very clear process of software review: https://ropensci.org/software-review/
Highlight [page 3]: as I will show
- and Note [page 3]: How will you show this?
Highlight [page 3]: is as much a source of mistakes as defects in the software itself
- and Note [page 3]: Again, this feels like a statement of fact without evidence or citation.
Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.
- and Note [page 3]: The same can be said of assumptions for equations and mathematics – the problem here is dealing with abstraction of complexity and the potential unintended consequences.
Highlight [page 4]: the preservation of epistemic diversity
- and Note [page 4]: Please define epistemic diversity.
Reviewability of automated reasoning systems
Highlight [page 5]: The five dimensions of scientific software that influence its reviewability.
- and Note [page 5]: It might be clearer to number these in the figure, and I might also suggest changing “convivial” – it’s a pretty unusual word.
Wide-spectrum vs. situated software
Highlight [page 6]: In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation
- and Note [page 6]: I’m not very clear on this example – can you provide an example of a “domain library” or “domain tool”?
Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.
- and Note [page 6]: I’m not sure I agree it is always smaller and simpler – the custom code for a new method could be incredibly complicated.
Highlight [page 6]: Domain tools and libraries
- and Note [page 6]: Can you give an example of this?
Mature vs. experimental software
Highlight [page 7]: Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete
- and Note [page 7]: Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”.
Highlight [page 7]: An extreme case of experimental software is machine learning models that are constantly updated with new training data.
- and Note [page 7]: Such as…
Highlight [page 7]: interlocutor
- and Note [page 7]: Suggest “middle man” or “mediator”; ‘interlocutor’ isn’t a very common word.
Highlight [page 7]: A grey zone
- and Note [page 7]: I think it would be helpful to discuss black and white zones before this.
Highlight [page 7]: The libraries of the scientific Python ecosystem
- and Note [page 7]: Do you mean SciPy? https://scipy.org/. Can you provide an example of the frequent changes that break backward compatibility?
Highlight [page 7]: too late that some of their critical dependencies are not as mature as they seemed to be
- and Note [page 7]: Again, can you provide some evidence for this?
Highlight [page 7]: The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.
- and Note [page 7]: Again this feels like an assertion without evidence. Is this an essay, or a research paper?
Convivial vs. proprietary software
Highlight [page 8]: Convivial software [Kell 2020], named in reference to Ivan Illich’s book “Tools for conviviality” [Illich 1973], is software that aims at augmenting its users’ agency over their computation
- and Note [page 8]: It would be really helpful if the author would define the word “convivial” here. It would also be very useful if they went on to give an example of what they meant by: “…software that aims at augmenting its users’ agency over their computation.” How does it augment the user’s agency?
Highlight [page 8]: Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]
- and Note [page 8]: Could you provide an example of what makes “vernacular developers” different, or just what they mean by this term?
Highlight [page 8]: which Illich has described in detail
- and Note [page 8]: Should this have a citation to Illich then in this sentence?
Highlight [page 8]: what has happened with computing technology for the general public
- and Note [page 8]: Can you give an example of this? Do you mean the rise of Apple and Windows? MS Word? Facebook? A couple of examples would be really useful to make this point clear.
Highlight [page 8]: tech corporations
- and Note [page 8]: Suggest “tech corporations” be “technology corporations”.
Highlight [page 8]: Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.
- and Note [page 8]: Can you provide an example of the alternative here, what would be the way to avoid this trap – use software such as Octave, or?
Highlight [page 8]: Historically, the Free Software movement was born in a universe of convivial technology.
- and Note [page 8]: If it is historic, can you please provide a reference to this?
Highlight [page 8]: most of the software they produced and used was placed in the public domain
- and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.
Highlight [page 8]: as they saw legal constraints as the main obstacle to preserving conviviality
- and Note [page 8]: Again, these are conjectures that are lacking a reference or example; can you provide some examples or references for this?
Highlight [page 9]: Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021].
- and Note [page 9]: Can you summarise what makes it difficult to install Open Source Software? Again, this statement feels like it is making a strong generalisation without clear evidence to support this. The article by Courtès (https://hpc.guix.info/blog/2021/09/whats-in-a-package/) actually notes that it’s straightforward to install PyTorch via pip, but using an alternative package manager causes difficulty. The point you are making here seems to be that building and installing most open source software is almost prohibitive, but I don’t think you’ve given strong evidence for this claim, and I don’t understand how this builds into your overall argument.
Highlight [page 9]: It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.
- and Note [page 9]: Can you give an example of how it survives in these communities?
Highlight [page 9]: FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement
- and Note [page 9]: Can you provide some evidence to back this statement up?
Highlight [page 9]: the traditional values of scientific research.
- and Note [page 9]: Can you state what you mean by “traditional values of scientific research”?
Highlight [page 9]: always been convivial
- and Note [page 9]: Can you provide a further explanation of what makes them convivial?
Transparent vs. opaque software
Highlight [page 9]: Transparent software
- and Note [page 9]: It might be useful to explain a distinction between transparent and open software – or to perhaps open with a statement for why we are talking about transparent and opaque software.
Highlight [page 9]: Large language models are an extreme example.
- and Note [page 9]: Based on your definition of transparent software – every action produces a visible result. If I type something into an LLM and get an immediate and visible result, how is this different? It is possible you are stating that the behaviour is able to be easily interpreted, or perhaps the behaviour is easy to understand?
Highlight [page 10]: Even highly interactive software, for example in data analysis, performs nonobvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness.
- and Note [page 10]: Could you give a small example of this?
Highlight [page 10]: It is much easier to develop trust in transparent than in opaque software.
- and Note [page 10]: Can you state why it is easier to develop this trust?
Highlight [page 10]: but also less important
- and Note [page 10]: Can you state why it is less important?
Highlight [page 10]: even a very weak trustworthiness indicator such as popularity becomes sufficient
- and Note [page 10]: becomes sufficient for what? Reviewing? Why does it become sufficient?
Highlight [page 10]: This is currently a much discussed issue with machine learning models,
- and Note [page 10]: Given it is currently much discussed, could you link to at least 2 research articles discussing this point?
Highlight [page 10]: treated extensively in the philosophy of science.
- and Note [page 10]: Given that it has been treated extensively, can you please provide some key references after this statement? You do go on to cite one paper, but it would be helpful to mention at least a few key articles.
Size of the minimal execution environment
Highlight [page 11]: The importance of this execution environment is not sufficiently appreciated by most researchers today, who tend to consider it a technical detail
- and Note [page 11]: This statement is a bit of a sweeping generalisation – why is it not sufficiently appreciated? What evidence do you have of this?
Highlight [page 11]: Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond
- and Note [page 11]: Where have they been only recently recognised?
Highlight [page 11]: However, they have not yet found their way into mainstream computational science.
- and Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements reproducible environments for R with Nix. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
Analogies in experimental and theoretical science
Highlight [page 12]: Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost
- and Note [page 12]: Can you provide an example of this?
Highlight [page 12]: cables
- and Note [page 12]: What do you mean by a cable? As in a computer cable? An electricity cable?
Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.
- and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of domain experts or professional programmers/computer scientists identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.
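The guard-rail role of unit tests raised in this note can be sketched minimally in Python (the function and its test are hypothetical):

```python
def celsius_to_kelvin(c):
    """Convert a temperature from Celsius to Kelvin."""
    return c + 273.15

# A unit test acts as a guard rail: it catches whole classes of defects
# (wrong constant, wrong sign) but cannot prove the function correct
# for every possible input.
def test_celsius_to_kelvin():
    assert celsius_to_kelvin(0) == 273.15
    assert celsius_to_kelvin(-273.15) == 0.0

test_celsius_to_kelvin()
```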
Highlight [page 13]: where “traditional” means not relying on any form of automated reasoning.
- and Note [page 13]: Can you give an example of what a “traditional” scientific model or theory is?
Improving the reviewability of automated reasoning systems
Highlight [page 14]: Figure 2: Four measures that can be taken to make scientific software more trustworthy.
- and Note [page 14]: Could the author perhaps instead call these “four measures”, or perhaps give them a better name, and number them?
Review the reviewable
Highlight [page 14]: mature wide-spectrum software
- and Note [page 14]: Can you give an example of what “mature wide-spectrum software” is?
Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.
- and Note [page 15]: I disagree. Monash provides an example here where they view software as a first-class research output: https://robjhyndman.com/files/EBS_research_software.pdf
Science vs. the software industry
Highlight [page 15]: Many computers, operating systems, and compilers were designed specifically for the needs of scientists.
- and Note [page 15]: Could you give an example of this? E.g., FORTRAN? COBOL?
Highlight [page 15]: Today, scientists use mostly commodity hardware
- and Note [page 15]: Can you explain what you mean by “commodity hardware”, and give an example.
Highlight [page 15]: even considered advantageous if it also creates a barrier to reverse- engineering of the software by competitors
- and Note [page 15]: Can you give an example of this?
Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for
- and Note [page 15]: What about software like SPSS/STATA/SAS – surely many industries, and also researchers, will pay for software like this that is considered mature?
Emphasize situated and convivial software
Highlight [page 16]: a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.
- and Note [page 16]: Could you give an example of what this might look like practically? Are you saying things like SciPy would be restructured into many separate modules, or?
Highlight [page 16]: In terms of FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules.
- and Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc.?
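The fork-based workflow described in this highlight could look like the following with plain git (directory, branch, and file names are invented; this is an illustrative sketch, not the author's proposal in detail):

```shell
# Create a toy "shared" repository, then a study-specific partial fork.
mkdir analysis-code && cd analysis-code
git init -q
git config user.name "Demo" && git config user.email "demo@example.org"

echo "threshold = 0.05" > params.cfg
git add params.cfg
git commit -qm "Shared wide-spectrum base"

git checkout -qb study-x-fork          # the "partial fork" for one study
echo "threshold = 0.01" > params.cfg   # situated, study-specific adaptation
git commit -qam "Specialize threshold for study X"

git log --oneline                      # provenance: what changed, and why
```

The commit history is what provides the provenance tracking the passage mentions; discovering and comparing many such forks is the part for which, as the author notes, tooling is still lacking.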
Highlight [page 17]: a universe is very
- and Note [page 17]: Perhaps this could be “would be very different” – since this doesn’t yet exist, right?
Highlight [page 17]: Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.
- and Note [page 17]: I think some kind of schematic or workflow to compare existing practices to this new practice would be really useful to articulate these points. I also think this new method of development you are proposing should have a concrete name.
Highlight [page 17]: Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team.
- and Note [page 17]: Perhaps an example or screenshot of a diff could be used to demonstrate that we can make these changes between two branches/commits, but comparing multiple is challenging?
Make scientific software explainable
Highlight [page 18]: An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing
- and Note [page 18]: Is this similar to concepts such as “X-AI” or “X-ML” – that is, “Explainable” Artificial Intelligence or Machine Learning?
Highlight [page 18]: Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023],
- and Note [page 18]: It appears that you have introduced “Glamorous Toolkit” as an example of these three principles? It feels like it should be introduced earlier in this paragraph.
Highlight [page 18]: In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks
- and Note [page 18]: I think it would be very beneficial to show screenshots of what the author means – while I can follow the link to Glamorous Toolkit, bitrot is a thing and that might go away, so it would be good to see exactly what the author means when they discuss these examples.
Use Digital Scientific Notations
Highlight [page 18]: There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification
- and Note [page 18]: Can you give an example of these techniques?
Highlight [page 18]: The use of these tools is, for now, reserved to software that is critical for safety or security,
- and Note [page 18]: Again, could you give an example of this point? Which tools, and which software is critical for safety or security?
Highlight [page 19]: formal specifications
- and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.
Highlight [page 19]: All of them are much more elaborate than the specification of the result they produce. They are also rather opaque.
- and Note [page 19]: It isn’t clear to me how these are opaque – if the algorithm is defined, it can be understood; how is it opaque?
Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]
- and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.
Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.
- and Note [page 19]: Is test-driven development an example of this?
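The relationship between a specification and the software it constrains, raised in this note, can be sketched with an executable property in Python (a toy stand-in for a formal specification; real specification languages are far more expressive):

```python
from collections import Counter

def satisfies_sort_spec(sort_fn, xs):
    """Specification for sorting, written independently of any algorithm:
    the output must be ordered and be a permutation of the input."""
    out = sort_fn(list(xs))
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    permutation = Counter(out) == Counter(xs)
    return ordered and permutation

# Any implementation can now be judged against the specification:
assert satisfies_sort_spec(sorted, [3, 1, 2])             # conforms
assert not satisfies_sort_spec(lambda xs: xs, [3, 1, 2])  # does not
```

Note how the specification is much shorter than any sorting algorithm, which echoes the paper's point that specifications are less elaborate than the algorithms that satisfy them.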
Highlight [page 19]: A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction.
- and Note [page 19]: Again, I think a good figure here would be very helpful in articulating this clearly.
Highlight [page 19]: My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features such as scopes that do not exist in natural language, nor in mathematical notation.
- and Note [page 19]: Could we see an example of what this might look like?
Conclusion
Highlight [page 20]: Situated software is easy to recognize.
- and Note [page 20]: Could you provide some examples?
Highlight [page 20]: Examples from the reproducibility crisis support this view
- and Note [page 20]: Can you cite some of the example papers that you mention here?
Highlight [page 21]: The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.
- and Note [page 21]: Again, I think it would be very insightful for the reader to have a clear figure to rest these ideas upon.
Highlight [page 21]: has been neglected by research institutions all around the world
- and Note [page 21]: I do not think this is true – could you instead say “neglected by most/many” perhaps?
Peer review 2
In his article Establishing trust in automated reasoning (Hinsen, 2023), Hinsen argues that much of current scientific software lacks reviewability. Because scientific software has become such a central part of many scientific endeavors, he worries that unreviewed software might contain mistakes which will never be spotted and consequently taint the scientific record. To illustrate this worry he cites issues with reproductions in different fields of science, which are often subsumed under the umbrella term of reproducibility crises. These crises, though not uncontested, have varied sources. In the field of social psychology, for example, reproducibility issues can often be traced to errors in statistical analyses, while shifting baselines and data leakage lead to problems in ML. Hinsen is only concerned with errors in scientific software. He suggests that potential errors could be spotted more easily if scientific software were more reviewable. Thus he proposes five criteria against which reviewability could be judged. I will not discuss them in detail in this commentary and refer the interested reader to Hinsen (2023, section 2) for an extensive discussion. I note, though, that the five criteria are meant to ensure an ideal type of reproducibility which Hinsen defines as follows: “Ideally, each piece of software should perform a well-defined computation that is documented in sufficient detail for its users and verifiable by independent reviewers.” (Hinsen, 2023, p.2). I take the upshot of these criteria to be that one could assert the reviewability of a piece of software before actually doing the review. They could thus function, perhaps contrary to Hinsen’s open science convictions, as a gatekeeping device in a peer review process for software. An editor could “desk reject” software for not fulfilling the criteria before even sending it out to potential reviewers.
If I am correct in this interpretation, then we should treat these criteria with the same caution as we do preregistration.
To be fair, Hinsen envisions a software review process that differs in several ways from current peer review with its acknowledged defects. He says, “Developing suitable intermediate processes and institutions for reviewing such software is perhaps possible, but I consider it scientifically more appropriate to restructure such software into a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.” (Hinsen, 2023, p.16).
Convivial software in turn is supposed to augment “its users’ agency over their computation.” (Hinsen, 2023, p.16). This gives us a hint about the kind of user Hinsen has in mind – it is the software developer as a user. His concept of reviewability aims to make software transparent only to this kind of user (see Hinsen, 2023, p.20). In one of his many comparisons of scientific software to science, he notes that “[...] the main intellectual artifacts of science, i.e. theories and models, have always been convivial.” (Hinsen, 2023, p.9), and we can guess that he wants this to be the case for software too. But scientific theories and models, if convivial at all, have only ever been convivial for scientists. The comparison also works the other way around: science, as much as software, is heavily fragmented into modules (disciplines). Scientists have always relied on the results of other scientists – they often have done so, and still do, without reviewing them. Has this hindered progress? I think one would be hard pressed to answer such a question in general for science, and perhaps it is the same for scientific software.
As Hinsen admits, formal peer review is a quite novel addition to scientific methodology, enforced on a larger scale only for the past fifty years or so. Science progressed for many years without it, so we could ask why scientific software should not do likewise. Hinsen’s answer, of course, has to do with how he grades such software against his reviewability criteria – obviously, most of it scores badly. Most scientific software is neither reviewed nor reviewable, Hinsen claims. This he considers a defect, because only reviewable software has the potential of being reviewed. Many practical considerations he discusses actually speak against the hope that most reviewable software will ever be reviewed. Still, without reviewability, it is hard, if not impossible, to spot mistakes. A case that was recently brought to my attention emphasizes this point. Beheim et al. (2021) point out that a statistical analysis imputed missing values in an archaeo-historical database with the number 0. But for the statistical model (and software!) in use, 0 had a different meaning than “not available”. This casts doubt on the conclusion that was drawn from the model. Beheim et al. were only able to spot this assumption because the code and data were available for review1. Cases like this abound and are examples of the invisible programming values that the philosopher James Moor discussed in the context of computer ethics (see Moor, 1985, The invisibility factor). Hinsen calls such values “tacit assumptions made by software developers” (Hinsen, 2023, p.3). We might speculate, though, about what would have happened if this questionable result had been incorporated into the scientific canon. Would later scientists really have continued building on it without ever realizing their shaky foundations? Or would the whole edifice have had to face the tribunal of experience at some point and crumbled?
Perhaps the originating problem would never have been found and a whole research program would have been abandoned, perhaps a completely different part would have been blamed and excised – hard to say!
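The kind of imputation pitfall reported by Beheim et al. is easy to sketch in code. The numbers below are invented purely for illustration, not taken from the actual study; the point is only that silently substituting 0 for “not available” shifts every downstream estimate.

```python
# Hypothetical illustration: imputing missing values with 0 versus
# treating them as genuinely absent changes the resulting estimate.
values = [12.0, 15.0, None, 14.0, None, 13.0]  # None = not available

# Zero-imputation: the model silently reads 0 as an observed value.
zero_imputed = [v if v is not None else 0.0 for v in values]
mean_zero = sum(zero_imputed) / len(zero_imputed)

# Dropping missing values instead keeps only what was observed.
observed = [v for v in values if v is not None]
mean_observed = sum(observed) / len(observed)

print(mean_zero, mean_observed)  # 9.0 vs 13.5 on this toy data
```

The bug is invisible from the model’s output alone; only an inspection of the code and data, as in Beheim et al.’s review, reveals that 0 was never a legitimate observation.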
But maybe reviewability can also serve a different aim than establishing trust in the results of certain pieces of scientific software. Perhaps it facilitates building on and incorporating pieces of such software in other projects. Its purpose could be more instrumental than epistemic. Although Hinsen seems to worry more about the epistemic problems that come with lack of reviewability, many points he makes implicitly deal with practical problems of software engineering. Whoever has fought against Jupyter notebooks with legacy Python requirements can immediately relate to his wish for keeping the execution environment as small as possible. For Hinsen, software is actually defined by its execution environment (Hinsen, 2023, p.11); thus the complete environment must be available for its reviewability2. Software cannot really be seen as a separate entity, and a review always reviews the whole environment. Analogously to the Quine-Duhem thesis, we could call this situation review holism. But review holism might be less problematic than its scientific cousin suggests. We might not actually need to explicitly review the whole system. Perhaps it is sufficient if we achieve frictionless reproducibility (see Donoho, 2024), that is, other people can more or less easily incorporate and build on the software in question. First, if other software that incorporates the software in question works, that already is a type of successful reproduction. Second, the process by which software evolves might weed out any major errors, and whatever errors remain are perhaps just irrelevant. In all fairness it has to be said that Hinsen does not think this is the case with current software. He argues that “Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.” (Hinsen, 2023, p.13).
But if that is the case, then Hinsen’s later recourse to reliabilist-style justifications for software correctness is blocked too. We are in a situation for which the late Humphreys coined the term strange error (Rathkopf & Heinrichs, 2023, p.5). Strange errors are a challenge for any reliabilist account of justification because their magnitude can easily overwhelm arduously collected reliability assurances. If computational reliabilism were just reliabilism, and Hinsen seems to take it as such3, it would suffer from this problem too. But computational reliabilism has an additional internalist component, which explicitly allows for the whole toolbox of “rationalist” software verification methods. If possible, we should learn something about our tools other than their mere reliability. As Hacking said, “[To understand] whether one sees through a microscope, one needs to know quite a lot about the tools.” (Hacking, 1981, p.135).
I would go so far as to say that, if available, internalist justifications are preferable to reliabilistic guarantees. It is only that often they are not available, and then we might content ourselves with the guarantees reliabilism provides. I said might content here, because such guarantees are unlikely to satisfy the skeptic. Obviously, strange errors are always a possibility, and no finite observation of correct software behaviour can completely rule them out. But in practice such concerns tend to fade over time, although they provide opportunity for unchecked philosophical skepticism. Many discussions about software opacity feed on such skepticism, and this is what I tried to balance with computational reliabilism. In this spirit, computational reliabilism was an attempt to temper theoretical skeptics in philosophy, not to give normative guidance to software engineering practice. My view has always been that practice has the last say over philosophical concerns. If the emerging view in software engineering practice now is that more skepticism is appropriate, I will happily concur. But I should like to remind the practitioner that evidence for such skepticism has to be given in practice too; mere theoretical possibilities are not sufficient to establish it.
Reviewability does not mean reviewed. And only reviews can give us trust – or so we might think. As Hinsen acknowledges, we should not expect that a majority of scientific software will ever be reviewed. Does this mean we cannot trust the results from such software? Above I tried to sketch a way out of this conundrum: we can view reviewability as advocated by Hinsen as a way to enable frictionless reproducibility, which in turn lets us build upon software, incorporate it in our own projects, and use its results. As long as it works in a practically fulfilling way, this might be all the reviewing we need.
Notes
1A statistician once told me that one glance at the raw data of this example immediately made clear to him that whatever problem there was with imputation, the data would never have supported the desired conclusions in any way. One man’s glance is another’s review.
2Hinsen’s definition of software closely parallels that of Moor, who argued that computer programs are a relation between a computer, a set of instructions and an activity (Moor, 1978, p.214).
3Hinsen characterizes computational reliabilism as follows: “As an alternative source of trust, they propose computational reliabilism, which is trust derived from the experience that a computational procedure has produced mostly good results in a large number of applications.” (Hinsen, 2023, p.10)
References
Beheim, B., Atkinson, Q. D., Bulbulia, J., Gervais, W., Gray, R. D., Henrich, J., Lang, M., Monroe, M. W., Muthukrishna, M., Norenzayan, A., Purzycki, B. G., Shariff, A., Slingerland, E., Spicer, R., & Willard, A. K. (2021). Treatment of missing data determined conclusions regarding moralizing gods. Nature, 595 (7866), E29–E34. https://doi.org/10.1038/s41586-021-03655-4
Donoho, D. (2024). Data Science at the Singularity. Harvard Data Science Review, 6 (1). https://doi.org/10.1162/99608f92.b91339ef
Hacking, I. (1981). Do We See Through a Microscope? Pacific Philosophical Quarterly, 62 (4), 305–322. https://doi.org/10.1111/j.1468-0114.1981.tb00070.x
Hinsen, K. (2023, July). Establishing trust in automated reasoning. https://doi.org/10.31222/osf.io/nt96q
Moor, J. H. (1978). Three Myths of Computer Science. The British Journal for the Philosophy of Science, 29 (3), 213–222. https://doi.org/10.1093/bjps/29.3.213
Moor, J. H. (1985). What is computer ethics? Metaphilosophy, 16 (4), 266–275. https://doi.org/10.1111/j.1467-9973.1985.tb00173.x
Rathkopf, C., & Heinrichs, B. (2023). Learning to Live with Strange Error: Beyond Trustworthiness in Artificial Intelligence Ethics. Cambridge Quarterly of Healthcare Ethics, 1–13. https://doi.org/10.1017/S0963180122000688
Author response
Dear editors and reviewers, Thank you for your careful reading of my manuscript and the detailed and insightful feedback. It has contributed significantly to the improvements in the revised version. Please find my detailed responses below.
1 Reviewer 1
Thank you for this helpful review, and in particular for pointing out the need for more references, illustrations, and examples in various places of my manuscript. In the case of the section on experimental software, the search for examples made clear to me that the label was in fact badly chosen. I have relabeled the dimension as “stable vs. evolving software”, and rewritten the section almost entirely. Another major change motivated by your feedback is the addition of a figure showing the structure of a typical scientific software stack (Fig. 2), and of three case studies (section 2.7) in which I evaluate scientific software packages according to my five dimensions of reviewability. The discussion of conviviality (section 2.4), a concept that is indeed not widely known yet, has been much expanded. I have followed the advice to add references in many places. I have been more hesitant to follow the requests for additional examples and illustrations, because of the inevitable conflict with the equally understandable request to make the paper more compact. In many cases, I have preferred to refer to examples discussed in the literature. A few comments deserve a more detailed reply:
Introduction
Highlight [page 3]: In fact, we do not even have established processes for performing such reviews
and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical-software-review-book, and also maintain a very clear process of software review: https://ropensci.org/software-review/
As I say in the section “Review the reviewable”, these reviews are not independent critical examination of the software as I define it. Reviewers are not asked to evaluate the software’s correctness or appropriateness for any specific purpose. They are expected to comment only on formal characteristics of the software publication process (e.g. “is there a license?”), and on a few software engineering quality indicators (“is there a test suite?”).
Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.
and Note [page 3]: The same can be said of assumptions for equations and mathematics- the problem here is dealing with abstraction of complexity and the potential unintended consequences.
Indeed. That’s why we need someone other than the authors to go through mathematical reasoning and verify it. Which we do.
Reviewability of automated reasoning systems
Wide-spectrum vs. situated software
Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.
and Note [page 6]: I’m not sure I agree it is always smaller and simpler- the custom code for a new method could be incredibly complicated.
The comparison is between situated software and more generic software performing the same operation. For example, a script reading one specific CSV file compared to a subroutine reading arbitrary CSV files. I have yet to see a case in which abstraction from a concrete to a generic function makes code smaller or simpler.
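The contrast in this response can be sketched in code. The filenames and column layout below are invented for illustration; the point is only that the situated script can assume everything about its one file, while the generic reader must parameterize what the script hard-codes, and so is necessarily longer and harder to audit.

```python
import csv

# Situated: reads one specific file whose layout is known in advance
# (header line, temperature in the second column). A reviewer can
# check it against that one file in seconds.
def read_temperatures(path="lab_run_2024.csv"):  # hypothetical file
    with open(path) as f:
        next(f)  # skip the single known header line
        return [float(line.split(",")[1]) for line in f]

# Generic: must cope with arbitrary files, so it delegates to the csv
# module, parameterizes the header and delimiter, and returns untyped
# rows because it cannot know the columns' meanings.
def read_any_csv(path, has_header=True, delimiter=","):
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))
    return rows[1:] if has_header else rows
```

Even this modest generic version accumulates decisions (header handling, delimiter, typing) that the situated script simply does not need to make, which is the sense in which the situated version is smaller and simpler.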
Convivial vs. proprietary software
Highlight [page 8]: most of the software they produced and used was placed in the public domain
and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.
Software distribution in science was well organized long before the Internet, it was just slower and more expensive. Both decks of punched cards and magnetic tapes were routinely sent by mail. The earliest organized software distribution for science I am aware of was the DECUS Software Library in the early 1960s.
Size of the minimal execution environment
Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?
I have looked for quantitative studies on software use in science that would allow me to give a precise meaning to “mainstream”, but I have not been able to find any. Based on my personal experience, mostly with teaching MOOCs on computational science in which students are asked about the software they use, the most widely used platform is Microsoft Windows. Linux is already a minority platform (though overrepresented in computer science), and Nix users are again a small minority among Linux users.
Analogies in experimental and theoretical science
Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.
and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.
There are probably cases of microscopists not noticing defects, but my point is that if you ask them to look for defects, they know what to do (and I have made this clearer in my text). For contrast, take GROMACS (one of my case studies in the revised manuscript) and ask either an expert programmer or an experienced computational biophysicist if it correctly implements, say, the AMBER force field. They wouldn’t know what to do to answer that question, both because it is ill-defined (there is no precise definition of the AMBER force field) and because the number of possible mistakes and symptoms of mistakes is enormous. I have seen a protein simulation program fail for proteins whose number of atoms was in a narrow interval, defined by the size that a compiler attributed to a specific data structure. I was able to catch and track down this failure only because a result was obviously wrong for my use case. I have never heard of similar issues with microscopes.
Improving the reviewability of automated reasoning systems
Review the reviewable
Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.
and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf
This example is about superficial reviews in the context of career evaluation. Other institutions have similar processes. As far as I know, none of them ask reviewers to look at the actual code and comment on its correctness or its suitability for some specific purpose.
Science vs. the software industry
Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for
and Note [page 15]: What about software like SPSS/STATA/SAS – surely many industries, and also researchers, will pay for software like this that is considered mature?
I could indeed extend the list of examples to include various industries. Compared to the huge number of individuals using PCs and smartphones, that’s still few customers.
Emphasize situated and convivial software
Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?
I have decided the contrary: I have significantly shortened this section, removing all speculation about how the ideas could be turned into concrete technology. The reason is that I have been working on this topic since I wrote the reviewed version of this manuscript, and I have a lot more to say about it than would be reasonable to include in this work. This will become a separate article.
Make scientific software explainable
Note [page 18]: I think it would be very beneficial to show screenshots of what the author means – while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would be good to see exactly what the author means when they discuss these examples.
Unfortunately, static screenshots can only convey a limited impression of Glamorous Toolkit, but I agree that they are a more stable support than the software itself. Rather than adding my own screenshots, I refer to a recent paper by the authors of Glamorous Toolkit that includes many screenshots for illustration.
Use Digital Scientific Notations
Highlight [page 19]: formal specifications
and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.
Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]
and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.
I do give an example: sorting a list. To write down an actual formalized version, I’d have to introduce a formal specification language and explain it, which I think goes well beyond the scope of this article. Illustrating modularity requires an even larger example. This is, however, an interesting challenge which I’d be happy to take up in a future article.
Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.
and Note [page 19]: Is an example of this test drive development?
Not exactly, though the underlying idea is similar: provide a condition that a result must satisfy as evidence for being correct. With testing, the condition is spelt out for one specific input. In a formal specification, the condition is written down for all possible inputs.
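The distinction drawn here can be sketched in code. The following is a minimal illustration (identifiers invented), using a property-based check as an executable stand-in for a formal specification: the unit test pins down the correct answer for one input, while the specification-style check states the correctness condition itself, sortedness plus being a permutation of the input, for any input.

```python
import random
from collections import Counter

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# Unit test: the correctness condition is spelt out for ONE input.
def test_sort_one_input():
    assert sorted([3, 1, 2]) == [1, 2, 3]

# Specification-style check: the condition -- the output is sorted and
# is a permutation of the input -- is stated for ALL inputs. In
# executable form we can only sample random inputs; a true formal
# specification would quantify over every possible list.
def check_sort_spec(sort, trials=100):
    for _ in range(trials):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        ys = sort(xs)
        assert is_sorted(ys), "output must be sorted"
        assert Counter(ys) == Counter(xs), "output must be a permutation"

test_sort_one_input()
check_sort_spec(sorted)
```

Note that the specification never says how to sort; it only constrains what any correct result must look like, which is the sense in which specifications act as constraints.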
2 Reviewer 2
First of all, I would like to thank the reviewer for this thoughtful review. It addresses many points that require clarification in my article, which I hope to have done adequately in the revised version.
One such point is the role and form of reviewing processes for software. I have made it clearer that I take “review” to mean “critical independent inspection”. It could be performed by the user of a piece of software, but the standard case should be a review performed by experts at the request of some institution that then publishes the reviewer’s findings. There is no notion of gatekeeping attached to such reviews. Users are free to ignore them. Given that today, we publish and use scientific software without any review at all, the risk of shifting to the opposite extreme of having reviewers become gatekeepers seems unlikely to me.
Your comment on users being software developers addresses another important point that I had failed to make clear: conviviality is all about diminishing the distinction between developers and users. Users gain agency over their computations at the price of taking on more of a developer role. This is now stated explicitly in the revised article. Your hypothesis that I want scientific software to be convivial is only partially true. I want convivially structured software to be an option for scientists, with adequate infrastructure and tooling support, but I do not consider it to be the best approach for all scientific software.
The paragraph on the relevance and importance of reviewing in your comment is a valid point of view but, unsurprisingly, not mine. In the grand scheme of science, no specific quality assurance measure is strictly necessary. There is always another layer above that will catch mistakes that weren’t detected in the layer below. It is thus unlikely that unreliable software will cause all of science to crumble. But from many perspectives, including overall efficiency, personal satisfaction of practitioners, and insight derived from the process, it is preferable to catch mistakes as closely as possible to their source. Pre-digital theoreticians have always double-checked their manual calculations before submitting their papers, rather than sending off unchecked results and counting on confrontation with experiment to find mistakes. I believe that we should follow this same approach with software. The cost of mistakes can be quite high. Consider the story of the five retracted protein structures that I cite in my article (Miller, 2006, 10.1126/science.314.5807.1856). The five publications that were retracted involved years of work by researchers, reviewers, and editors. Between their publication and their retraction, other protein crystallographers saw their work rejected because it was in contradiction with the high-profile articles that later turned out to be wrong. The whole story has probably involved a few ruined careers in addition to its monetary cost. In contrast, independent critical examination of the software and the research processes in which it was used would likely have spotted the problem rather quickly (Matthews, 2007).
You point out that reviewability is also a criterion in choosing software to build on, and I agree. Building on other people’s software requires trusting it. Incorporating it into one’s own work (the core principle of convivial software) requires understanding it. This is in fact what motivated my reflections on this topic. I am not much interested in neatly separating epistemic and practical issues. I am a practitioner, my interest in epistemology comes from a desire for improving practices.
Review holism is something I have not thought about before. I consider it both impossible to apply in practice and of little practical value. What I am suggesting, and I hope to have made this clearer in my revision, is that reviewing must take into account the dependency graph. Reviewing software X requires a prior review of its dependencies (possibly already done by someone else), and a consideration of how each dependency influences the software under consideration. However, I do not consider Donoho’s “frictionless reproducibility” a sufficient basis for trust. It has the same problem as the widespread practice of tacitly assuming a piece of software to be correct because it is widely used. This reasoning is valid only if mistakes have a high chance of being noticed, and that’s in my experience not true for many kinds of research software. “It works”, when pronounced by a computational scientist, really means “There is no evidence that it doesn’t work”.
This is also why I point out the chaotic nature of computation. It is not about Humphreys’ “strange errors”, for which I have no solution to offer. It is about the fact that looking for mistakes requires some prior idea of what the symptoms of a mistake might be. Experienced researchers do have such prior ideas for scientific instruments, and also e.g. for numerical algorithms. They come from an understanding of the instruments and their use, including in particular a knowledge of how they can go wrong. But once your substrate is a Turing-complete language, no such understanding is possible any more. Every programmer has had the experience of chasing down some bug that at first sight seems impossible. My long-term hope is that scientific computing will move towards domain-specific languages that are explicitly not Turing-complete, and offer useful guarantees in exchange. Unfortunately, I am not aware of any research in this space.
I fully agree with you that internalist justifications are preferable to reliabilistic ones. But being fundamentally a pragmatist, I don’t care much about that distinction. Indisputable justification doesn’t really exist anywhere in science. I am fine with trust that has a solid basis, even if there remains a chance of failure. I’d already be happy if every researcher could answer the question “why do you trust your computational results?” in a way that shows signs of critical reflection.
What I care about ultimately is improving practices in computational science. Over the last 30 years, I have seen numerous mistakes being discovered by chance, often leading to abandoned research projects. Some of these mistakes were due to software bugs, but the most common cause was an incorrect mental model of what the software does. I believe that the best technique we have found so far to spot mistakes in science is critical independent inspection. That’s why I am hoping to see it applied more widely to computation.
2.1 References
Miller, G. (2006) A Scientist’s Nightmare: Software Problem Leads to Five Retractions. Science 314, 1856. https://doi.org/10.1126/science.314.5807.1856
Matthews, B.W. (2007) Five retracted structure reports: Inverted or incorrect? Protein Science 16, 1013. https://doi.org/10.1110/ps.072888607
3 Editor
Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model…
That is an interesting observation I haven’t seen mentioned before. I agree that Bayesian inference is particularly amenable to inspection. One more reason to normalize inspection and inspectability in computational science.
Some reflection on the growing use of AI to write software may be worthwhile.
The use of AI in writing and reviewing software is a topic I have considered for this review, since the technology has evolved enormously since I wrote the current version of the manuscript. However, in view of reviewer 1’s constant admonition to back up statements with citations, I refrained from delving into this topic. We all know it’s happening, but it’s too early to observe a clear impact on research software. I have therefore limited myself to a short comment in the Conclusion section.
I wondered if highly-used software should get more scrutiny.
This is an interesting suggestion. If and when we get serious about reviewing code, resource allocation will become an important topic. For getting started, it’s probably more productive to review newly published code than heavily used code, because there is a better chance that authors actually act on the feedback and improve their code before it has many users. That in turn will help improve the reviewing process, which is what matters most right now, in my opinion.
“supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.
If you have easy access to a supercomputer, you should indeed consider yourself privileged. But did you ever use supercomputer time to review someone else’s work? I have relatively easy access to supercomputers as well, but I do have to make a request and promise to do innovative research with the allocated resources.
I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)
I hadn’t seen “testthat” before, not being much of a user of R. It looks interesting, and reminds me of similar test support features in Smalltalk which I found very helpful. Improving testing culture is definitely a valuable contribution to improving computational practices.
Can badges on github about downloads and maturity help (page 7)?
Badges can help, on GitHub or elsewhere, e.g. in scientific software catalogs. I see them as a coarse-grained output of reviewing. The right balance to find is between the visibility of a badge and the precision of a carefully written review report. One risk with badges is the temptation to automate the evaluation that leads to it. This is fine for quantitative measures such as test coverage, but what we mostly lack today is human expert judgement on software.