Published at MetaROR

September 9, 2025

Cite this article as:

Hinsen, K. (2023, July 6). Establishing trust in automated reasoning. https://doi.org/10.31222/osf.io/nt96q

Establishing trust in automated reasoning

Konrad Hinsen1

1. Centre de Biophysique Moléculaire (UPR4301 CNRS).

Originally published on January 3, 2025.

Abstract

Since its beginnings in the 1940s, automated reasoning by computers has become a tool of ever growing importance in scientific research. So far, the rules underlying automated reasoning have mainly been formulated by humans, in the form of program source code. Rules derived from large amounts of data, via machine learning techniques, are a complementary approach currently under intense development. The question of why we should trust these systems, and the results obtained with their help, has been discussed by early practitioners of computational science, but was later forgotten. The present work focuses on independent reviewing, an important source of trust in science, and identifies the characteristics of automated reasoning systems that affect their reviewability. It also discusses possible steps towards increasing reviewability and trustworthiness via a combination of technical and social measures.

1 Introduction

Like all social processes, scientific research builds on trust. In order to increase humanity’s knowledge and understanding, scientists need to trust their colleagues, their institutions, their tools, and the scientific record. Moreover, science plays an increasingly important role in industry and public policy. Decision makers in these spheres must therefore be able to judge which of the scientific findings that matter for them are actually trustworthy.

In addition to the trust-forming mechanisms present in all social relationships, the scientific method is built in particular on transparency and independent critical inspection, which serve to remove the inevitable mistakes and biases in individual contributions as they enter the scientific record. Ever since the beginnings of organized science in the 17th century, researchers have been expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.

Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection. This so-called “reproducibility crisis” has shown that our practices for performing, publishing, reviewing, and interpreting scientific studies are no longer adequate in today’s scientific research landscape, whose social, technological, and economic contexts have changed dramatically. Updating these processes is a major aspect of the nascent Open Science movement.

The topic of this article is a particularly important recent change in research practices: the increasing use of automated reasoning. Computers and software have led to the development of completely new techniques for scientific investigation, and permitted existing ones to be applied at larger scales and by a much larger number of researchers. In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation. Simulation, data analysis, and statistical inference have found their place in almost every researcher’s toolbox. Machine learning techniques, currently under intense development, may well become equally ubiquitous in the near future.

From the point of view of transparency and critical inspection, these new tools are highly problematic. Ideally, each piece of software should perform a well-defined computation that is documented in sufficient detail for its users and verifiable by independent reviewers. Furthermore, users of software should receive adequate training to ensure that they understand the software's operation and in particular its limitations. When affordable desktop workstations brought computation into the hands of a rapidly increasing number of scientists, in the 1980s, this was a topic of debate, summarized by Turkle [Turkle 2009] as "the tension between doing and doubting". But then, scientists mostly abandoned doubting. Automation bias [Parasuraman and Riley 1997] is one possible cause; another is resignation to the impossibility of constructive doubt. The reproducibility crisis could have been a wake-up call, but the contribution of automated reasoning to this crisis has not been widely recognized. A large number of the examples cited in this context involve faulty software or inappropriate use of software. A particularly frequent issue is the inappropriate use of statistical inference techniques. They are available at the click of a button to a large number of researchers, many of whom do not even know what they would need to learn in order to use these techniques correctly. Beyond reproducibility, the documented cases of faulty automated reasoning [e.g. Merali 2010] are probably just the tip of the iceberg, and pessimistic but not unrealistic estimates suggest that most computational results in science are to some degree wrong because of defects in automated reasoning techniques [Soergel 2015; Thimbleby 2023].

The Open Science movement has taken a first step towards dealing with automated reasoning by insisting on the need to publish scientific software, and ideally on making the full development process transparent through the adoption of Open Source practices. While this level of transparency is a necessary condition for critical inspection, it is not sufficient. Almost no scientific software is subjected to independent review today. In fact, we do not even have established processes for performing such reviews. Moreover, as I will show, much of today's scientific software is written in a way that makes independent critical inspection particularly challenging if not impossible. If we want scientific software to become trustworthy, we therefore have to develop reviewing practices in parallel with software architectures that make reviewing actually feasible in practice. And where reviewing is not possible, we must acknowledge the experimental nature of automated reasoning processes and make sure that everyone looking at their results is aware of their uncertain reliability.

As for all research tools, it is not only the software itself that requires critical inspection, but also the way the software is used in a specific research project. Improper use of software, or inappropriateness of the methods implemented by the software, is as much a source of mistakes as defects in the software itself. However, the distinction between a defect and inappropriate use is not as obvious as it may seem. A clear distinction would require a well-defined interface between software and users, much like a written contract. If the software’s behavior deviates from this contract, it’s a defect. If the user’s needs deviate from the contract, it’s inappropriate use. But such detailed contracts, called specifications in the context of software, rarely exist. Even outside of science, the cost of writing, verifying, and maintaining specifications limits their use to particularly critical applications. This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.
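To make this concrete, here is a minimal, hypothetical Python sketch of such a tacit assumption; the function and its docstring are illustrative inventions, not taken from any real package:

```python
def mean(values):
    """Return the arithmetic mean of a sequence of numbers."""
    # Tacit assumption, expressed only in the code: the sequence is non-empty.
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))   # 2.0, as a user would expect
# mean([]) raises ZeroDivisionError. The docstring, acting as the only
# "contract", says nothing about this edge case, so one cannot decide
# whether the failure is a defect or an inappropriate use.
```

The docstring here plays the role of an informal specification; the behavior for an empty sequence exists only in the source code, which is exactly the kind of mismatch a reviewer of software use has to look for.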

The scientific requirement of independent reviewing is related to another aspect of automated reasoning that I will address, in particular in my proposals for improving our current practices: the preservation of epistemic diversity. As Leonelli has pointed out [Leonelli 2022], the Open Science movement has so far largely neglected this aspect. Epistemic diversity is about different perspectives and research methodologies coexisting, enriching and critiquing each other. Automation, be it in industry or in research, tends to reduce diversity by encouraging standardization that enables economies of scale. In the Open Science movement, this tendency is implicit in the quest for reusability, one of the four FAIR principles [Barker et al. 2022]. Reusing someone else's code or data requires adopting the authors' methodologies, and to some degree their general perspective on the phenomenon under study. In the extreme case of a single software package being used by everyone in a research community, there is nobody left who could provide critical feedback.

This article has two main parts. In the first part (section 2), I look at the factors that make automated reasoning more or less reviewable. It is a critical examination of the state of the art in scientific software and its application, which should help scientists to get a better grasp of how reliable automated reasoning can be expected to be. In the second part (section 3), I consider how the reviewability of automated reasoning can be improved, both through better reviewing processes and by restructuring software for better reviewability.

2 Reviewability of automated reasoning systems

Automated reasoning can play different roles in scientific research, with different reliability requirements.1 The numerical preprocessing of observational data before scientific analysis, nowadays often integrated into scientific instruments, is an example where high reliability is required, because its outputs are used without any further verification. On the other hand, protein structure prediction by AlphaFold [Jumper et al. 2021] is known to be unreliable, but it is nevertheless very useful if coupled with experimental validation of its predictions [Nielsen 2023]. Traditional computer simulation is often used similarly in biology as a hypothesis generator whose outputs require subsequent validation, whereas in engineering, simulations of mechanical systems are routinely performed to support critical decisions, thus requiring high reliability.

What these examples illustrate is that tools, processes, and results in science do not necessarily have to be perfectly reliable. Higher-level validation processes act much like error correction protocols in engineering. The coherence of multiple approaches to a question, coming from different perspectives, is another higher-level source of reliability, indicating robustness. This again illustrates the importance of epistemic diversity that I have mentioned in the introduction. What matters, however, is a clear understanding of the reliability of individual scientific contributions, which in turn requires a clear understanding of the reliability of the tools and processes on which those contributions are based.

In this section, I discuss five characteristics (summarized in Fig. 1) of automated reasoning systems that influence how their reliability can be assessed by independent critical inspection, which in the following I will call review for brevity. This use of review, inspired by the tradition of scientific peer review, should not be confused with the software engineering technique of code review, which is a quality control step performed internally by a development team. Also for brevity, I will use the term software instead of “automated reasoning system”, extending its usual meaning to include trained neural networks and other models obtained via machine learning techniques. The difference between these two categories in science is a difference in degree rather than kind, as already long before machine learning, many computational models relied on parameters fitted to large datasets.

Figure 1: The five dimensions of scientific software that influence its reviewability.

2.1 Wide-spectrum vs. situated software

Wide-spectrum software provides fundamental computing functionality to a large number of users. In order to serve a large user base, it addresses a wide range of application scenarios, each of which requires only a part of the software's functionality. Word processors are a well-known example: a package like LibreOffice can be used to write a simple letter, but also a complex book. LibreOffice has huge menus filled with various functions, of which most users only know the handful that matters to them. General-purpose large language models are another example of wide-spectrum software.

Situated software (a term introduced by Shirky [Shirky 2004]) is software written for a specific use case or a specific user group. It addresses a specific need very well, but is not transferable to other application scenarios. Spreadsheets are usually situated, as are games, and many shell scripts.

A useful numerical proxy for estimating a software package’s location on this scale is the ratio of the number of users to the number of developers, although there are exceptions. Games, for example, are situated software with few developers but many users.

In scientific computing, the wide-spectrum end of the scale is well illustrated by mathematical libraries such as BLAS or visualization libraries such as matplotlib, which provide a large collection of functions from which application developers pick what they need. At the situated end, we have the code snippets and scripts that generate the plots shown in a paper, as well as computational notebooks and computational workflows. In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation. Many ongoing discussions of scientific software, in particular concerning its sustainability [Hettrick 2016], concentrate on these domain libraries and tools, but are not always explicit about this focus.

Reviewing wide-spectrum software represents a major effort, because of its size and functional diversity. Moreover, since wide-spectrum software projects tend to be long-lived, with the software evolving to adapt to new use cases and new computing platforms, its critical examination must be an ongoing process as well. On the other hand, this effort can be considered a good investment, because of the large user base such software has.

Situated software is smaller and simpler, which makes it easier to understand and thus to review. However, its evaluation can only be done in the specific context for which the software was written. This suggests integrating it into the existing scientific peer reviewing process, along with papers and other artifacts that result from a research project.

It is the intermediate forms of software that are most difficult to review. Domain tools and libraries are too large and complex to be evaluated in a single session by a single person, as is expected in peer review as it is practiced today by journals. However, they don’t have a large enough user base to justify costly external audits, except in contexts such as high-performance computing where the importance of the application and the high cost of the invested resources also justify more careful verification processes.

2.2 Mature vs. experimental software

Mature software is developed and maintained with the goal of providing a reliable tool. Signs of maturity in software are its age, a clear definition of its purpose, respect of standards, respect of software engineering practices, detailed documentation, and a low frequency of compatibility-breaking changes. The Linux kernel and the text editor Emacs are examples of very mature software.

Experimental software is developed and maintained to test new ideas, be they technical (software architecture etc.) or related to the application domain. Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete. Users therefore have to stay in touch with the developer community, both to be informed about changes and to have an interlocutor in case of unexpected behavior. An extreme case of experimental software is machine learning models that are constantly updated with new training data.

Infrastructure software, i.e. packages that much other software depends on, should by definition be mature, and much of it is. This applies both to general-purpose infrastructure, such as the Linux kernel or the GNU Compiler Collection, and to scientific infrastructure, such as BLAS or HDF5. A grey zone is occupied by prototypes for future infrastructure software, such as the Julia programming language, which typically don't advertise their experimental nature and are easily taken to be mature by inexperienced users. There is also software that clearly positions itself as infrastructure but lacks the required maturity. Such software is often a cause of computational irreproducibility. The libraries of the scientific Python ecosystem are an example, suffering from frequent changes that break backward compatibility. Most users of these tools are unaware of these issues, and often find out too late that some of their critical dependencies are not as mature as they seemed to be.

Scientific domain libraries and tools tend to be in the middle of the spectrum, or try to cover a large part of the spectrum. There is an inevitable tension between providing reliable code for others to build on and implementing innovative computational techniques. Often the targeted user community for these two goals is the same. Pursuing both goals in the same software project can then be the least-effort approach, but it also makes the reliability of the software difficult to assess both by users and by outside reviewers.

Experimental software is, by its very nature, very difficult to review independently, unless it is small. This is not very different in principle from evaluating experiments that use prototypes for scientific instrumentation. The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.

2.3 Convivial vs. proprietary software

Convivial software [Kell 2020], named in reference to Ivan Illich's book "Tools for Conviviality" [Illich 1973], is software that aims at augmenting its users' agency over their computation. Malleable software is a very similar concept, as is re-editable software, a term introduced by Donald Knuth in an interview in opposition to reusable, i.e. off-the-shelf, software [Hinsen 2018a]. In contrast, proprietary software offers users fixed functionality and therefore limited agency. At first sight it looks like users should always prefer convivial software, but agency comes at a price: users have to invest more learning effort and assume responsibility for their modifications. Just as most people prefer to choose from a limited range of industrially made refrigerators rather than build their own to their precise needs, most computer users are happy to use ready-made e-mail software rather than write their own.

In the academic literature on software engineering, convivial software is discussed with the focus on its developers, most commonly referred to as end user programmers [Nardi 1993; Ko et al. 2011]. Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]. The subfield of end user software engineering aims at providing vernacular developers with methods and tools to improve the quality of their software, recognizing that the methods and tools designed for software professionals are usually not adapted to their needs.

The risk of proprietary technology, which Illich has described in detail, is that widespread adoption makes society as a whole dependent on a small number of people and organizations who control the technology. This is exactly what has happened with computing technology for the general public. You may not want to let tech corporations spy on you via your smartphone, but the wide adoption of these devices means that you are excluded from more and more areas of life if you decide not to use one. Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.

In between convivial and proprietary software, we have Free, Libre, and Open Source software (FLOSS). Historically, the Free Software movement was born in a universe of convivial technology. The few computer users in academia in the 1980s typically also had programming skills, and most of the software they produced and used was placed in the public domain. The arrival of proprietary software in their lives, exemplified by the frequently cited proprietary printer driver at MIT [2002], pushed them towards formalizing the concept of Free Software in terms of copyright and licensing, as they saw legal constraints as the main obstacle to preserving conviviality.

With the enormous complexification of software over the following decades, a license is no longer sufficient to keep software convivial in practice. The right to adapt software to your needs is of limited value if the effort to do so is prohibitive. Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021]. Conviviality has become a marginal subject in the FLOSS movement, with the Free Software subcommunity pretending that it remains ensured by copyleft licenses and much of the Open Source subcommunity not considering it important. It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.

In scientific computing, there is a lot of diversity on this scale. Fully proprietary software is common, but so are variants that allow users to look at the source code but don't allow them to compile it, or don't allow the publication of reviews. In computational chemistry, the widely used Gaussian software is an example of such legal constraints [Hocquet and Wieber 2017]. FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement. Somewhat surprisingly, the move beyond FLOSS to convivial software is hardly ever envisaged, in spite of it being aligned with the traditional values of scientific research. Before the arrival of computers, the main intellectual artifacts of science, i.e. theories and models, were always convivial.

Concerning reviewing, the convivial-to-open part of the scale is similar to the situated-to-wide-spectrum scale: convivial software is easier to understand and therefore easier to review, but each specific adaptation of convivial software requires its own review, whereas open but not convivial software makes reviewing a better investment of effort. Fully proprietary software is very hard to review, because only its observed behavior and its documentation are available for critical inspection.

2.4 Transparent vs. opaque software

Transparent software is software whose behavior is readily observable. In a word processor, or a graphics editor, every user action produces an immediately visible result. In contrast, opaque software operates behind the scenes and produces output whose interpretation and correctness are not obvious, nor easily related to the inputs. Large language models are an extreme example.

Strictly speaking, transparency is not a characteristic of a piece of software, but of a computational task. A single piece of software may contain both transparent and opaque functionality. Taking the word processor as an example, inserting a character is highly transparent, whereas changing the page layout is more opaque, creating the possibility of subtle bugs whose impact is not readily observable. I use the term “opaque software” as a shorthand for “software implementing at least one opaque operation”.

Most scientific software is closer to the opaque end of the spectrum. Even highly interactive software, for example in data analysis, performs non-obvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness. As a rough guideline, the more scientific models or observational data have been integrated into a piece of software, the more opaque the behavior of the software is to its users. Since these are also the ingredients that make a piece of software scientific, it is not surprising that opacity is the norm rather than the exception.

It is much easier to develop trust in transparent than in opaque software. Reviewing transparent software is therefore easier, but also less important. When most users can understand and judge the results produced by a piece of software, even a very weak trustworthiness indicator such as popularity becomes sufficient.2

The more opaque a computation is, the more important its documentation becomes. Inadequately documented opaque software is inherently not trustworthy, because users don’t know what the software does exactly, nor what its limitations are. This is currently a much discussed issue with machine learning models, but it is not sufficiently recognized that traditional computer software can be just as opaque from a user’s point of view, if source code is the only available documentation of its behavior.

Opacity is an aspect of automated reasoning that has been treated extensively in the philosophy of science. Durán and Formanek [Durán and Formanek 2018] discuss epistemic opacity (which is not exactly the same as my pragmatic definition of opacity in this section) in the context of trust in the results of computer simulations, but much of their discussion equally applies to other uses of scientific software. They focus in particular on essential epistemic opacity, which is the degree of ignorance about an automated reasoning process that is due to the gap between the complexity of computer hardware and software and the limited cognitive capacities of a scientist. As an alternative source of trust, they propose computational reliabilism, which is trust derived from the experience that a computational procedure has produced mostly good results in a large number of applications. However, the accumulation of a sufficiently large body of validated applications is possible in practice only for mature or transparent software.

2.5 Size of the minimal execution environment

Each piece of software requires an execution environment, consisting of a computer and other pieces of software. The importance of this execution environment is not sufficiently appreciated by most researchers today, who tend to consider it a technical detail. However, it is the execution environment that defines what a piece of software actually does. The meaning of a Python script is defined by the Python interpreter. The Python interpreter is itself a piece of software written in the C language, and therefore the meaning of its source code is defined by the C compiler and by the processor which ultimately executes the binary code produced by the C compiler. As an illustration of the importance of the execution environment, it is an easy exercise to write a Python script that produces different results when run with version 2 or version 3 of the Python interpreter, exploiting the different semantics of integer division between the two versions.
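The exercise mentioned above can be spelled out in two lines; the printed value depends on which interpreter executes the script:

```python
# Dividing two integers with /: floor division in Python 2, true division in Python 3.
print(7 / 2)   # Python 2 prints 3, Python 3 prints 3.5
print(7 // 2)  # Both versions print 3: // is always floor division
```

The script's source code is identical in both cases; only the execution environment, here the interpreter version, determines its meaning.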

In addition to this semantic importance of execution environments, reviewability implies a pragmatic one: reviewers of software or its results need access to an adequate hardware and software environment in order to perform their review. Scientific computing mostly relies on commodity hardware today, with two important exceptions: supercomputers and Graphics Processing Units (GPUs). Supercomputers are rare and expensive, and thus not easily accessible to a reviewer. GPUs are evolving rapidly, making it challenging to get access to an identical configuration for reviewing. Supercomputers often include GPUs, combining both problems. Resource access issues are manageable for wide-spectrum software if they are deemed sufficiently important to warrant the cost of performing audits on non-standard hardware.

Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond. They play a key role in computational reproducibility, but also for privacy and security, which are the prime motivations for the Reproducible Builds movement3. The issues of managing software environments are now well understood, and two software management systems (Nix and Guix) implement a comprehensive solution. However, they have not yet found their way into mainstream computational science. In addition to ease of use issues that can be overcome with time, a major obstacle is that such management systems must control the complete software stack, which excludes the use of popular proprietary platforms such as Windows4 or macOS5.

Assuming that the proper management of scientific software envronments will be achieved not only in theory, but also in practice, it is the size of this environment that remains a major characteristic for reviewability. The components of the execution environment required by a piece of software are called its dependencies in software engineering. This term expresses their importance very well: every single quality expected from a software system is limited by the quality of the components that enter in its construction. For example, no software can be more mature than its dependencies, because of the risk of software collapse [Hinsen 2019]. Reviewing software therefore requires a review of its dependencies as well. This can become an obstacle for software that has hundreds or even thousands of dependencies.

2.6 Analogies in experimental and theoretical science

For developing a better understanding of the reviewability characteristics described above, it is helpful to consider analogies from the better understood experimental and theoretical techniques in scientific research. In particular, it is helpful to examine where such analogies fail due to the particularities of software.

Experimental setups are situated. They are designed and constructed for a specific experiment, described in a paper's methods section, and reviewed as part of the paper review. Most of the components used in an experimental setup are mature industrial products, ranging from commodities (cables, test tubes, etc.) to complex and specialized instruments, such as microscopes and NMR spectrometers. Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost. The use of prototype components is exceptional, and usually has the explicit purpose of testing the prototype. Some components are very transparent (e.g. cables), others are very opaque (e.g. NMR spectrometers). The equivalent of the execution environment is the physical environment of the experimental setup. Its impact on the observations tends to be well understood in the physical sciences, but less so in the life sciences, where it is a common source of reproducibility issues (e.g. [Kortzfleisch et al. 2022] or [Georgiou et al. 2022]).

The main difference from software is thus the much lower prevalence of prototype components. A more subtle difference between instruments and software is that the former are carefully designed to be robust under perturbations, whereas computation is chaotic [Hinsen 2016]. A microscope with a small defect may show a distorted image, which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily. The increasing integration of computers and software into scientific instruments may lead to experimental setups becoming less robust as well in the future.

Analogies with traditional scientific models and theories are instructive as well, where “traditional” means not relying on any form of automated reasoning. Wide-spectrum theories exist in the form of abstract reasoning frameworks, in particular mathematics. The analogue of situated software is concrete models for specific observational contexts. In between, we have general theoretical frameworks, such as evolutionary theory or quantum mechanics, and models that intentionally capture only the salient features of a system under study, pursuing understanding rather than precise prediction. Examples of the latter are the Ising model in physics or the Lotka-Volterra equations in ecology.

Abstract frameworks and general theories are the product of a long knowledge consolidation process, in which individual contributions have been reviewed, verified on countless applications, reformulated from several perspectives, and integrated into a coherent whole. This process ensures both reviewability and maturity in a way that has so far no equivalent in software development.

Opacity is an issue for theories and models as well: they can be so complex and specialized that only a handful of experts understand them. It also happens that people apply such theories and models inappropriately, for lack of sufficient understanding. However, automation via computers has so amplified the ability to deploy opaque sets of rules that it makes a qualitative difference: scientists can nowadays use software whose precise function they could not understand even if they dedicated the rest of their career to it.

The execution environment for theories and models is the people who work with them. Their habits, tacit assumptions, and metaphysical beliefs play a similar role to hardware and software dependencies in computation, and they are indeed also a common cause of mistakes and misunderstandings.

3 Improving the reviewability of automated reasoning systems

The analysis presented in the previous section can by itself improve the basis for trust in automated reasoning, by providing a vocabulary for discussing reviewability issues. Ensuring that both developers and users of scientific software are aware of where the software is located on the different scales I have described makes much of today’s tacit knowledge about scientific software explicit, avoiding misplaced expectations.

However, scientists can also work towards improving their computational practices in view of more reviewable results. These improvements include both new reviewing processes, supported by institutions that remain to be created, and new software engineering practices that take into account the specific roles of software in science, which differ in some important respects from the needs of the software industry. The four measures I will explain in the following are summarized in Fig. 2.

Figure 2. Four measures that can be taken to make scientific software more trustworthy: (1) review the reviewable, (2) emphasize situated and convivial software, (3) make scientific software explainable, and (4) use Digital Scientific Notations.

3.1 Review the reviewable

As my analysis has shown, some types of scientific software are reviewable, but not reviewed today. Several scientific journals encourage authors to submit code along with their articles, but only a small number of very specialized journals (e.g., Computo, the Journal of Digital History, or ReScience C) actually review the submitted code, which tends to be highly situated. Other journals, first and foremost the Journal of Open Source Software, review software according to generally applicable criteria of usability and software engineering practices, but do not expect reviewers to judge the correctness of the software, nor the accuracy or completeness of its documentation. This would indeed be unrealistic in the standard journal reviewing process, which asks a small number of individual researchers to evaluate, as volunteers and on short deadlines, submissions that are often only roughly in their field of expertise and represent the work of large teams over many years.

The first category of software that is reviewable but not yet reviewed is mature wide-spectrum software. Reviewing could take the form of regular audits, performed by experts working for an institution dedicated to this task. In view of the wide use of the software by non-experts in its domain, the audit should also inspect the software’s documentation, which needs to be up to date and explain the software’s functionality with all the detail that a user must understand. Specifications would be particularly valuable in this scenario, as the main interface between developers, users, and auditing experts. For opaque software, formal specifications could even be made a requirement, in the interest of an efficient audit. The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

The second category of reviewable software contains situated software, which can and should be reviewed together with the other outputs of a research project. For small projects, in terms of the number of co-authors and the disciplinary spread, situated software could be reviewed as part of today’s peer review process, managed by scientific journals. The experience of pioneering journals in this activity could be the basis for elaborating more widely applied reviewing guidelines. For larger or multidisciplinary projects, the main issue is that today’s peer review process is not adequate at all, even in the (hypothetical) complete absence of software. Reviewing research performed by a multidisciplinary team requires another multidisciplinary team, rather than a few individuals reviewing independently. The integration of situated software into the process could provide the occasion for a more general revision of the peer review process.

3.2 Science vs. the software industry

In the first decades of computing technology, scientific computing was one of its main application domains, alongside elaborate bookkeeping tasks in commerce, finance, and government. Many computers, operating systems, and compilers were designed specifically for the needs of scientists. Today, scientists use mostly commodity hardware. Even supercomputers are constructed to a large degree from high-grade commodity components. Much infrastructure software, such as operating systems and compilers, is also commodity software developed primarily for other application domains.

From the perspective of development costs, this evolution makes economic sense. However, as with any shift towards fewer but more general products serving a wider client base, the needs of the larger client groups take priority over those of the smaller ones. Unfortunately for science, it is today a relatively small application domain for software technology.

In terms of my analysis of reviewability in section 2, the software industry has a strong focus on proprietary wide-spectrum software, with a clear distinction between developers and users. Opacity for users is not seen as a problem, and is sometimes even considered advantageous if it also creates a barrier to reverse-engineering of the software by competitors. Maturity is an expensive characteristic that only a few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for. In contrast, novelty is an important selling argument in many profitable application domains, leading to attitudes such as “move fast and break things” (the long-time motto of Facebook founder Mark Zuckerberg), and thus favoring experimental software.

As a consequence of the enormous growth of non-scientific compared to scientific software, today’s dominant software development tools and software engineering practices largely ignore situated and convivial software, the impact of dependencies, and the scientific method’s requirement for transparency. However, it can be expected that the ongoing establishment of Research Software Engineers as a specialization at the interface between scientific research and software engineering will lead to development practices that are better aligned with the specific needs of science. It is such practices that I will propose in the following sections.

3.3 Emphasize situated and convivial software

As I have explained in section 2.1, many important scientific software packages are domain-specific tools and libraries, which have neither the large user base of wide-spectrum software that justifies external audits, nor the narrow focus of situated software that allows for a low-effort one-time review by domain experts. Developing suitable intermediate processes and institutions for reviewing such software is perhaps possible, but I consider it scientifically more appropriate to restructure such software into a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer. However, this implies assigning a lower priority to reusability, in conflict with both software engineering traditions and more recent initiatives to apply the FAIR principles to software [Barker et al. 2022].

In such a scenario, a domain library becomes a collection of source code files that implement core models and methods, plus ample documentation of both the methods and implementation techniques. The well-known book “Numerical Recipes” [Press et al. 2007] is a good example of this approach. Users make a copy of the source code files relevant for their work, adapt them to the particularities of their applications, and make them an integral part of their own project. In FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules. Yet another perspective is giving priority to the software’s role as a representation of scientific knowledge over its role as a tool.

The evolution of software in such a universe is very different from what we see today. There is no official repository, no development timeline, no releases. There is only a network of many variants of some code, connected by forking relations. Centralized maintenance as we know it today does not exist. Instead, the community of scientists using the code improves it in small steps, with each team taking over improvements from other forks if they consider them advantageous. Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.

Since this approach differs radically from anything that has been tried in practice so far, it is premature to discuss its advantages and downsides. Only practical experience can show to what extent pre-digital and pre-industrial forms of collaborative knowledge work can be adapted to automated reasoning. Nevertheless, I will indulge in some speculation on this topic, to give an idea of what we can fear or hope for.

On the benefit side, the code supporting a specific research project becomes much smaller and more understandable, mitigating opacity. Its execution environment is smaller as well, and entirely composed of mature wide-spectrum software. Reviewability is therefore much improved. Moreover, users are encouraged to engage more intensely with the software, ensuring a better understanding of what it actually does. The lower entry barrier to appropriating the code makes inspection and modification of the code accessible to a wider range of researchers, increasing inclusiveness and epistemic diversity.

The main loss I expect is in the efficiency of implementing and deploying new ideas. A strongly coordinated development team whose members specialize in specific tasks is likely to advance more quickly in a well-defined direction. This can be an obstacle in particular for software whose complexity is dominated by technical rather than scientific aspects, e.g. in high-performance computing or large-scale machine learning applications.

The main obstacle to trying out this approach in practice is the lack of tooling support. Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team. This makes it even more important to base such convivial software ecosystems on a foundation of mature software components, in order to reduce maintenance work necessitated by software collapse [Hinsen 2019].

3.4 Make scientific software explainable

Opacity is a major obstacle to the reviewability of software and results obtained with the help of software, as I have explained in section 2.4. Depending on one’s precise definition of opacity, it may be impossible to reduce it. Pragmatically, however, opacity can be mitigated by explaining what the software does, and providing tools that allow a scientist to inspect intermediate or final results of a computation.

The popularity of computational notebooks, which can be seen as scripts with attached explanations and results, shows that scientists are indeed keen on making their work less opaque. But notebooks are limited to the most situated top layer of a scientific software stack. Code cells in notebooks refer to library code that can be arbitrarily opaque, difficult to access, and to which no explanations can be attached.

An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing. The approach is based on three principles. The first one is the same as for computational notebooks: the integration of code with explanatory narratives that also contain example code and computed results. Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023], the development environment built to explore these ideas, allows multiple narratives to reference a shared codebase of arbitrary structure and complexity. The second principle is the generous use of examples, which serve both as an illustration for the correct use of the code and as test cases. In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks. The third principle is what the authors call moldable inspectors: situated views on data that present the data from a domain perspective rather than in terms of its implementation. These three techniques can be used by software developers to facilitate the exploration of their systems by others, but they also support the development process itself by creating new feedback loops.
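The “generous use of examples” principle has a lightweight analogue outside Glamorous Toolkit: in Python, examples embedded in documentation can be executed verbatim as tests via the standard doctest module. The function below is purely illustrative, a sketch of the principle rather than anything from the cited work.

```python
# Sketch: examples that double as test cases, via Python's doctest
# module -- a lightweight analogue of the "examples as tests" principle.

def centered_moving_average(values, window=3):
    """Smooth a sequence with a moving average.

    The examples below document the intended behaviour and are
    executed verbatim as tests:

    >>> centered_moving_average([1, 2, 3, 4, 5])
    [2.0, 3.0, 4.0]
    >>> centered_moving_average([10, 10, 10])
    [10.0]
    """
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs every docstring example as a test
```

Because the examples live next to the code and fail loudly when behaviour drifts, they serve readers and reviewers at the same time, exactly the dual role described above.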

3.5 Use Digital Scientific Notations

As I have briefly mentioned in the introduction, specifications are contracts between software developers and software users that describe the expected behaviour of the software. Formal specifications are specifications written in a formal language, i.e. a language amenable to automated processing. There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification. The use of these tools is, for now, reserved for software that is critical for safety or security, because of the high cost of developing specifications and using them to verify implementations.

Technically, formal specifications are constraints on algorithms and programs, in much the same way as mathematical equations are constraints on mathematical functions [Hinsen 2023]. Such constraints are often much simpler than the algorithms they define. As an example, consider the task of sorting a list. The (informal) specification of this task is: produce a new list whose elements are (1) the same as those of the input list and (2) sorted. A formal version requires some additional details, in particular a definition of what it means for two lists to have “the same” elements, given that elements can appear more than once in a list. There are many possible algorithms conforming to this specification, including well-known sorting algorithms such as quicksort or bubble sort. All of them are much more elaborate than the specification of the result they produce. They are also rather opaque. The specification, on the other hand, is immediately understandable. Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023].
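The sorting specification just described can itself be made executable: a proposed result is correct if and only if it has the same elements as the input, counted with multiplicity, and is in non-decreasing order. The following sketch expresses this check in Python; note that it constrains the result without saying anything about the algorithm that produced it.

```python
# Sketch: the sorting specification as an executable check.
from collections import Counter

def satisfies_sort_spec(inputs, outputs):
    """True iff `outputs` is a correct sort of `inputs`:
    (1) same elements, counted with multiplicity, and
    (2) in non-decreasing order."""
    same_elements = Counter(inputs) == Counter(outputs)  # multiset equality
    is_sorted = all(a <= b for a, b in zip(outputs, outputs[1:]))
    return same_elements and is_sorted

# Any conforming algorithm passes, quicksort and bubble sort alike;
# the specification is indifferent to how the result was produced.
assert satisfies_sort_spec([3, 1, 2, 1], sorted([3, 1, 2, 1]))
assert not satisfies_sort_spec([3, 1, 2], [1, 2])      # an element was lost
assert not satisfies_sort_spec([3, 1, 2], [2, 1, 3])   # not ordered
```

The check also makes the subtlety mentioned above concrete: multiset equality (via Counter) is one precise way to formalize what “the same elements” means when elements can repeat.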

The software engineering contexts in which formal specifications are used today are very different from the potential applications in scientific computing that I outline here. In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification. In scientific research, software evolves in parallel with the scientific knowledge that it encodes or helps to produce. A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction. Moreover, most specifications are likely to be incomplete, leaving out aspects of software behavior that are irrelevant from the point of view of science (e.g. resource management or technical interfaces such as Web APIs), but also aspects that are still under exploration and thus not yet formalized. For these reasons, I prefer the term Digital Scientific Notation [Hinsen 2018b], which better expresses the role of formal specifications in this context.

Digital Scientific Notations can take many forms. They do not have to resemble programming languages, nor the specification languages used in software engineering. My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features, such as scoping rules, that exist neither in natural language nor in mathematical notation. For other domains, graphical notations are likely to be more appropriate. These notations, the tooling that integrates them with software, and the scientific practices for working with them, all remain to be developed. The main expected benefit is conviviality, with computational tools adapted to the needs of researchers rather than researchers having to adapt to software tools designed for completely different application areas.

4 Conclusion

My principal goal with this work is to encourage scientists and research software engineers to reflect about their computational practices. Why, and to what degree, do you trust your own computations? How reliable do they have to be to support the conclusions you draw from their results? Why, and to what degree, do you trust the computations in the papers you read and cite? Do you consider their reliability sufficient to support the conclusions made?

These questions are abstract. Answering them requires considering the concrete level of the specific software used in a computation. The five categories I have discussed in section 2 should help with this step, even though it may be difficult at first to evaluate the software you use on some of the scales. Situated software is easy to recognize. The size of a software environment is not difficult to measure, but it requires appropriate tools and training in their use. Likewise, the evaluation of maturity is not difficult, but requires some effort, in particular an examination of a software project’s history. Conviviality is hard to diagnose, but rare anyway. This reduces the examination to Open Source vs. proprietary, which is straightforward.

The transparency vs. opacity scale deserves a more detailed discussion. Most experienced computational scientists make sure to examine both intermediate and final results for plausibility, making use of known properties such as positivity or order of magnitude. But plausibility is a fuzzy concept. Software is transparent only if users can check results for correctness, not mere plausibility. The strategies I proposed (sections 3.3, 3.4 and 3.5) have the goal of making such correctness checks easier. If plausibility is all we can check for, then the software is opaque, and its users are faced with a dilemma: if their results are neither obviously correct nor obviously wrong, are they entitled to consider them good enough? In practice they do, because the only realistic alternative would be to stop using computers. We even tend to consider popularity, which roughly means “this software is used by many people who didn’t find anything obviously wrong with it”, as an indicator of trustworthiness. Soergel [Soergel 2015] and Thimbleby [Thimbleby 2023] consider this “trust by default” misplaced, given what software engineering research tells us about the frequency of mistakes. Examples from the reproducibility crisis support the view that scientists tend to overestimate the reliability of their work in the absence of clear signs of problems.
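Plausibility checks of the kind described, positivity and order of magnitude, can at least be made routine rather than ad hoc, by asserting known properties of intermediate results in the analysis code itself. The sketch below illustrates the idea; the quantities, names, and numerical ranges are purely hypothetical.

```python
# Sketch: routine plausibility checks on computed results, asserting
# properties known on physical grounds. All names and ranges here are
# illustrative only.
import math

def check_plausibility(temperatures_K, total_energy):
    """Raise AssertionError if results violate known properties."""
    assert all(t > 0 for t in temperatures_K), \
        "absolute temperatures must be positive"
    assert all(math.isfinite(t) for t in temperatures_K), \
        "non-finite value: numerical instability somewhere upstream"
    # Order-of-magnitude check, plausible for a room-temperature
    # simulation; violations hint at unit or indexing errors.
    assert 100 < max(temperatures_K) < 1000, \
        "temperature outside the plausible range for this system"
    assert total_energy < 0, \
        "a bound system should have negative total energy"

check_plausibility([293.2, 298.6, 301.4], total_energy=-1.52e4)
```

As argued above, such checks catch only gross errors: results can pass every plausibility test and still be wrong, which is precisely why plausibility falls short of the transparency needed for correctness checking.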

Computational reliabilism, proposed by Durán and Formanek [Durán and Formanek 2018], offers a way out of this dilemma: it says that we can justify trust by default if we have a large body of experience reports about our software, of which a majority is favorable. Independent reviews would be particularly valuable experience reports, since their authors specifically look for potential problems. However, the large body of experience required for the reliabilism argument can be gathered only for mature software. And favorable experiences can only serve as evidence for reliability if users can actually judge the quality of the results they obtain, meaning that the software must be transparent.

The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.

The remaining issue is experimental opaque software packages which, as I explained in section 2.2, are numerous in science. Evolving them towards maturity requires time, a large user base, and high software engineering standards. Mitigating opacity, e.g. by adopting the strategies I have proposed, requires a significant effort. Reliability comes at a cost. Making good choices requires a cost-benefit analysis in the context of a specific research project. The arguments for the choice should be published as part of any research report, so that readers can assess the reliability of the reported findings.

The difficulty of reviewing scientific software also illustrates the deficiencies of the current digital infrastructure for science.6 The design, implementation, and maintenance of such an infrastructure, encompassing hardware, software, and best practices, has been neglected by research institutions all around the world, in spite of an overtly expressed enthusiasm about the scientific progress made possible by digital technology. The situation is improving for research data, for which appropriate repositories and archives are becoming available. For software, the task is more complex, and hindered by the contagious neophilia (“tech churn”) of the software industry. Scientists, research software engineers, research institutions, and funding agencies must recognize the importance of mature and reliable infrastructure software, which requires long-term funding and inclusive governance.

Notes

1This is of course true for software in general, see e.g. the discussion in [Shaw 2022:22].

2A famous quote in software engineering, often referred to as “Linus’ law”, states that “given enough eyeballs, all bugs are shallow”. However, this can only work if the many eyeballs are sufficiently trained to spot problems, meaning that “mere” users of opaque software don’t qualify.

3 Ken Thompson’s Turing Award Lecture “Reflections on Trusting Trust” [Thompson 1984] is an early and very readable discussion of the security implications of execution environments.

4 Windows is a trademark of the Microsoft group of companies.

5 macOS is a trademark of Apple Inc., registered in the U.S. and other countries and regions.

6 For more examples, see [Saunders 2022].

References

Barker, M., Chue Hong, N.P., Katz, D.S., et al. 2022. Introducing the FAIR Principles for research software. Scientific Data 9, 1, 622.

Chapter 1: For Want of a Printer. 2002. In: Free as in freedom: Richard Stallman’s crusade for free software. O’Reilly, Sebastopol, Calif.: Farnham.

Courtès, L. 2021. What’s in a package. Guix-HPC blog. https://hpc.guix.info/blog/2021/09/whats-in-a-package/.

Durán, J.M. and Formanek, N. 2018. Grounds for Trust: Essential Epistemic Opacity and Computational Reliabilism. Minds and Machines 28, 4, 645–666.

feenk.com. 2023. Glamorous Toolkit. feenk gmbh.

Georgiou, P., Zanos, P., Mou, T.-C.M., et al. 2022. Experimenters’ sex modulates mouse behaviors and neural responses to ketamine via corticotropin releasing factor. Nature Neuroscience 25, 9, 1191–1200.

Hettrick, S. 2016. Research Software Sustainability. Knowledge Exchange.

Hinsen, K. 2016. The Power to Create Chaos. Computing in Science & Engineering 18, 4, 75–79.

Hinsen, K. 2018a. Reusable Versus Re-editable Code. Computing in Science & Engineering 20, 3, 78–83.

Hinsen, K. 2018b. Verifiability in computer-aided research: The role of digital scientific notations at the human-computer interface. PeerJ Computer Science 4, e158.

Hinsen, K. 2019. Dealing With Software Collapse. Computing in Science & Engineering 21, 3, 104–108.

Hinsen, K. 2023. The nature of computational models. Computing in Science & Engineering 25, 1, 61–66.

Hinsen, K. 2024. Leibniz – a Digital Scientific Notation. https://leibniz.khinsen.net/.

Hocquet, A. and Wieber, F. 2017. “Only the Initiates Will Have the Secrets Revealed”: Computational Chemists and the Openness of Scientific Software. IEEE Annals of the History of Computing 39, 4, 40–58.

Illich, I. 1973. Tools for conviviality. Calders and Boyars, London.

Jumper, J., Evans, R., Pritzel, A., et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873, 583–589.

Kell, S. 2020. Convivial design heuristics for software systems. Conference Companion of the 4th International Conference on Art, Science, and Engineering of Programming, ACM, 144–148.

Ko, A.J., Abraham, R., Beckwith, L., et al. 2011. The state of the art in end-user software engineering. ACM Computing Surveys 43, 3, 21:1–21:44.

Kortzfleisch, V.T. von, Ambrée, O., Karp, N.A., et al. 2022. Do multiple experimenters improve the reproducibility of animal studies? PLOS Biology 20, 5, e3001564.

Leonelli, S. 2022. Open Science and Epistemic Diversity: Friends or Foes? Philosophy of Science 89, 5, 991–1001.

Merali, Z. 2010. Computational science: …Error. Nature 467, 7317, 775–777.

Nardi, B.A. 1993. A small matter of programming: Perspectives on end user computing. MIT Press, Cambridge, MA.

Nielsen, M. 2023. How is AI impacting science? https://michaelnotebook.com/mc2023/.

Nierstrasz, O. and Girba, T. 2022. Making Systems Explainable. 2022 Working Conference on Software Visualization (VISSOFT), IEEE, 1–4.

Parasuraman, R. and Riley, V. 1997. Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors 39, 2, 230–253.

Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. 2007. Numerical recipes: The art of scientific computing. Cambridge University Press, Cambridge, UK; New York.

Saunders, J.L. 2022. Decentralized Infrastructure for (Neuro)science. https://arxiv.org/abs/2209.07493.

Shaw, M. 2022. Myths and mythconceptions: What does it mean to be a programming language, anyhow? Proceedings of the ACM on Programming Languages 4, HOPL, 1–44.

Shirky, C. 2004. Situated Software. https://web.archive.org/web/20040411202042/http://www.shirky.com/writings/situated_software.html.

Soergel, D.A.W. 2015. Rampant software errors may undermine scientific results.

Thimbleby, H. 2023. Improving Science That Uses Code. The Computer Journal, bxad067.

Thompson, K. 1984. Reflections on trusting trust. Communications of the ACM 27, 8, 761–763.

Turkle, S. 2009. Simulation and its discontents. The MIT Press, Cambridge, Massachusetts.

Wilkinson, M.D., Dumontier, M., Aalbersberg, Ij.J., et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018.

Editors

Ludo Waltman
Editor-in-Chief

Adrian Barnett
Handling Editor

Editorial assessment

by Adrian Barnett

DOI: 10.70744/MetaROR.39.1.ea

This summary article does not present new data or experiments but instead takes a broad look at automated reasoning and software. Reviewer #1 thought the article needed much more detail, including citations, examples, screenshots and figures. They were concerned about strong generalisations that were lacking evidence and have provided places where they wanted these details. Reviewer #2 considers the differences between reviewability and the practicalities of reviewing everything, and how being easily able to build on other software acts as a kind of reproducibility. In my own editorial review, I generally enjoyed reading the paper and it prompted some interesting thoughts on trade-offs with standardisation and the level of detail shown to users for statistical code.

Recommendations from the editor

As a statistician, I am in strong agreement on the widespread inappropriate use of statistical inference (page 2) and the importance of software. I also strongly agree that “independent critical inspection [is] particularly challenging” (page 3). I also strongly agree that “The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission”, as this is everyone’s problem and nobody’s problem.

I also agree that automation has encouraged standardisation and I have personally supported standardisation because some practices are so bad that many authors need to be “standardised”. However, I’ve also felt frustration at the sometimes fussy requirements when uploading R packages to CRAN (https://cran.r-project.org/). Similarly, some blanket changes from CRAN seem pedantic. There’s likely a balance between reducing poor practice and becoming too prescriptive.

In terms of transparency (section 2.4), I did think about the "verbose = TRUE" option that I sometimes see in R. I tend to turn this on, as it's good to see more of the workings, but perhaps the default is off? I did look at some packages using the Google search: "verbose site:cran.r-project.org/web/packages". I was also reminded of the difference between Bayesian and frequentist statistical modelling. Frequentist modelling often uses maximum likelihood to create parameter estimates, and usually runs quickly. In contrast, Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum (usually a sign of a badly specified model), whereas maximum likelihood simply finds a peak. Frustratingly, I often get more pushback from reviewers when using Bayesian methods, whereas in my opinion it should be the other way around, as the Bayesian estimates have shown far more of the inner workings.
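The contrast drawn above between a lone maximum-likelihood point estimate and an MCMC chain can be sketched with a toy example (my own construction, not from the paper or the review): a likelihood with two peaks, where naive hill climbing reports only the peak nearest its starting point, while a Metropolis chain leaves a trace that exposes both modes.

```python
import numpy as np

# Toy log-likelihood with two peaks: a taller mode near x = 3 and a
# smaller local mode near x = -2. (Illustrative only.)
def log_lik(x):
    return np.log(0.7 * np.exp(-0.5 * (x - 3) ** 2)
                  + 0.3 * np.exp(-0.5 * (x + 2) ** 2))

# "Maximum likelihood" via naive hill climbing from a bad start:
# it settles on the nearest peak and reports nothing else.
x = -4.0
step = 0.01
for _ in range(1000):
    if log_lik(x + step) > log_lik(x):
        x += step
    elif log_lik(x - step) > log_lik(x):
        x -= step
print(f"hill climbing from -4 ends near x = {x:.2f}")  # local peak near -2

# A Metropolis random-walk sampler, by contrast, leaves a chain whose
# histogram reveals both modes, flagging the multimodal likelihood.
rng = np.random.default_rng(1)
chain, cur = [], 0.0
for _ in range(20000):
    prop = cur + rng.normal(0.0, 2.0)
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(cur):
        cur = prop
    chain.append(cur)
chain = np.array(chain[2000:])
print("share of chain near each mode:",
      float(np.mean(chain > 0.5)), float(np.mean(chain < 0.5)))
```

The point is not the algorithms themselves but what they show the user: the optimiser's answer hides the second mode entirely, while the chain's wanderings make the model's ambiguity visible.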

Some reflection on the growing use of AI to write software may be worthwhile. Presumably this could be more standardised, but there are other concerns. Using automation to check code could also be worthwhile.

For section 3, I thought that more sharing of code would mean “more eyeballs”, but the sharing needs to be done in a FAIR way.

I wondered if highly used software should get more scrutiny. Peer review is a scarce resource, so it is likely better directed towards high-use software. Andrew Gelman recently put forward a similar argument for checking published papers when they reach 250 citations: https://statmodeling.stat.columbia.edu/2025/02/26/pp/.

I agreed with the need for effort (page 19) and wondered if this paper could call for more effort.

Minor comments:

  • typo “asses” on page 7.

  • “supercomputers are rare”, should this be “relatively rare”, or am I speaking from a privileged university where I’ve always had access to supercomputers?

  • I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/).

  • Can badges on GitHub about downloads and maturity help (page 7)? Although far from all software is on GitHub.

Competing interests: None.

Peer review 1

Anonymous User

DOI: 10.70744/MetaROR.39.1.rv1

Thank you for submitting this paper. I think the paper requires substantial, major revisions to be published. Throughout the paper I noted many instances where references or examples would help make the intent clear. I also think the message of the paper would benefit from several figures to demonstrate workflows or ideas. The figures presented are essentially tables, and I think the message could be made clearer for the reader if they were presented as flow charts or at least with clear numbering to hook the ideas to the reader – e.g., Figures 1 & 2 would benefit from having numbers on the key ideas.

The paper is lacking many instances of citation, and at times reads as though it is an essay delivering an opinion. I’m not sure if this is the type of article that the journal would like, but two examples of sentences missing citations are:

  1. “Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.” (Introduction, page 2)

  2. “A large number of examples cited in this context involves faulty software or inappropriate use of software” (Introduction, page 3)

Two examples of sentences missing examples are:

  1. Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete (in Mature vs. experimental software, page 7). Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”

  2. There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification.

Overall, the paper introduces many new concepts, and I think it would greatly benefit from being made shorter and more concise, with some key figures added for the reader to refer back to when absorbing these new ideas. The paper is well written, and it is clear the author is a great writer and has put a lot of thought into the ideas. However, it is my opinion that because these ideas are so big and require so much unpacking, they are also harder to understand. The reader would benefit from more guidance to return to when trying to understand these ideas.

I hope this review is helpful to the author.

Review comments

Introduction

Highlight [page 2]: Ever since the beginnings of organized science in the 17th century, researchers are expected to put all facts supporting their conclusions on the table, and allow their peers to inspect them for accuracy, pertinence, completeness, and bias. Since the 1950s, critical inspection has become an integral part of the publication process in the form of peer review, which is still widely regarded as a key criterion for trustworthy results.

  • and Note [page 2]: Both of these statements feel like they should have some peer review, or a reference, I believe. What were the beginnings of organised science in the 1600s? Why since the 1950s? Why not sooner? What happened then?

Highlight [page 2]: Over the last two decades, an unexpectedly large number of peer-reviewed findings across many scientific disciplines have been found to be irreproducible upon closer inspection.

Highlight [page 2]: In the quantitative sciences, almost all of today’s research critically relies on computational techniques, even when they are not the primary tool for investigation – and Note [page 2]: Again, it does feel like it would be great to acknowledge research in this space.

Highlight [page 2]: But then, scientists mostly abandoned doubting.

  • and Note [page 2]: This feels like an essay; show me the evidence that allows you to say something like this.

Highlight [page 2]: Automation bias

  • and Note [page 2]: What is automation bias?

Highlight [page 3]: A large number of examples cited in this context involves faulty software or inappropriate use of software

  • and Note [page 3]: Can you provide some examples of the examples cited that you are referring to here?

Highlight [page 3]: A particularly frequent issue is the inappropriate use of statistical inference techniques.

  • and Note [page 3]: Please provide citations to these frequent issues.

Highlight [page 3]: The Open Science movement has made a first step towards dealing with automated reasoning in insisting on the necessity to publish scientific software, and ideally making the full development process transparent by the adoption of Open Source practices – and Note [page 3]: Could you provide an example of one of these Open Science movements?

Highlight [page 3]: Almost no scientific software is subjected to independent review today.

  • and Note [page 3]: How can you justify this claim?

Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

Highlight [page 3]: as I will show

  • and Note [page 3]: How will you show this?

Highlight [page 3]: is as much a source of mistakes as defects in the software itself

  • and Note [page 3]: Again, this feels like a statement of fact without evidence or citation.

Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

  • and Note [page 3]: The same can be said of assumptions for equations and mathematics – the problem here is dealing with abstraction of complexity and the potential unintended consequences.

Highlight [page 4]: the preservation of epistemic diversity

  • and Note [page 4]: Please define epistemic diversity

Reviewability of automated reasoning systems

Highlight [page 5]: The five dimensions of scientific software that influence its reviewability.

  • and Note [page 5]: It might be clearer to number these in the figure, and I might also suggest changing “convivial” – it’s a pretty unusual word.

Wide-spectrum vs. situated software

Highlight [page 6]: In between these extremes, we have in particular domain libraries and tools, which play a very important role in computational science, i.e. in studies where computational techniques are the principal means of investigation

  • and Note [page 6]: I’m not very clear on this example – can you provide an example of a “domain library” or “domain tool” ?

Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

  • and Note [page 6]: I’m not sure I agree it is always smaller and simpler – the custom code for a new method could be incredibly complicated.

Highlight [page 6]: Domain tools and libraries

  • and Note [page 6]: Can you give an example of this?

Mature vs. experimental software

Highlight [page 7]: Experimental software evolves at a much faster pace than mature software, and documentation is rarely up to date or complete

  • and Note [page 7]: Could the author provide more examples of what “experimental software” is? There is also consistent use of universal terms like “…is rarely up to date or complete”, which would be better phrased as “is often not up to date or complete”

Highlight [page 7]: An extreme case of experimental software is machine learning models that are constantly updated with new training data.

  • and Note [page 7]: Such as…

Highlight [page 7]: interlocutor

  • and Note [page 7]: suggest “middle man” or “mediator”; “interlocutor” isn’t a very common word

Highlight [page 7]: A grey zone

  • and Note [page 7]: I think it would be helpful to discuss black and white zones before this.

Highlight [page 7]: The libraries of the scientific Python ecosystem

  • and Note [page 7]: Do you mean SciPy? https://scipy.org/. Can you provide an example of the frequent changes that break backward compatibility?

Highlight [page 7]: too late that some of their critical dependencies are not as mature as they seemed to be

  • and Note [page 7]: Again, can you provide some evidence for this?

Highlight [page 7]: The main difference in practice is the widespread use of experimental software by unsuspecting scientists who believe it to be mature, whereas users of instrument prototypes are usually well aware of the experimental status of their equipment.

  • and Note [page 7]: Again this feels like an assertion without evidence. Is this an essay, or a research paper?

Convivial vs. proprietary software

Highlight [page 8]: Convivial software [Kell 2020], named in reference to Ivan Illich’s book “Tools for conviviality” [Illich 1973], is software that aims at augmenting its users’ agency over their computation

  • and Note [page 8]: It would be really helpful if the author would define the word, “convivial” here. It would also be very useful if they went on to give an example of what they meant by: “…software that aims at augmenting its users’ agency over their computation.” How does it augment the users’ agency?

Highlight [page 8]: Shaw recently proposed the less pejorative term vernacular developers [Shaw 2022]

  • and Note [page 8]: Could you provide an example of what makes “vernacular developers” different, or just what they mean by this term?

Highlight [page 8]: which Illich has described in detail

  • and Note [page 8]: Should this have a citation to Illich then in this sentence?

Highlight [page 8]: what has happened with computing technology for the general public

  • and Note [page 8]: Can you give an example of this? Do you mean the rise of Apple and Windows? MS Word? Facebook? A couple of examples would be really useful to make this point clear.

Highlight [page 8]: tech corporations

  • and Note [page 8]: Suggest “tech corporations” be “technology corporations”.

Highlight [page 8]: Some research communities have fallen into this trap as well, by adopting proprietary tools such as MATLAB as a foundation for their computational tools and models.

  • and Note [page 8]: Can you provide an example of the alternative here, what would be the way to avoid this trap – use software such as Octave, or?

Highlight [page 8]: Historically, the Free Software movement was born in a universe of convivial technology.

  • and Note [page 8]: If it is historic, can you please provide a reference to this?

Highlight [page 8]: most of the software they produced and used was placed in the public domain

  • and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

Highlight [page 8]: as they saw legal constraints as the main obstacle to preserving conviviality

  • and Note [page 8]: Again, these are conjectures that are lacking a reference or example, can you provide some examples of references of this?

Highlight [page 9]: Software complexity has led to a creeping loss of user agency, to the point that even building and installing Open Source software from its source code is often no longer accessible to non-experts, making them dependent not only on the development communities, but also on packaging experts. An experience report on building the popular machine learning library PyTorch from source code nicely illustrates this point [Courtès 2021].

  • and Note [page 9]: Can you summarise what makes it difficult to install Open Source software? Again, this statement feels like it is making a strong generalisation without clear evidence to support it. The article by Courtès (https://hpc.guix.info/blog/2021/09/whats-in-a-package/) actually notes that it’s straightforward to install PyTorch via pip, but that using an alternative package manager causes difficulty. The point you are making here seems to be that building and installing most open source software is almost prohibitive, but I don’t think you’ve given strong evidence for this claim, and I don’t understand how this builds into your overall argument.

Highlight [page 9]: It survives mainly in communities whose technology has its roots in the 1980s, such as programming systems inheriting from Smalltalk (e.g. Squeak, Pharo, and Cuis), or the programmable text editor GNU Emacs.

  • and Note [page 9]: Can you give an example of how it survives in these communities?

Highlight [page 9]: FLOSS has been rapidly gaining in popularity, and receives strong support from the Open Science movement

  • and Note [page 9]: Can you provide some evidence to back this statement up?

Highlight [page 9]: the traditional values of scientific research.

  • and Note [page 9]: Can you state what you mean by “traditional values of scientific research”

Highlight [page 9]: always been convivial

  • and Note [page 9]: Can you provide a further explanation of what makes them convivial?

Transparent vs. opaque software

Highlight [page 9]: Transparent software

  • and Note [page 9]: It might be useful to explain a distinction between transparent and open software – or to perhaps open with a statement for why we are talking about transparent and opaque software.

Highlight [page 9]: Large language models are an extreme example.

  • and Note [page 9]: Based on your definition of transparent software – every action produces a visible result. If I type something into an LLM and get an immediate and visible result, how is this different? It is possible you are stating that the behaviour is able to be easily interpreted, or perhaps the behaviour is easy to understand?

Highlight [page 10]: Even highly interactive software, for example in data analysis, performs nonobvious computations, yielding output that an experienced user can perhaps judge for plausibility, but not for correctness.

  • and Note [page 10]: Could you give a small example of this?

Highlight [page 10]: It is much easier to develop trust in transparent than in opaque software.

  • and Note [page 10]: Can you state why it is easier to develop this trust?

Highlight [page 10]: but also less important

  • and Note [page 10]: Can you state why it is less important?

Highlight [page 10]: even a very weak trustworthiness indicator such as popularity becomes sufficient

  • and Note [page 10]: becomes sufficient for what? Reviewing? Why does it become sufficient?

Highlight [page 10]: This is currently a much discussed issue with machine learning models,

  • and Note [page 10]: Given it is currently much discussed, could you link to at least 2 research articles discussing this point?

Highlight [page 10]: treated extensively in the philosophy of science.

  • and Note [page 10]: Given that is has been treated extensively, can you please provide some key references after this statement? You do go on to cite one paper, but it would be helpful to mention at least a few key articles.

Size of the minimal execution environment

Highlight [page 11]: The importance of this execution environment is not sufficiently appreciated by most researchers today, who tend to consider it a technical detail

  • and Note [page 11]: This statement is a bit of a sweeping generalisation – why is it not sufficiently appreciated? What evidence do you have of this?

Highlight [page 11]: Software environments have only recently been recognized as highly relevant for automated reasoning in science and beyond

  • and Note [page 11]: Where have they been only recently recognised?

Highlight [page 11]: However, they have not yet found their way into mainstream computational science.

  • and Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?

Analogies in experimental and theoretical science

Highlight [page 12]: Non-industrial components are occasionally made for special needs, but this is discouraged by their high manufacturing cost

  • and Note [page 12]: Can you provide an example of this?

Highlight [page 12]: cables

  • and Note [page 12]: What do you mean by a cable? As in a computer cable? An electricity cable?

Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

  • and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

Highlight [page 13]: where “traditional” means not relying on any form of automated reasoning.

  • and Note [page 13]: Can you give an example of what a “traditional” scientific model or theory is?

Improving the reviewability of automated reasoning systems

Highlight [page 14]: Figure 2: Four measures that can be taken to make scientific software more trustworthy.

  • and Note [page 14]: Could the author perhaps instead call these “four measures” or perhaps give them a better name, and number them?

Review the reviewable

Highlight [page 14]: mature wide-spectrum software

  • and Note [page 14]: Can you give an example of what “mature wide-spectrum software” is?

Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

Science vs. the software industry

Highlight [page 15]: Many computers, operating systems, and compilers were designed specifically for the needs of scientists.

  • and Note [page 15]: Could you give an example of this? E.g., FORTRAN? COBOL?

Highlight [page 15]: Today, scientists use mostly commodity hardware

  • and Note [page 15]: Can you explain what you mean by “commodity hardware”, and give an example.

Highlight [page 15]: even considered advantageous if it also creates a barrier to reverse- engineering of the software by competitors

  • and Note [page 15]: Can you give an example of this?

Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

  • and Note [page 15]: What about software like SPSS/STATA/SAS – surely many industries, and also researchers, will pay for software like this that is considered mature?

Emphasize situated and convivial software

Highlight [page 16]: a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.

  • and Note [page 16]: Could you give an example of what this might look like practically? Are you saying things like SciPy would be restructured into many separate modules, or?

Highlight [page 16]: In terms of FLOSS jargon, users make a partial fork of the project. Version control systems ensure provenance tracking and support the discovery of other forks. Keeping up to date with relevant forks of one’s software, and with the motivations for them, is part of everyday research work at the same level as keeping up to date with publications in one’s wider community. In fact, another way to describe this approach is full integration of scientific software development into established research practices, rather than keeping it a distinct activity governed by different rules.

  • and Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

Highlight [page 17]: a universe is very

  • and Note [page 17]: Perhaps this could be “would be very different” – since this doesn’t yet exist, right?

Highlight [page 17]: Improvement thus happens by small-step evolution rather than by large-scale design. While this may look strange to anyone used to today’s software development practices, it is very similar to how scientific models and theories have evolved in the pre-digital era.

  • and Note [page 17]: I think some kind of schematic or workflow to compare existing practices to this new practice would be really useful to articulate these points. I also think this new method of development you are proposing should have a concrete name.

Highlight [page 17]: Existing code refactoring tools can probably be adapted to support application-specific forks, for example via code specialization. But tools for working with the forks, i.e. discovering, exploring, and comparing code from multiple forks, are so far lacking. The ideal toolbox should support both forking and merging, where merging refers to creating consensual code versions from multiple forks. Such maintenance by consensus would probably be much slower than maintenance performed by a coordinated team.

  • and Note [page 17]: Perhaps an example or screenshot of a diff could be used to demonstrate that we can make these comparisons between two branches/commits, but comparing multiple is challenging?

Make scientific software explainable

Highlight [page 18]: An interesting line of research in software engineering is exploring possibilities to make complete software systems explainable [Nierstrasz and Girba 2022]. Although motivated by situated business applications, the basic ideas should be transferable to scientific computing

  • and Note [page 18]: Is this similar to concepts such as “X-AI” or “X-ML” – that is, “Explainable” Artificial Intelligence or Machine Learning?

Highlight [page 18]: Unlike traditional notebooks, Glamorous Toolkit [feenk.com 2023],

  • and Note [page 18]: It appears that you have introduced “Glamorous Toolkit” as an example of these three principles? It feels like it should be introduced earlier in this paragraph?

Highlight [page 18]: In Glamorous Toolkit, whenever you look at some code, you can access corresponding examples (and also other references to the code) with a few mouse clicks

  • and Note [page 18]: I think it would be very beneficial to show screenshots of what the author means – while I can follow the link to Glamorous Toolkit, bitrot is a thing and that might go away, so it would be good to see exactly what the author means when they discuss these examples.

Use Digital Scientific Notations

Highlight [page 18]: There are various techniques for ensuring or verifying that a piece of software conforms to a formal specification

  • and Note [page 18]: Can you give an example of these techniques?

Highlight [page 18]: The use of these tools is, for now, reserved to software that is critical for safety or security,

  • and Note [page 18]: Again, could you give an example of this point? Which tools, and which software is critical for safety or security?

Highlight [page 19]: formal specifications

  • and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

Highlight [page 19]: All of them are much more elaborate than the specification of the result they produce. They are also rather opaque.

  • and Note [page 19]: It isn’t clear to me how these are opaque – if the algorithm is defined, it can be understood, how is it opaque?

Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

  • and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.

Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

  • and Note [page 19]: Is an example of this test drive development?

Highlight [page 19]: A formal specification has to evolve in the same way, and is best seen as the formalization of the scientific knowledge. Change can flow from specification to software, but also in the opposite direction.

  • and Note [page 19]: Again, I think a good figure here would be very helpful in articulating this clearly.

Highlight [page 19]: My own experimental Digital Scientific Notation, Leibniz [Hinsen 2024], is intended to resemble traditional mathematical notation as used e.g. in physics. Its statements are embeddable into a narrative, such as a journal article, and it intentionally lacks typical programming language features such as scopes that do not exist in natural language, nor in mathematical notation.

  • and Note [page 19]: Could we see an example of what this might look like?

Conclusion

Highlight [page 20]: Situated software is easy to recognize.

  • and Note [page 20]: Could you provide some examples?

Highlight [page 20]: Examples from the reproducibility crisis support this view

  • and Note [page 20]: Can you provide some example papers that you mention here?

Highlight [page 21]: The ideal structure for a reliable scientific software stack would thus consist of a foundation of mature software, on top of which a transparent layer of situated software, such as a script, a notebook, or a workflow, orchestrates the computations that together answer a specific scientific question. Both layers of such a stack are reviewable, as I have explained in section 3.1, but adequate reviewing processes remain to be enacted.

  • and Note [page 21]: Again, I think it would be very insightful for the reader to have a clear figure to rest these ideas upon.

Highlight [page 21]: has been neglected by research institutions all around the world

  • and Note [page 21]: I do not think this is true – could you instead say “neglected by most/many” perhaps?

Competing interests: None.

Peer review 2

Nico Formanek

DOI: 10.70744/MetaROR.39.1.rv2

In his article Establishing trust in automated reasoning (Hinsen, 2023), Hinsen argues that much of current scientific software lacks reviewability. Because scientific software has become such a central part of many scientific endeavors, he worries that unreviewed software might contain mistakes which will never be spotted and consequently taint the scientific record. To illustrate this worry he cites issues with reproductions in different fields of science, which are often subsumed under the umbrella term of reproducibility crises. These crises, though not uncontested, have varied sources. In the field of social psychology, reproducibility issues can often be traced to errors in statistical analyses, while shifting baselines and data leakage lead to problems in ML. Hinsen is only concerned with errors in scientific software. He suggests that potential errors could be spotted more easily if scientific software were more reviewable. Thus he proposes five criteria against which reviewability could be judged. I will not discuss them in detail in this commentary and refer the interested reader to Hinsen (2023, section 2) for an extensive discussion. I note, though, that the five criteria are meant to ensure an ideal type of reproducibility which Hinsen defines as follows: “Ideally, each piece of software should perform a well-defined computation that is documented in sufficient detail for its users and verifiable by independent reviewers.” (Hinsen, 2023, p.2). I take the upshot of these criteria to be that one could assess the reviewability of a piece of software before actually doing the review. They could thus function, perhaps contrary to Hinsen’s open science convictions, as a gatekeeping device in a peer review process for software. An editor could “desk reject” software for not fulfilling the criteria before even sending it out to potential reviewers.
If I am correct in this interpretation, then we should entertain the same caution with them as we do with preregistration.

To be fair, Hinsen envisions a software review process which differs in several ways from current peer review with its acknowledged defects. He says, “Developing suitable intermediate processes and institutions for reviewing such software is perhaps possible, but I consider it scientifically more appropriate to restructure such software into a convivial collection of more situated modules, possibly supported by a shared wide-spectrum layer.” (Hinsen, 2023, p.16).

Convivial software in turn is supposed to augment “its users’ agency over their computation” (Hinsen, 2023, p.16). This gives us a hint about the kind of user Hinsen has in mind – it is the software developer as a user. His concept of reviewability aims to make software transparent only to this kind of user (see Hinsen, 2023, p.20). In one of his many comparisons of scientific software to science, he notes that “[...] the main intellectual artifacts of science, i.e. theories and models, have always been convivial.” (Hinsen, 2023, p.9), and we can guess that he wants this to be the case for software too. But, if at all, scientific theories and models have only ever been convivial for scientists. The comparison also works the other way around: science, as much as software, is heavily fragmented into modules (disciplines). Scientists have always relied on the results of other scientists – they often have done and still do so without reviewing them. Has this hindered progress? I think one would be hard pressed to answer such a question in general for science, and perhaps it is the same for scientific software.

As Hinsen admits, formal peer review is a rather novel addition to scientific methodology, having been enforced on a larger scale only for the past fifty years or so. Science progressed for many years without it, so we could ask why scientific software should not do likewise. Hinsen’s answer, of course, has to do with how he grades such software with respect to his reviewability criteria – obviously, most of it scores badly. Most scientific software is neither reviewed nor reviewable, Hinsen claims. This he considers a defect, because only reviewable software has the potential of being reviewed. Many practical considerations he discusses actually speak against the hope that most reviewable software will ever be reviewed. Still, without reviewability, it is hard, if not impossible, to spot mistakes. A case that was recently brought to my attention emphasizes this point. Beheim et al. (2021) point out that a statistical analysis imputed missing values in an archaeo-historical database with the number 0. But for the statistical model (and software!) in use, 0 had a different meaning than “not available”. This casts doubt on the conclusion that was drawn from the model. Beheim et al. were only able to spot this assumption because the code and data were available for review1. Cases like this abound and are examples of the invisible programming values that philosopher James Moor discussed in the context of computer ethics (see Moor, 1985, The invisibility factor). Hinsen calls such values “tacit assumptions made by software developers” (Hinsen, 2023, p. 3). We might speculate, though, about what would have happened if this questionable result had been incorporated into the scientific canon. Would later scientists really have continued building on it without ever realizing their shaky foundations? Or would the whole edifice have had to face the tribunal of experience at some point and crumbled?
Perhaps the originating problem would never have been found and a whole research program would have been abandoned, perhaps a completely different part would have been blamed and excised – hard to say!

But maybe reviewability can also serve a different aim than establishing trust in the results of certain pieces of scientific software. Perhaps it facilitates building on and incorporating pieces of such software in other projects. Its purpose could be more instrumental than epistemic. Although Hinsen seems to worry more about the epistemic problems coming with lack of reviewability, many points he makes implicitly deal with practical problems of software engineering. Whoever has fought against Jupyter notebooks with legacy Python requirements can immediately relate to his wish to keep the execution environment as small as possible. For Hinsen, software is actually defined by its execution environment (Hinsen, 2023, p. 11); thus the complete environment must be available for its reviewability2. Software cannot really be seen as a separate entity, and a review always reviews the whole environment. Analogously to the Quine-Duhem thesis, we could call this situation review holism. But review holism might be less problematic than its scientific cousin suggests. We might not actually need to explicitly review the whole system. Perhaps it is sufficient if we achieve frictionless reproducibility (see Donoho, 2024), that is, if other people can more or less easily incorporate and build on the software in question. Firstly, if other software that incorporates the software in question works, that already is a type of successful reproduction. Secondly, the process by which software evolves might weed out any major errors, and whatever errors remain are perhaps just irrelevant. In all fairness it has to be said that Hinsen does not think this is the case with current software. He argues that “Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.” (Hinsen, 2023, p. 13).
But if that is the case, then Hinsen’s later recourse to reliabilist-style justifications for software correctness is blocked too. We are in a situation for which the late Humphreys coined the term strange error (Rathkopf & Heinrichs, 2023, p. 5). Strange errors are a challenge for any reliabilist account of justification because their magnitude can easily overwhelm arduously collected reliability assurances. If computational reliabilism were just reliabilism, and Hinsen seems to take it as such3, it would suffer from this problem too. But computational reliabilism has an additional internalist component, which explicitly allows for the whole toolbox of “rationalist” software verification methods. If possible, we should learn something about our tools other than their mere reliability. As Hacking said, “[To understand] whether one sees through a microscope, one needs to know quite a lot about the tools.” (Hacking, 1981, p. 135).

I would go so far as to say that, if available, internalist justifications are preferable to reliabilistic guarantees. It is just that often they are not available, and then we might content ourselves with the guarantees reliabilism provides. I said “might content” here because such guarantees are unlikely to satisfy the skeptic. Obviously, strange errors are always a possibility, and no finite observation of correct software behaviour can completely rule them out. But in practice such concerns tend to fade over time, although they provide opportunity for unchecked philosophical skepticism. Many discussions about software opacity feed on such skepticism, and this is what I tried to counterbalance with computational reliabilism. In this spirit, computational reliabilism was an attempt to temper theoretical skeptics in philosophy, not to give normative guidance to software engineering practice. My view has always been that practice has the last say over philosophical concerns. If the emerging view in software engineering practice now is that more skepticism is appropriate, I will happily concur. But I should like to remind the practitioner that evidence for such skepticism has to be given in practice too; mere theoretical possibilities are not sufficient to establish it.

Reviewability does not mean reviewed. And only reviews can give us trust – or so we might think. As Hinsen acknowledges, we should not expect that a majority of scientific software will ever be reviewed. Does this mean we cannot trust the results from such software? Above I tried to sketch a way out of this conundrum: we can view reviewability as advocated by Hinsen as a way to enable frictionless reproducibility, which in turn lets us build upon software, incorporate it in our own projects, and use its results. As long as it works in a practically fulfilling way, this might be all the reviewing we need.

Notes

1A statistician once told me that one glance at the raw data of this example immediately made clear to him that, whatever problem there was with imputation, the data would never have supported the desired conclusions in any way. One man’s glance is another’s review.

2Hinsen’s definition of software closely parallels that of Moor, who argued that computer programs are a relation between a computer, a set of instructions and an activity (Moor, 1978, p.214).

3Hinsen characterizes computational reliabilism as follows: “As an alternative source of trust, they propose computational reliabilism, which is trust derived from the experience that a computational procedure has produced mostly good results in a large number of applications.” (Hinsen, 2023, p. 10)

References

Beheim, B., Atkinson, Q. D., Bulbulia, J., Gervais, W., Gray, R. D., Henrich, J., Lang, M., Monroe, M. W., Muthukrishna, M., Norenzayan, A., Purzycki, B. G., Shariff, A., Slingerland, E., Spicer, R., & Willard, A. K. (2021). Treatment of missing data determined conclusions regarding moralizing gods. Nature, 595(7866), E29–E34. https://doi.org/10.1038/s41586-021-03655-4

Donoho, D. (2024). Data Science at the Singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef

Hacking, I. (1981). Do We See Through a Microscope? Pacific Philosophical Quarterly, 62(4), 305–322. https://doi.org/10.1111/j.1468-0114.1981.tb00070.x

Hinsen, K. (2023, July). Establishing trust in automated reasoning. https://doi.org/10.31222/osf.io/nt96q

Moor, J. H. (1978). Three Myths of Computer Science. The British Journal for the Philosophy of Science, 29(3), 213–222. https://doi.org/10.1093/bjps/29.3.213

Moor, J. H. (1985). What is computer ethics? Metaphilosophy, 16(4), 266–275. https://doi.org/10.1111/j.1467-9973.1985.tb00173.x

Rathkopf, C., & Heinrichs, B. (2023). Learning to Live with Strange Error: Beyond Trustworthiness in Artificial Intelligence Ethics. Cambridge Quarterly of Healthcare Ethics, 1–13. https://doi.org/10.1017/S0963180122000688

Competing interests: None.

Author response

DOI: 10.70744/MetaROR.XXX.1.ar

Dear editors and reviewers,

Thank you for your careful reading of my manuscript and the detailed and insightful feedback. It has contributed significantly to the improvements in the revised version. Please find my detailed responses below.

1 Reviewer 1

Thank you for this helpful review, and in particular for pointing out the need for more references, illustrations, and examples in various places of my manuscript. In the case of the section on experimental software, the search for examples made clear to me that the label was in fact badly chosen. I have relabeled the dimension as “stable vs. evolving software”, and rewritten the section almost entirely. Another major change motivated by your feedback is the addition of a figure showing the structure of a typical scientific software stack (Fig. 2), and of three case studies (section 2.7) in which I evaluate scientific software packages according to my five dimensions of reviewability. The discussion of conviviality (section 2.4), a concept that is indeed not widely known yet, has been much expanded. I have followed the advice to add references in many places. I have been more hesitant to follow the requests for additional examples and illustrations, because of the inevitable conflict with the equally understandable request to make the paper more compact. In many cases, I have preferred to refer to examples discussed in the literature. A few comments deserve a more detailed reply:

Introduction

Highlight [page 3]: In fact, we do not even have established processes for performing such reviews

and Note [page 3]: I disagree, there is the Journal of Open Source Software: https://joss.theoj.org/, rOpenSci has a guide for development of peer review of statistical software: https://github.com/ropensci/statistical software-review-book, and also maintain a very clear process of software review: https://ropensci.org/software-review/

As I say in the section “Review the reviewable”, these reviews are not independent critical examination of the software as I define it. Reviewers are not asked to evaluate the software’s correctness or appropriateness for any specific purpose. They are expected to comment only on formal characteristics of the software publication process (e.g. “is there a license?”), and on a few software engineering quality indicators (“is there a test suite?”).

Highlight [page 3]: This means that reviewing the use of scientific software requires particular attention to potential mismatches between the software’s behavior and its users’ expectations, in particular concerning edge cases and tacit assumptions made by the software developers. They are necessarily expressed somewhere in the software’s source code, but users are often not aware of them.

and Note [page 3]: The same can be said of assumptions for equations and mathematics- the problem here is dealing with abstraction of complexity and the potential unintended consequences.

Indeed. That’s why we need someone other than the authors to go through mathematical reasoning and verify it. Which we do.

Reviewability of automated reasoning systems

Wide-spectrum vs. situated software

Highlight [page 6]: Situated software is smaller and simpler, which makes it easier to understand and thus to review.

and Note [page 6]: I’m not sure I agree it is always smaller and simpler- the custom code for a new method could be incredibly complicated.

The comparison is between situated software and more generic software performing the same operation. For example, a script reading one specific CSV file compared to a subroutine reading arbitrary CSV files. I have yet to see a case in which abstraction from a concrete to a generic function makes code smaller or simpler.

Convivial vs. proprietary software

Highlight [page 8]: most of the software they produced and used was placed in the public domain

and Note [page 8]: Can you provide an example of this? I’m also curious how the software was placed in the public domain if there was no way to distribute it via the internet.

Software distribution in science was well organized long before the Internet, it was just slower and more expensive. Both decks of punched cards and magnetic tapes were routinely sent by mail. The earliest organized software distribution for science I am aware of was the DECUS Software Library in the early 1960s.

Size of the minimal execution environment

Note [page 11]: Could you provide an example of what it might look like if they were in mainstream computational science? For example, https://github.com/ropensci/rix implements using reproducible environments for R with NIX. What makes this not mainstream? Are you talking about mainstream in the sense of MS Excel? SPSS/SAS/STATA?

I have looked for quantitative studies on software use in science that would make it possible to give a precise meaning to “mainstream”, but I have not been able to find any. Based on my personal experience, mostly from teaching MOOCs on computational science in which students are asked about the software they use, the most widely used platform is Microsoft Windows. Linux is already a minority platform (though overrepresented in computer science), and Nix users are again a small minority among Linux users.

Analogies in experimental and theoretical science

Highlight [page 13]: which an experienced microscopist will recognize. Software with a small defect, on the other hand, can introduce unpredictable errors in both kind and magnitude, which neither a domain expert nor a professional programmer or computer scientist can diagnose easily.

and Note [page 13]: I don’t think this is a fair comparison. Surely there must be instances of experienced microscopists not identifying defects? Similarly, why can’t there be examples of a domain expert or professional programmer/computer scientist identifying errors? Don’t unit tests help protect us against some of our errors? Granted, they aren’t bulletproof, and perhaps act more like guard rails.

There are probably cases of microscopists not noticing defects, but my point is that if you ask them to look for defects, they know what to do (and I have made this clearer in my text). For contrast, take GROMACS (one of my case studies in the revised manuscript) and ask either an expert programmer or an experienced computational biophysicist if it correctly implements, say, the AMBER force field. They wouldn’t know what to do to answer that question, both because it is ill-defined (there is no precise definition of the AMBER force field) and because the number of possible mistakes and symptoms of mistakes is enormous. I have seen a protein simulation program fail for proteins whose number of atoms was in a narrow interval, defined by the size that a compiler attributed to a specific data structure. I was able to catch and track down this failure only because a result was obviously wrong for my use case. I have never heard of similar issues with microscopes.

Improving the reviewability of automated reasoning systems

Review the reviewable

Highlight [page 15]: The main difficulty in achieving such audits is that none of today’s scientific institutions consider them part of their mission.

and Note [page 15]: I disagree. Monash provides an example here where they view software as a first class research output: https://robjhyndman.com/files/EBS_research_software.pdf

This example is about superficial reviews in the context of career evaluation. Other institutions have similar processes. As far as I know, none of them ask reviewers to look at the actual code and comment on its correctness or its suitability for some specific purpose.

Science vs. the software industry

Highlight [page 15]: few customers (e.g. banks, or medical equipment manufacturers) are willing to pay for

and Note [page 15]: What about software like SPSS/STATA/SAS- surely many many industries, and also researchers will pay for software like this that is considered mature?

I could indeed extend the list of examples to include various industries. Compared to the huge number of individuals using PCs and smartphones, that’s still few customers.

Emphasize situated and convivial software

Note [page 16]: Could the author provide a diagram or schematic to more clearly show how such a system would work with forks etc?

I have decided the contrary: I have significantly shortened this section, removing all speculation about how the ideas could be turned into concrete technology. The reason is that I have been working on this topic since I wrote the reviewed version of this manuscript, and I have a lot more to say about it than would be reasonable to include in this work. This will become a separate article.

Make scientific software explainable

Note [page 18]: I think it would be very beneficial to show screenshots of what the author means- while I can follow the link to Glamorous Toolkit, bitrot is a thing, and that might go away, so it would be good to see exactly what the author means when they discuss these examples.

Unfortunately, static screenshots can only convey a limited impression of Glamorous Toolkit, but I agree that they are a more stable support than the software itself. Rather than adding my own screenshots, I refer to a recent paper by the authors of Glamorous Toolkit that includes many screenshots for illustration.

Use Digital Scientific Notations

Highlight [page 19]: formal specifications

and Note [page 19]: It would be really helpful if you could demonstrate an example of a formal specification so we can understand how they could be considered constraints.

Highlight [page 19]: Moreover, specifications are usually more modular than algorithms, which also helps human readers to better understand what the software does [Hinsen 2023]

and Note [page 19]: A tight example of this would be really useful to make this point clear. Perhaps with a figure of a specification alongside an algorithm.

I do give an example: sorting a list. To write down an actual formalized version, I’d have to introduce a formal specification language and explain it, which I think goes well beyond the scope of this article. Illustrating modularity requires an even larger example. This is, however, an interesting challenge which I’d be happy to take up in a future article.

Highlight [page 19]: In software engineering, specifications are written to formalize the expected behavior of the software before it is written. The software is considered correct if it conforms to the specification.

and Note [page 19]: Is an example of this test-driven development?

Not exactly, though the underlying idea is similar: provide a condition that a result must satisfy as evidence for being correct. With testing, the condition is spelt out for one specific input. In a formal specification, the condition is written down for all possible inputs.
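That contrast can be made concrete with a small sketch (my own illustration in Python, not taken from the article, using the sorting example mentioned in the reply): a test spells out the expected result for one input, while a specification is a condition that every input/output pair must satisfy, which here we can only sample.

```python
import random

def is_sorted_correctly(inp, out):
    """Specification: for ANY input list, the output must be ordered
    and contain exactly the same elements as the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_elements = sorted(inp) == sorted(out)
    return ordered and same_elements

def my_sort(xs):
    # Stand-in for the implementation under review.
    return sorted(xs)

# A test: the condition spelt out for one specific input.
assert my_sort([3, 1, 2]) == [1, 2, 3]

# The specification holds for all possible inputs; in practice we can
# only sample it, e.g. with randomly generated lists.
for _ in range(100):
    xs = [random.randint(-10, 10) for _ in range(random.randint(0, 8))]
    assert is_sorted_correctly(xs, my_sort(xs))
```

A formal specification language would state the universally quantified condition directly rather than sampling it, but the checker above conveys the structural difference.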

2 Reviewer 2

First of all, I would like to thank the reviewer for this thoughtful review. It addresses many points that required clarification in my article, which I hope to have provided adequately in the revised version.

One such point is the role and form of reviewing processes for software. I have made it clearer that I take “review” to mean “critical independent inspection”. It could be performed by the user of a piece of software, but the standard case should be a review performed by experts at the request of some institution that then publishes the reviewers’ findings. There is no notion of gatekeeping attached to such reviews. Users are free to ignore them. Given that today we publish and use scientific software without any review at all, the risk of shifting to the opposite extreme, with reviewers becoming gatekeepers, seems small to me.

Your comment on users being software developers addresses another important point that I had failed to make clear: conviviality is all about diminishing the distinction between developers and users. Users gain agency over their computations at the price of taking on more of a developer role. This is now stated explicitly in the revised article. Your hypothesis that I want scientific software to be convivial is only partially true. I want convivially structured software to be an option for scientists, with adequate infrastructure and tooling support, but I do not consider it to be the best approach for all scientific software.

The paragraph on the relevance and importance of reviewing in your comment is a valid point of view but, unsurprisingly, not mine. In the grand scheme of science, no specific quality assurance measure is strictly necessary. There is always another layer above that will catch mistakes that weren’t detected in the layer below. It is thus unlikely that unreliable software will cause all of science to crumble. But from many perspectives, including overall efficiency, personal satisfaction of practitioners, and insight derived from the process, it is preferable to catch mistakes as closely as possible to their source. Pre-digital theoreticians always double-checked their manual calculations before submitting their papers, rather than sending off unchecked results and counting on confrontation with experiment to find mistakes. I believe that we should follow this same approach with software. The cost of mistakes can be quite high. Consider the story of the five retracted protein structures that I cite in my article (Miller, 2006, 10.1126/science.314.5807.1856). The five publications that were retracted involved years of work by researchers, reviewers, and editors. In between their publication and their retraction, other protein crystallographers saw their work rejected because it was in contradiction with the high-profile articles that later turned out to be wrong. The whole story has probably involved a few ruined careers in addition to its monetary cost. In contrast, independent critical examination of the software and the research processes in which it was used would likely have spotted the problem rather quickly (Matthews, 2007).

You point out that reviewability is also a criterion in choosing software to build on, and I agree. Building on other people’s software requires trusting it. Incorporating it into one’s own work (the core principle of convivial software) requires understanding it. This is in fact what motivated my reflections on this topic. I am not much interested in neatly separating epistemic and practical issues. I am a practitioner, my interest in epistemology comes from a desire for improving practices.

Review holism is something I have not thought about before. I consider it both impossible to apply in practice and of little practical value. What I am suggesting, and I hope to have made this clearer in my revision, is that reviewing must take into account the dependency graph. Reviewing software X requires a prior review of its dependencies (possibly already done by someone else), and a consideration of how each dependency influences the software under consideration. However, I do not consider Donoho’s “frictionless reproducibility” a sufficient basis for trust. It has the same problem as the widespread practice of tacitly assuming a piece of software to be correct because it is widely used. This reasoning is valid only if mistakes have a high chance of being noticed, and that’s in my experience not true for many kinds of research software. “It works”, when pronounced by a computational scientist, really means “There is no evidence that it doesn’t work”.
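The dependency-aware reviewing described above (review a dependency before the software that builds on it, reusing earlier verdicts) can be sketched as a traversal of the dependency graph. This is my own toy illustration, not a proposal from the article; the package names and graph are entirely hypothetical, and real dependency graphs would also need cycle detection.

```python
# Hypothetical dependency graph: package -> direct dependencies.
deps = {
    "analysis-script": ["numlib", "plotlib"],
    "plotlib": ["numlib"],
    "numlib": [],
}

def review_order(graph):
    """Return an order in which every package comes after its dependencies,
    so each review can build on the reviews of what it depends on."""
    order, visited = [], set()

    def visit(pkg):
        if pkg in visited:
            return  # possibly already reviewed by someone else
        visited.add(pkg)
        for dep in graph[pkg]:
            visit(dep)  # dependencies are reviewed first
        order.append(pkg)

    for pkg in graph:
        visit(pkg)
    return order

print(review_order(deps))  # numlib before plotlib before analysis-script
```

Python's standard library offers `graphlib.TopologicalSorter` for the same ordering problem; the explicit version above just makes the review-before-use idea visible.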

This is also why I point out the chaotic nature of computation. It is not about Humphreys’ “strange errors”, for which I have no solution to offer. It is about the fact that looking for mistakes requires some prior idea of what the symptoms of a mistake might be. Experienced researchers do have such prior ideas for scientific instruments, and also e.g. for numerical algorithms. They come from an understanding of the instruments and their use, including in particular a knowledge of how they can go wrong. But once your substrate is a Turing-complete language, no such understanding is possible any more. Every programmer has had the experience of chasing down some bug that at first sight seems impossible. My long-term hope is that scientific computing will move towards domain-specific languages that are explicitly not Turing-complete, and offer useful guarantees in exchange. Unfortunately, I am not aware of any research in this space.

I fully agree with you that internalist justifications are preferable to reliabilistic ones. But being fundamentally a pragmatist, I don’t care much about that distinction. Indisputable justification doesn’t really exist anywhere in science. I am fine with trust that has a solid basis, even if there remains a chance of failure. I’d already be happy if every researcher could answer the question “why do you trust your computational results?” in a way that shows signs of critical reflection.

What I care about ultimately is improving practices in computational science. Over the last 30 years, I have seen numerous mistakes being discovered by chance, often leading to abandoned research projects. Some of these mistakes were due to software bugs, but the most common cause was an incorrect mental model of what the software does. I believe that the best technique we have found so far to spot mistakes in science is critical independent inspection. That’s why I am hoping to see it applied more widely to computation.

2.1 References

Miller, G. (2006) A Scientist’s Nightmare: Software Problem Leads to Five Retractions. Science 314, 1856. https://doi.org/10.1126/science.314.5807.1856

Matthews, B.W. (2007) Five retracted structure reports: Inverted or incorrect? Protein Science 16, 1013. https://doi.org/10.1110/ps.072888607

3 Editor

Bayesian methods often use MCMC, which is often slow and creates long chains of estimates; however, the chains will show if the likelihood does not have a clear maximum, which is usually from a badly specified model…

That is an interesting observation I haven’t seen mentioned before. I agree that Bayesian inference is particularly amenable to inspection. One more reason to normalize inspection and inspectability in computational science.

Some reflection on the growing use of AI to write software may be worthwhile.

The use of AI in writing and reviewing software is a topic I have considered for this review, since the technology has evolved enormously since I wrote the current version of the manuscript. However, in view of reviewer 1’s constant admonition to back up statements with citations, I refrained from delving into this topic. We all know it’s happening, but it’s too early to observe a clear impact on research software. I have therefore limited myself to a short comment in the Conclusion section.

I wondered if highly-used software should get more scrutiny.

This is an interesting suggestion. If and when we get serious about reviewing code, resource allocation will become an important topic. For getting started, it’s probably more productive to review newly published code than heavily used code, because there is a better chance that authors actually act on the feedback and improve their code before it has many users. That in turn will help improve the reviewing process, which is what matters most right now, in my opinion.

“supercomputers are rare”, should this be “relatively rare” or am I speaking from a privileged university where I’ve always had access to supercomputers.

If you have easy access to a supercomputer, you should indeed consider yourself privileged. But did you ever use supercomputer time for reviewing someone else’s work? I have relatively easy access to supercomputers as well, but I do have to make a request and promise to do innovative research with the allocated resources.

I did think about “testthat” at multiple points whilst reading the paper (https://testthat.r-lib.org/)

I hadn’t seen “testthat” before, not being much of a user of R. It looks interesting, and reminds me of similar test support features in Smalltalk which I found very helpful. Improving testing culture is definitely a valuable contribution to improving computational practices.

Can badges on github about downloads and maturity help (page 7)?

Badges can help, on GitHub or elsewhere, e.g. in scientific software catalogs. I see them as a coarse-grained output of reviewing. The right balance to find is between the visibility of a badge and the precision of a carefully written review report. One risk with badges is the temptation to automate the evaluation that leads to it. This is fine for quantitative measures such as test coverage, but what we mostly lack today is human expert judgement on software.
