Published at MetaROR

May 4, 2026


Cite this article as:

Röseler, L., Wallrich, L., Hartmann, H., Altegoer, L., Boyce, V., Field, S., Goltermann, J., Hüffmeier, J., Pennington, C., Pittelkow, M.-M., Silverstein, P., van Ravenzwaaij, D., & Azevedo, F. (2025). Handbook for Reproduction and Replication Studies (0.1) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.16990115

Handbook for Reproduction and Replication Studies

Lukas Röseler1 ORCID, Lukas Wallrich2 ORCID, Helena Hartmann3 ORCID, Luisa Altegoer ORCID, Veronica Boyce ORCID, Sarahanne Field4 ORCID, Janik Goltermann5 ORCID, Joachim Hüffmeier ORCID, Charlotte Pennington6 ORCID, Merle-Marie Pittelkow ORCID, Priya Silverstein ORCID, Don van Ravenzwaaij4 ORCID, Flavio Azevedo4,7 ORCID

1. University of Münster
2. Birkbeck, University of London
3. Essen University Hospital
4. University of Groningen
5. University Hospital Münster
6. Aston University
7. Cologne University

Originally published on August 29, 2025 at: 

Abstract

The practice of repeatedly testing published results with the same data (reproduction) or new data (replication) is currently gaining traction in the social sciences, owing to multiple failures to reproduce and replicate published findings. Along with increased skepticism have come guidelines for the repeated testing of hypotheses from various disciplines and fields. This guide aims to enable researchers to conduct high-quality reproductions and replications across social science disciplines. First, we summarize recent developments; then we provide a comprehensive guide to carrying out reproductions and replications; finally, we present an example of how guidance needs to be tailored for specific fields. Our guide covers the entire research process: choosing a target study, deciding between different types of reproductions and replications, planning and running the new study, analyzing the results, discussing outcomes in the light of potential differences, and publishing a report.


Keywords: replication, repetitive research, reproducibility, meta-science, meta-research, open science, open research, open scholarship

Last update: 2025-08-28

How to Cite

Please cite this handbook as:

Röseler, L., Wallrich, L., Hartmann, H., Altegoer, L., Boyce, V., Field, S. M., Goltermann, J., Hüffmeier, J., Pennington, C. R., Pittelkow, M.-M., Silverstein, P., van Ravenzwaaij, D., & Azevedo, F. (2025). Handbook for Reproduction and Replication Studies. Zenodo. https://doi.org/10.5281/zenodo.16990115

Part I Foundations

1 Background

“The proof established by the test must have a specific form, namely, repeatability. The issue of the experiment must be a statement of the hypothesis, the conditions of test, and the results, in such form that another experimenter, from the description alone, may be able to repeat the experiment. Nothing is accepted as proof, in psychology or in any other science, which does not conform to this requirement.” – (Dunlap 1926)

Repeatability is the cornerstone of many sciences: Much of scientific progress rests on the successful accumulation of evidence for claims through reproductions and replications, which establish robust discoveries. Reproductions and replications, that is, the repeated testing of a hypothesis with the same (reproduction) or different (replication) data, are necessary.

Cumulative science without repetition is costly. The aim of this guide is to empower researchers to conduct high-quality reproductions and replications and thereby contribute to making their fields of research more cumulative and robust. Issues of replicability have been discussed across many disciplines, such as psychology (Open Science Collaboration 2015), economics (Dreber and Johannesson 2024), biology (Errington, Mathur, et al. 2021), marketing (Urminsky and Dietvorst 2024), linguistics (McManus 2024), computer science (Hummel and Manner 2024), and epidemiology (Lash, Collin, and Van Dyke 2018), and the number of replications has been rising sharply (see Figure 1.1).

While the number of replication and reproduction studies has increased, their overall proportion remains very small, with reviews finding yearly replication rates of up to 1% (Perry, Morris, and Lea 2022). Moreover, much of the guidance on replications is still under active development (Clarke et al. 2024) and emerges within narrow parts of science, which leads to fragmentation, siloing, and potentially inconsistent information.

Figure 1.1. Number of replication studies by year of publication, based on the FORRT Replication Database (Röseler et al. 2024) as of July 16, 2025.

Here we attempt to integrate useful guidelines (e.g., Block and Kuckertz 2018; Jekel et al. 2020) into a comprehensive overview that allows diverse fields to profit from each other. In sum, this guide provides information about the entire research process, allowing researchers at all career stages to plan, conduct, and publish reproduction and replication studies. We limit our scope to quantitative research, given that the concept of reproducibility and replicability itself is highly contested among qualitative researchers (see Makel, Plucker, and Hegarty 2012; Cole et al. 2024; Pownall 2022; Bennett 2021).

2 Understanding Replications and Reproductions

In this guide, we focus on studies that re-examine a previously tested hypothesis and refer to them as repetitions (i.e., reproductions and replications), with the general field being called repetitive research, as suggested by Schöch (2023). However, it is important to note from the outset that there is no overarching terminology or consensus (e.g., Voelkl et al. 2025), as the formal development of replication methods began relatively late in the social, behavioral, and cognitive sciences. For example, empirical psychology is more than 100 years old, but until the advent of the replication/reproducibility crisis in the early 2010s, replication methods were rarely discussed (e.g., King 1995). Different fields of research seem to tackle the task differently and independently, which has led to multiple overlapping terminologies across psychology (Schmidt 2009; Hüffmeier, Mazei, and Schultze 2016), management (Tsang and Kwan 1999), marketing (Urminsky and Dietvorst 2024), organizational sciences (Köhler and Cortina 2021), computer science (Heroux et al. 2018), language learning (McManus 2024), and the humanities (Schöch 2023).

2.1 Reproduction and Replication

The terms reproduction and replication are used in different ways between disciplines; for example, in psychology, studies using different data are commonly referred to as replications and studies using the same data are referred to as reproductions, whereas in other fields, such as computational science or economics, these terms may be used in the opposite manner or treated interchangeably (see Miłkowski, Hensel, and Hohol 2018; Ankel-Peters, Fiala, and Neubauer 2023). In this paper, replication is used to refer to efforts involving the analysis of different data, and reproduction to efforts involving the same data. The different data do not necessarily need to be from a different sample but can also constitute distinct (non-overlapping) subsets of the same sample (e.g., incidental or panel data, Huang and Huang 2024).

Reproduction and replication should always be considered together, and if possible, reproduction should come before replication. This is because, at the early stages of research, reproduction is much more cost-efficient; first confirming whether the findings are reproducible can clarify whether a replication is worthwhile. Furthermore, if the research procedure consists of “moving away” from a specific finding in terms of changing the analysis code, materials, and dataset to test its generalizability or boundary conditions, a numerical reproduction (using the same data and same code) is the closest possible repetition of a finding and a useful foundation for further steps. We discuss multiple cases to illustrate the relationship between reproduction and replication in Table 2.1. (Note that a similar distinction is made by The Turing Way Community (2025), which uses less specific terminology for reproductions.)

Table 2.1. Possible combinations of reproduction and replication outcomes.
Case Reproducible? Replicable? Possible interpretation
A Yes Yes The original finding is reproducible and generalizable.
B Yes No The original finding is reproducible but not generalizable.
C No Yes The original finding is not reproducible but replicators could determine a scenario where it holds.
D No No The original finding is neither reproducible nor generalizable.

2.2 Outcome

Common language often conflates outcome and study descriptions: researchers typically use the phrase “has been replicated” to refer to a replication attempt that has corroborated the findings of the original study, whereas “failed to replicate” or “could not be replicated” is used to refer to circumstances where a replication attempt has not corroborated the original results or has led to a different interpretation or conclusions (see also Patil, Peng, and Leek 2016a).

In this article, when we state that a “study was reproduced/replicated” we mean that there has been a replication attempt, irrespective of its outcome. With “replicable” and “reproducible” we express that there was support for the original hypothesis. Note that the outcome of a replication/reproduction study is often not straightforward, but may depend on the success criteria applied. This is discussed in Section 7.1.

2.3 Types of replication

We rely heavily on the typology provided by Hüffmeier, Mazei, and Schultze (2016), in which different types of replications are defined by the closeness or similarity between the original and replication study. Similarity cannot be evaluated without a theory about the concepts involved. For example, the concept of age can differ strongly between replications of historical, psychological, or biological studies, leading to different measures of the concept itself and thus different judgments about the similarity of an object’s age.

Under the assumption of a stable world and constant laws or regularities that are investigated by the social, behavioral, and cognitive sciences, a reproduction and replication study’s closeness to an original study is associated with replication ‘success’ (Hüffmeier, Mazei, and Schultze 2016; LeBel et al. 2018). The argument can be made from two different philosophical perspectives that we call inductive (phenomenon-focused, effects application, bottom-up) and deductive (theory-focused, theory application, top-down, e.g., Calder, Phillips, and Tybout 1981; Borgstede and Scholz 2021). From an inductive perspective, a replication that is very similar to an original study should lead to the same result, whereas one that differs with respect to any criterion may lead to different results.1 This is a stance often taken by proponents of findings that failed to replicate (e.g., Baumeister and Vohs 2016; Syed 2023), arguing that characteristics such as time or place are different and can be valid reasons for different results. From a deductive (theory-focused) view, the only changes that matter are those that affect the underlying theory. Consider, for example, a replication experiment that is identical in every aspect except for the season (summer instead of winter). If the theory that is tested is about color perception, the replication is likely judged to be close to the original study, but if it is about participants’ current tea preferences, it is likely judged to differ from the original study in a theoretically relevant aspect.2 A related dimension of closeness concerns contextual sensitivity—the extent to which the meaning of a questionnaire or the effect of a manipulation depends on time, culture, or population. As Van Bavel et al. (2016) demonstrate, studies on contextually sensitive topics were significantly less likely to replicate successfully in Open Science Collaboration (2015), even though methodological fidelity was high.
This raises important questions about what constitutes a “close” replication: Should a study on celebrity attitudes, for example, use the same examples (which may be outdated and thus psychologically inert), or should it adapt to locally and temporally salient figures to trigger the same cognitive or emotional responses? In such cases, strict methodological similarity might paradoxically undermine theoretical closeness, and thus the validity of the replication attempt. This tension highlights that procedural fidelity does not always equate to theoretical equivalence—particularly for studies involving social meaning, identity, or temporally anchored norms. LeBel et al. (2018) provide a taxonomy for classifying a replication study’s closeness for psychological research.

Figure 2.1. Taxonomy for classifying a replication study’s methodological similarity to an original study. Reprinted from LeBel et al. (2018) with permission.

Support for the view that theoretically irrelevant methodological features, such as the use of text versus image stimuli or the type of sample, can have a strong impact on the results is provided by Landy et al. (2020), who let different groups of researchers test identical hypotheses using different study designs. The groups arrived at entirely different and even opposite conclusions for similar hypotheses. The differences in the study designs were not predicted by the theories involved in the respective studies: A priori, none of the differences (e.g., within- vs. between-subjects design, picture vs. text stimuli) “should” have affected the conclusions. Note that other theories such as demand characteristics (Orne 2017) could help in these cases. Moreover, this does not disconfirm the deductive perspective but may be a demonstration of the lack of specification of theories – as well as a reminder that statistical choices affect statistical power by changing the variance, and thus standardized effect sizes. In line with deviations from original studies mostly having uncertain consequences, close replications more directly test the credibility of original results, while conceptual replications that vary features of the design are concerned with generalizability.

Note that Nosek and Errington (2020) define a replication as a study “for which any outcome would be considered diagnostic evidence about a claim from prior research”. This can lead to issues when the original claim is not clear on its boundary conditions. Conceptual replications that highlight limitations of the claim clearly count, e.g., when the original claim was about a universal effect and the replication shows that it does not hold in a specific country. Conversely, “replications” that go beyond the claim made, and test the transferability of a claim explicitly made about, e.g., maths education to science education, may be better framed differently, as they do not directly speak to the claim made originally. Where original authors failed to specify the scope of their claim, we would understand that they imply a broad/universally applicable relationship, which any attempts at generalization help to corroborate or specify.

In the terminology of Schöch (2023), who defines an overarching type of repetitive research based on multiple dimensions, replications are concerned with the same question as a previous study, use the same (close replication) or a similar (conceptual replication) method, and use different data (otherwise they are reproductions).

2.4 Types of reproduction

Reproductions can be numerical reproductions, testing whether the same data, code, and software lead to the same results, or robustness reproductions, extending the original analysis and exploring the central finding’s limits (Dreber and Johannesson 2024). Most reproductions would include both a numerical reproduction as a baseline and then a robustness reproduction, unless the numerical reproduction is not possible due to a lack of code or software.

Part II The Replication Process

3 Choosing the Target Study

Reproduction and replication studies can serve different goals, and depending on the goal, the way of choosing a target study differs (see Pittelkow et al. 2023). In large-scale reproduction and replication projects, such as Brodeur et al. (2024), the Reproducibility Project: Psychology (Open Science Collaboration 2015) or the Reproducibility Project: Cancer Biology (Errington, Mathur, et al. 2021), the primary aim is to assess the overall reliability of a field or a set of findings, leading to a top-down approach in which the decision to replicate comes first, followed by the selection of specific replication targets. This is often done in a way that aims to be representative of a field, ideally through random sampling, though this is generally constrained by practicalities. Here, individual studies are not the primary focus in the decision to repeat; instead, choices are guided by broader methodological or theoretical considerations. In contrast, individual researchers frequently adopt a bottom-up approach, where the decision to replicate is driven by engagement with a specific study (or a theoretically related set of studies, e.g., Röseler et al. 2021). This may occur when a researcher wishes to build upon an existing finding and seeks to verify its robustness before doing so, or when they harbor doubts about a claim and aim to test its validity. Since reproductions and replications can serve multiple purposes—from assessing theoretical frameworks to correcting potential errors—there is no single correct way to decide what to repeat. The choice of targets ultimately depends on the overarching goals and methodological approach of the replication effort, as well as on practical constraints. However, what does matter is that the selection of reproduction and replication targets is well justified and transparently communicated.
For instance, researchers can use structured frameworks such as the replication target selection checklist to ensure clarity and consistency in their decision-making process (Pittelkow et al. 2023). For a discussion of what constitutes empirical reasons for replication, see Kamermans et al. (2025).

3.1 Determining Reproduction and Replication Value

Whether a target study is “worth reproducing” or “worth replicating” is highly debated and has been suggested to depend on several overlapping factors, including value (sometimes also referred to as impact or relevance), uncertainty, and feasibility (Isager et al. 2023). Below, different suggestions for operationalizing these factors are discussed systematically.

Note that there is also ongoing discussion about whether or not all studies are generally ‘worth replicating’. One perspective is that what is worthy of publication is worthy of replication (Feldman 2025), or, put more normatively, that what is worthy of publication should be worthy of replication – though this perspective is becoming complicated through the rise of influential preprints and a publish-review-curate model of publication. Naturally, a public report of a study is necessary for other researchers to attempt a replication, and an available dataset is needed for a reproduction. A more fine-grained look at publication status reveals several different types of research. An article can be retracted, that is, confidence in its findings has been withdrawn due to research misconduct or severe errors. When the data of a study were fabricated and it was thus retracted, a reproduction will not be informative, but a replication may inform researchers about the correctness of the hypothesis, unlike the original report. Other reasons (or unclear reasons) for retraction may conversely increase the replication and reproduction value, as the source of a true claim may have become untrustworthy (and not easily citeable) due to issues unrelated to its truth (e.g., plagiarism).

Replicating and reproducing every finding that was ever published appears impossible to achieve, which is why researchers need to make decisions about prioritization. In the following, we discuss criteria by which such a prioritization can occur – restricted to quantitative research.

3.2 Value

The original study should be relevant in some way for the replication to have value (e.g., Karhulahti, Martončik, and Adamkovic 2024). It may have started a research stream. For example, Jacowitz and Kahneman’s (1995) studies on anchoring and adjustment were fundamental for how anchoring effects are investigated today, and were therefore replicated by Klein et al. (2014). Field et al. (2019) propose a method for the selection of replication studies that features the theoretical importance of the original study result. Relevance may be evidenced by many citations, as they show that many studies are building on the finding, testing similar hypotheses, or criticizing the study. Note that a study could also be cited as a negative example or as a study that has not been replicated or has been retracted for some reason. Isager et al. (2023) suggest deciding what to replicate based on sample size and citation count (but see Pittelkow, Field, and van Ravenzwaaij 2025). In a Delphi study examining consensus among psychologists who had conducted empirical replications on what should influence the decision of which study to replicate, elements that came up were the importance of the original study for research (as indicated by citations, the phenomenon being over- or understudied, and the impact factor of the journal), the theoretical relevance of the study, and the implications of the original study for practice, policy, or clinical work (Pittelkow et al. 2023). The relevance of societal impact was also stressed by (2024), as a study may have a high value for a societal problem (e.g., a new vaccine or a repeated test of a claim that is relevant in the political discourse, such as criminality among immigrants).

For scientists reproducing or replicating a study because they are interested in building on its findings (including if they wish to build upon their own original findings), their interest to build on it may be a sufficient indicator of its relevance to their research program.

3.3 Uncertainty

The more uncertain the original study’s outcome is, the greater the potential knowledge gained from reproduction and replication. Although no findings are definitive, research reports differ in the strength of the evidence they present (e.g., Registered Reports3 are typically more convincing than non-preregistered studies, Soderberg et al. 2021). Similarly, sample size (within a given field) has been proposed as an indicator of evidence strength (Isager et al. 2021). Pittelkow et al. (2021), Pittelkow et al. (2023), and Field et al. (2019) all argued for using the current strength of evidence in favor of the original claim as an important element in choosing a replication target. However, the degree of uncertainty can itself be uncertain or misjudged: In some areas of research, a hypothesis had been claimed to be confirmed hundreds of times, and yet large-scale replication efforts could not support the original hypothesis, so that after hundreds of studies the existence of the phenomenon was still unknown (e.g., Friese et al. 2019). Meta-analyses allow some tests for uncertainty (e.g., via correction of bias, evaluation of risk of bias, or estimates of heterogeneity). Although there are numerous ways to meta-analytically evaluate the expected replicability of a set of claims, none of them is as solid as a well-designed replication attempt (Carter et al. 2019). Other heuristics to estimate robustness reproducibility and replicability of sets of findings have been proposed. They include the caliper test, relative proximity, and z-curve (Bartoš and Schimmack 2022; see Adler, Röseler, and Schöniger 2023, for an overview and a ShinyApp that combines these tools). Individual findings can be assessed through forensic meta-science tests (for an overview, see Heathers 2025) and through the assessment of papers for reporting issues, such as those identified by statcheck (Nuijten and Polanin 2020; DeBruine and Lakens 2025).
Moreover, methods such as sum of p-values (Held, Pawel, and Micheloud 2024) and Bayesian re-analysis can be applied to help determine the degree of evidence for a given effect an original study might contain (Field et al. 2019; Pittelkow et al. 2021).

If the original paper reports multiple studies of the same phenomenon, researchers should check the proportion of significant studies and whether all of them confirm the hypothesis. More studies reduce the overall statistical power (power deflation). Provided the hypothesis is correct, a single study may test it with 90% power, that is, the statistical analysis will indicate the correctness of the hypothesis with a probability of .9. Now, if 10 studies are run with 90% power each, the chance of all of them supporting the hypothesis (even if it is true) is 0.9^10 ≈ 0.35. For 80% power, even finding five significant results in a row is fairly unlikely (0.8^5 ≈ 0.33). Thus, studies reporting a set of many and exclusively significant findings, when each of the studies does not have very high power, should be interpreted with caution (see also Francis 2012; Schimmack 2012; Lakens and Etz 2017).
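The power-deflation arithmetic above can be checked with a few lines of code. The following is a minimal sketch (the function name and the example values are ours, chosen to mirror the numbers in the text, not taken from any particular package):

```python
# Probability that ALL k independent studies in a multi-study paper
# reach significance, assuming the hypothesis is true and every study
# is run with the same statistical power. Illustrative sketch only.
def prob_all_significant(power: float, k: int) -> float:
    return power ** k

print(round(prob_all_significant(0.9, 10), 2))  # 0.35: ten 90%-power studies
print(round(prob_all_significant(0.8, 5), 2))   # 0.33: five 80%-power studies
```

If runs of exclusively significant results appear in the literature far more often than these probabilities would predict, that discrepancy is one signal of selective reporting (cf. Francis 2012; Schimmack 2012).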

For large parts of the literature and given the overall low replicability rate in many fields (though not all, e.g., Soto 2019), the mere lack of a reproduction or close replication by independent researchers can be used as an argument for uncertainty (e.g., Pittelkow et al. 2023). If a study has only been replicated by the original authors, it can be indicative of nobody else being interested in the phenomenon (i.e., low replication value) or nobody else being able to provide evidence for it (i.e., high uncertainty). For example, it is possible that reports of failed replications are held back by reviewers due to an aversion to null findings, replications, or findings criticizing their own work.

As replications can also be used to probe a phenomenon’s generalizability, a lack of variety in study designs can motivate a replication attempt. If there is reason to assume that a phenomenon is highly dependent on context (e.g., works only for graduate students, with English-speaking people, when people are incentivized, for the chosen stimuli, …), it can be replicated and extended in other contexts. More generally, when background factors are introduced to a study (e.g., there was a positive correlation in study X but researchers suspect it to vanish under condition M), the original finding needs to be replicated in a part of the new study for the argument to work. An added benefit of this is to help avoid later claims of ‘hidden moderators’ in original studies; an argument which has been used previously to refute the validity of replication study results (Zwaan et al. 2018).

Finally, uncertainty can be the result of a lack of specificity in the original report: If there are details missing that cannot be retrieved anymore (e.g., researchers involved in the original study cannot be reached), a replication can develop, test and share a comprehensive set of materials. For example, Chartrand and Bargh’s (1999) seminal study on the chameleon effect requires many materials but none of them are openly available. Accordingly, Pittelkow et al. (2023) identified the clarity of the original study protocol as an important element that features into the decision of replication study selection. Reconstructing these materials and documenting a procedure would, thus, be a valuable contribution of a replication study.

Theoretical contribution

In some cases, theories are so vague that a failed replication would likely be criticized for misunderstanding the theory (e.g., Baumeister and Vohs 2016). This suggests that the target theory was not well specified. If accepted as a reason not to replicate, this can discourage any form of replication despite the target finding being relevant. Instead, replication researchers can ask original authors for feedback on the study protocol before collecting data to try to ensure that it tests (and then articulates) the intended theory. They can also engage in adversarial collaboration or “red teaming” (e.g., Cowan et al. 2020; Clark et al. 2022; Corcoran, Hohwy, and Friston 2023), that is, work together with the original authors to design a study that they agree would be able to corroborate the original claim, or to call it into doubt.

Nevertheless, it has been argued that because so many original studies are flawed, the theories built upon them are weak, or contaminated. This, in turn, can lead to flawed replication studies, especially in the case of theory that aims to explain phenomena (Field et al. 2024), risking a vicious cycle in which successful replications potentially perpetuate flaws across studies.

Availability of reproductions and replications

While a single replication (or robustness reproduction) cannot provide conclusive evidence regarding the veracity of original claims, the first numerical reproduction, and arguably also the first robustness reproduction and replication, add the greatest value in terms of reducing uncertainty. Therefore, the search for existing reproductions and replications is a key part of selecting a target study.

Although there is no comprehensive database with reproductions yet, researchers can check resources such as the Institute for Replication’s discussion paper series (https://i4replication.org/discussion_paper.html, cf. Brodeur, Dreber, et al. 2024), the ReplicationWiki (Höffler 2017), the CODECHECK register (https://codecheck.org.uk/register/, cf. Nüst and Eglen 2021), or the Social Science Reproduction Platform (https://www.socialsciencereproduction.org/).

With regard to replications, researchers can browse the FORRT Replication Database (https://forrt-replications.shinyapps.io/fred_explorer/, cf. Röseler et al. 2024), though this does not (yet) provide a replacement for manual searches.

3.4 (Potential) Researcher Bias

Researchers typically work in relatively small communities to investigate the same phenomenon. These researchers are invested in their work and can be influenced by certain researcher biases, such as confirmation bias (the tendency to preferentially seek out, evaluate, and recall information that supports one’s existing beliefs, see Mahoney 1977) and motivated reasoning (generating post-hoc rationalizations that frame previous decisions in a favorable light, see Hardwicke and Wagenmakers 2023; Munafò et al. 2020). In some cases, researchers profit off their work and the (perceived) replicability of their findings may be associated with personal financial gain. Such conflicts of interest should be disclosed, but this is not always the case (see Heirene et al. 2024).

However, replication researchers are just as prone to bias as original authors. Certain studies are more likely to be chosen for replication than others (see Pennington 2023; Yarkoni 2013), and there may be a publication bias in replication studies in favor of nonsignificant findings (Berinsky, Druckman, and Yamamoto 2021), though there is no empirical evidence for this yet. Nevertheless, greater interest in failed replications seems very likely, incentivizing replication researchers to apply questionable research practices (QRPs) so that the results are nonsignificant (“null hacking,” Protzko 2018; Baumeister, Tice, and Bushman 2022). The problems of p-hacking and null-hacking can mostly be solved through preregistration and the use of Registered (Replication) Reports (e.g., Brodeur, Dreber, et al. 2024; Soderberg et al. 2021). Another type of bias is that researchers may be interested in replicating specific studies because of personal admiration for a study or envy towards a colleague.

3.5 Feasibility

Reproductions require the original dataset. We recommend that researchers check whether the journal that published the original study has a data editor or reproducibility manager who has conducted a reproducibility check or provides a replication package. A replication package is a collection of materials that allows reproduction of the original results. Ideally, the dataset in the replication package, or shared separately, adheres to the FAIR criteria (Wilkinson et al. 2016), that is, it should be findable, accessible, interoperable, and reusable. Otherwise, the reproduction author would need to send a data sharing request to the original authors. In any case, they may need to consult the original authors regarding software versions and code that no longer runs due to changes in the software.

While original data are not necessary for replications, thorough documentation of the original study is highly beneficial. Moreover, replication researchers should evaluate whether they can achieve the target sample size, which is often a multiple of the original sample size (see section Sample Size Determination). Pittelkow et al. (2023) identified the resources available to the replicating team in terms of funding, time, equipment, and (if relevant) previous experience and expertise as important factors in replication study selection. When choosing a target study, researchers should try to anticipate practical problems and should restrict their choice of replication target to align with their lab resources in order to prevent ‘secondary’ research waste (Field et al. 2019). Specifically, some studies may be difficult to replicate (e.g., longitudinal studies). Other studies, such as those conducted with highly technical, restricted, or expensive equipment, such as studies involving MRI scanning, might require expertise and knowledge that is not represented in all potential replication research teams (Field et al. 2019).

Moreover, in some fields there are no established standards for replications yet. In that case, replications may contribute less to the reduction of uncertainty, and replicators need to propose methods themselves. For example, replications with response-surface analyses are not as established as those with t-tests for two-group study designs. Furthermore, the complexity of some data types can pose challenges for definitions of successful replication, such as in neuroimaging research (e.g., MRI studies), which often involves outcome variables with an additional spatial component.

4 Planning and Conducting Reproductions and Replications

Planning depends on whether the focus is on a certain method or a theory, that is, whether the replication will be close or conceptual. Table 4.1 provides an overview of reproduction and replication types, or more generally “repetitive research” (Schöch 2023), drawn from different resources (e.g., Dreber and Johannesson 2024; Hüffmeier, Mazei, and Schultze 2016; Cortina, Köhler, and Aulisi 2023). The decision between these types is the first step in planning.

In addition, the formation of the replication team is important, as replications can take substantial resources. Notably, repetitive research has successfully been conducted collaboratively with graduate and undergraduate students (e.g., Boyce et al. 2024; Hawkins et al. 2018; Jekel et al. 2020; Moreau and Wiebels 2023), and we recommend the use of replication studies to engage students at different levels in conducting and publishing research.

Table 4.1: Types of repetitive research, ordered by reproduction and replication and respective closeness to the original study.

| Type | Description | Goals |
| --- | --- | --- |
| Computational reproduction | Reanalysis of the same data with the same code | Correctness of original report |
| Recoding reproduction | Reanalysis of the same data with new (equivalent) code | Correctness of original report |
| Robustness reproduction | Reanalysis of the same data with new coding choices; can vary in closeness | Robustness of original finding and sensitivity to different analytical decisions or software |
| Multiverse analysis | Analysis of the data in all sensible ways (i.e., a large number of different robustness reproductions) | Robustness and generalizability of original finding; identification of potential moderators or sources of effect variability |
| Internal replication | Replication of one of one's own studies as closely as possible | Demonstrate the generalizability of one's findings across studies and rule out false positives (e.g., for new discoveries) |
| Close / direct / exact replication | A new study (based on work by other researchers) that is as close as possible to the original study | Rule out the original finding being a false positive; validate original materials or design; check generalizability/external validity for theoretically irrelevant variables (e.g., population, year of data collection) |
| Close replication with extension | A close replication with an added variable or procedure | Rule out the original finding being a false positive; test generalizability of original finding |
| Conceptual / constructive replication | A study with changes that may be theoretically relevant but that tests the same hypothesis (e.g., different operationalization) | Generalizability of original finding; validity of theory |

4.1 Post Publication Conversations

When planning the replication study, additional knowledge should be taken into account, such as any discussions of the original finding. Other studies may cite the original study, criticize it, disconfirm its underlying theory, identify errors, reinterpret the finding, or make suggestions for replications. All of these might highlight considerations that need to be taken into account when designing a replication study that robustly tests the original claim or its generalisability.

Thus, replication researchers should look for post-publication discussions of the target study, such as published comments and reviews, blog posts, or discussions on social media. These can often be found via Altmetric (https://www.altmetric.com) or other tools that allow researchers to quickly identify discussions on social media or news outlets beyond scientific journals (PubPeer, Hypothes.is, or the in-development platform Alphaxiv.org; for a review, see Henriques et al. 2023).

4.2 Reproduction before Replication

Many features of a replication study rest on the correctness of the original report. A reproduction allows researchers to investigate this by uncovering coding errors or fraud and by probing the robustness of analytical decisions and the generalizability of the finding. To make efficient use of resources, we encourage researchers to investigate the original finding’s reproducibility and robustness first. In other words, reproductions should ideally take place before planning and conducting a replication study. Depending on the availability of the code and data, a reproduction can take anywhere from a few minutes to several weeks.

If the original code and dataset are available, researchers can try to numerically reproduce the results. Beware, however, that differences in software versions or default settings may lead to slight deviations or require corrections in some cases (for a large-scale test of reproducibility, see Brodeur, Dreber, et al. 2024). Similarly, the lack of a set seed for random number generators can mean that analyses relying on random numbers (e.g., bootstrapping) cannot be exactly reproduced. If no analysis script is available, analyses need to be recreated from the descriptions in the report (recoding reproduction). In this case, special attention should be paid to processing steps such as the exclusion of outliers, transformation of variables, and handling of missing data. However, in many research areas information on these steps is often incomplete (Field et al. 2019); older research tends to be especially limited in the methodological details it provides. In addition, we recommend testing the robustness of the original finding by making small alterations to the data processing and analysis procedures (robustness reproductions). For example, if the analyses were run on a subset of the data (e.g., participants aged 21 to 30, or excluding outliers beyond ± 3 standard deviations), this subset can be changed (e.g., participants aged 18 to 30, or excluding outliers beyond ± 2 standard deviations). Here, the initial focus should be on choices that are not determined by the theory that is presented, though this approach can also be used to explore the generalisability of some aspects of the theory. Finally, if the original study was preregistered and the original code is available, reproduction researchers can check whether the original analyses adhere to the preregistered analysis plan.
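To illustrate, a robustness reproduction of a simple correlational claim might loop over alternative outlier cutoffs. The sketch below uses simulated stand-in data; the variables, the 0.3 slope, and the cutoffs are hypothetical, and any real reproduction would of course use the original dataset and its documented exclusion rule.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the original study's two focal variables
rng = np.random.default_rng(2024)
x = rng.normal(size=500)
y = 0.3 * x + rng.normal(size=500)

def correlation_without_outliers(x, y, sd_cutoff):
    """Recompute the focal correlation after excluding cases more than
    sd_cutoff standard deviations from the mean of x."""
    z = (x - x.mean()) / x.std(ddof=1)
    keep = np.abs(z) <= sd_cutoff
    r, p = stats.pearsonr(x[keep], y[keep])
    return r, p, int(keep.sum())

# Robustness reproduction: vary the (theoretically arbitrary) exclusion rule
for cutoff in (2.0, 2.5, 3.0):
    r, p, n = correlation_without_outliers(x, y, cutoff)
    print(f"cutoff = {cutoff} SD: r = {r:.3f}, p = {p:.4f}, n = {n}")
```

If the focal estimate changes substantially across such arbitrary choices, that instability is itself an informative reproduction result.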

If neither code nor data are available (or shared by the authors), no reproduction is possible. Researchers can still use automated tools to compare reported p-values with those that can be computed from test statistics via the website statcheck.io (where documents may be uploaded), the corresponding R package (Nuijten and Polanin 2020), or papercheck (DeBruine and Lakens 2025), which is still actively maintained.
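The recomputation such tools perform can be illustrated in a few lines of Python; the reported result t(28) = 2.20, p = .04 below is a made-up example, not taken from any study.

```python
from scipy import stats

def recompute_p(t_value, df, two_tailed=True):
    """Recompute the p-value implied by a reported t statistic,
    as statcheck-style tools do."""
    p = 2 * stats.t.sf(abs(t_value), df)
    return p if two_tailed else p / 2

# Hypothetical reported result: t(28) = 2.20, p = .04
recomputed = recompute_p(2.20, 28)
print(f"recomputed p = {recomputed:.4f}")
# A recomputed value inconsistent with the reported p (beyond rounding)
# would flag a reporting inconsistency.
```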

Figure 4.1. Decision tree to choose between types of reproductions depending on available code and data.

4.3 Close replication before conceptual replication

If the goal is to increase the generalizability of a specific finding, we also suggest starting with replications that adhere as closely as possible to the original study (i.e., close replications) and only later conducting conceptual replications. Based on Hüffmeier, Mazei, and Schultze (2016), we propose the typology and order of replication attempts in Figure 4.2. Importantly, replications at any stage should not compromise any aspects of an original study, but rather, at the latest from the third study stage (constructive replications) onwards, try to improve one or more aspects of the original study, such as “[…] more valid measures, more critical control variables, a more realistic task, a more representative sample, or a design that allows for stronger conclusions regarding causality” (Köhler and Cortina 2021, 494). Köhler and Cortina term such replications “constructive replications” and caution against “quasi-random” replications that vary features without clear rationale.

Finally, there may be cases where the sequence of replications is not necessary, or where the context of the replication team requires a focus on generalisability to a specific context (see Section 7.3).

Note: This is an adaptation and update of the typology of replication studies by Hüffmeier, Mazei, and Schultze (2016). The typology is conceptualized as a hierarchy of studies that together help to (i) establish the validity and replicability of new effects, (ii) exclude alternative explanations, (iii) test relevant boundary conditions, and (iv) test generalizability.

Figure 4.2. Sequence of replications from exact replications to conceptual replications under field conditions.

5 Execution of Reproductions

5.1 Gathering resources

Prerequisites for reproduction studies are available data and, ideally, also code. These are usually linked within the manuscript and shared via repositories (e.g., Zenodo, OSF.io, github.com, gitlab.com), or they are part of the supplemental materials listed on the article’s website. In special cases, an entire original manuscript may be reproducible and written in Markdown. Researchers searching for target studies to reproduce can check topfactor.org and filter for Data transparency level 3 (Leibniz Institute for the Social Sciences 2023; note that this resource will no longer be updated). They can also use the extensive database of economics studies with available data compiled by Sebastian Kranz (Kranz 2025).

If data are not publicly available, researchers can contact the authors of the original study. In this case, we recommend adhering to the Guide for Accelerating Computational Reproducibility in the Social Sciences (ACRE) guidelines for constructive communication (Berkeley Initiative for Transparency in the Social Sciences 2020).

When re-using data, researchers need to respect licenses. Generally, research data should be licensed openly, that is, re-use and alteration should be permitted, typically requiring citation of the original resource (e.g., CC-BY 4.0 Attribution). Note, however, that non-derivative licenses may prohibit reproductions; in that case, separate approval would be required from the copyright holder.

When it comes to reporting, Ankel-Peters et al. (2025) provide a table for reporting results from the computational reproduction that includes resource availability (e.g., raw data, cleaning code, analysis code).

5.2 Contacting Authors

Reproduction authors may have to contact the original authors if something is missing. It will often be necessary to contact the authors more than once, because missing details of the original study only become apparent once the replication study is being planned. In most cases, the original paper identifies one of the authors as “corresponding author” with an e-mail address. We recommend a quick web search to check whether this is the current e-mail address, as researchers frequently change institutions and thus e-mail addresses. Sometimes, it may be most helpful to write to the last author instead, who tends to have a more stable e-mail address, or to copy all authors into the e-mail. Templates for requesting materials and for sharing replication results can be found in Appendix E. Note that original authors may not respond due to institutional changes or because they are no longer active in academia.

5.3 Identification of Claims

Statistical analyses and their results are always used as a way to evaluate a certain claim. While Ankel-Peters et al. (2025) recommend that reproductions identify “results [that] are essential for the paper’s main argument to hold”, we acknowledge that a reproduction can also focus on secondary results if they are relevant in some other context. In either case, reproduction researchers need to justify the choice of the claim in their report.

5.4 Preregistration

Preregistrations contain a description of the planned study or analysis prior to their execution. This way, they can reduce researchers’ ‘degrees of freedom’. In the case of reproductions, they can prevent QRPs (e.g., “null hacking,” Bryan, Yeager, and O’Brien 2019; “gotcha bias,” Berinsky, Druckman, and Yamamoto 2021) as long as the entire analysis plan is preregistered (Brodeur, Cook, et al. 2024) and the data have not yet been accessed. While a numerical reproduction with available code does not require preregistration, we recommend a priori specification of all further planned analyses.

It should be noted that a preregistered analysis plan or analysis script is much easier to create with access to the data, and reproductions are impossible without it. Preregistration therefore cannot exclude the risk that authors have already looked at the data; however, making fraudulent claims regarding data access in a preregistration is evidently academic misconduct. How much weight readers and reviewers will give to a preregistration based on data that could already have been accessed will differ, but creating one is a way for researchers to keep themselves accountable and to produce robust reproductions.

5.5 Deviations

To increase trust in the reported results, reproduction researchers need to report them transparently, both in any preregistration and in the final report. Ideally, all changes to the original procedure are explained and justified, and hypotheses about their expected effect on the outcomes are reported. Note that some journals publishing reproductions require adherence to special requirements, such as a Registered Report format (e.g., Journal of Open Psychology Data) or a minimum of two independent reproductions (e.g., Journal of Robustness Reports).

5.6 Analysis

The main part of the reproduction is the analysis. Factors that are potentially relevant for reproduction success include the operating system of the machine running the code as well as the versions of the software and of additional packages or plug-ins. For example, users of the open-source software R can get a comprehensive overview of the program version and their machine using the function sessionInfo(), which should be included in supplementary materials. For Python users, the session_info package provides a similar function, session_info.show() (https://gitlab.com/joelostblom/session_info).
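For analyses run in Python, a minimal session report can also be assembled from the standard library alone; the package names listed below are placeholders for whatever the analysis actually loads.

```python
import platform
import sys
from importlib import metadata

# Minimal session report (stdlib only); the session_info package
# offers a richer, automatically collected version.
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
for pkg in ("numpy", "scipy"):  # list the packages your analysis actually uses
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

Including such a report in the supplementary materials lets later reproducers match the original computational environment.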

Apart from a numerical reproduction where the same code is used, reproduction researchers can explore alternative approaches that should and should not affect the results, test new hypotheses or theories, and run exploratory analyses. Their report should be clearly structured to distinguish these methods. Finally, for statistical analyses, the reproduction report should include reproducibility indicators (Dreber and Johannesson 2024) that summarize statistical significance and relative effect sizes across the original and reproduction results. Ankel-Peters et al. (2025) recommend a visual summary of these indicators in the form of a reproducibility dashboard and specification curves (Simonsohn, Simmons, and Nelson 2020; see also Mazei, Hüffmeier, and Schultze 2025). We strongly recommend that reproduction researchers consult the respective resources for further details.

5.7 Discussion

The discussion section should include a clear evaluation of the reproduction success on different levels (Ankel-Peters et al. 2025). Researchers should report possible reasons for failure (e.g., objective coding errors, changes in software packages) and the role of differences between the original and the reproduction studies’ results with respect to their conclusions. Finally, if the original authors provided comments, the reproduction report should include a discussion of them.

6 Execution of Replications

6.1 Preregistration and Registered (Replication) Reports

Because replications are often met with skepticism, we encourage researchers to adhere to the highest standards of openness and transparency. This includes preregistering the replication, including the analysis plan (ideally with analysis code that was tested beforehand using data from test runs or simulations) and criteria for distinguishing between replication success and failure. A preregistration without an analysis plan provides no safeguard against p-hacking (Brodeur, Cook, et al. 2024). Note that these criteria can be structured sequentially. For example, if there is a manipulation check, it can be specified that the check has to work for replicability to be evaluated at all. Boyce et al. (2024) also found that repeating unsuccessful replications did not change the outcomes unless obvious weaknesses were fixed.

There is a specific preregistration template by Brandt et al. (2014), but it may not fit the structure of some studies beyond social psychology (e.g., personality science or cognitive psychology; for a list of preregistration templates, see https://osf.io/7xrn9 and https://osf.io/zab38/wiki/home). To facilitate publication of the replication, we furthermore encourage submitting it as a Registered Report; once the protocol has been accepted, rejection based on the results is no longer possible. A list of journals offering Registered Reports (irrespective of replications) is available online.

A special review platform for Registered Reports is Peer Community in Registered Reports (PCI-RR; https://rr.peercommunityin.org), where a community reviews preprints. Once accepted by PCI-RR, authors can decide to publish their paper in participating journals (PCI-friendly journals) without another editorial round.

Finally, replication researchers need to deal with deviations from their preregistration in a transparent way. In principle, there is nothing wrong with deviating from what one had planned; most importantly, all changes should be listed and discussed, and it should be made transparent how the changes affected the results (for recommendations on changes and documentation, see Lakens 2024; Willroth and Atherton 2024). If changes are noticed during data collection, many platforms also allow the upload of amendments with preserved version history.

6.2 Sample Size Determination

For replication studies, power analyses or other types of sample size justification can be simpler than for studies testing entirely new hypotheses, because there already is a study that did what one is planning, with a result that one can refer to. However, we advise against simply using the original study’s sample size. While the maxim for most decisions is to stay as close as possible to the original study, sample sizes of replication studies usually need to be larger: to be informative, a replication failure should provide evidence for a null hypothesis or a substantially smaller effect size, which requires a larger sample. While a general tutorial for sample size justification is provided by Lakens (2022), we briefly present approaches that are suited to replication studies.

As a pair of original and replication studies is usually concerned with multiple effect sizes (e.g., for different scales/items/groups/hypotheses), their number and individual power need to be considered carefully. If the interpretation relies on the significance of all effect sizes, the total power will be smaller than the power for each individual test. To manage limited resources, researchers may choose one single effect size and argue that it is central, or clearly specify other methods for aggregation across results (e.g., testing multivariate models).
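Assuming independent tests, the joint power of requiring all results to be significant is simply the product of the individual powers, as this quick sketch shows:

```python
def joint_power(powers):
    """Probability that ALL tests are significant, assuming independent tests
    (a simplification; correlated outcomes fall somewhere in between)."""
    total = 1.0
    for p in powers:
        total *= p
    return total

# Three hypotheses, each tested with 90% power:
print(joint_power([0.9, 0.9, 0.9]))  # ≈ 0.73, well below the per-test power
```

So a replication that declares success only if all three tests are significant has markedly less than 90% power overall, even though each test individually meets that standard.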

6.2.1 Small Telescopes Approach

The idea behind the small telescopes approach (Simonsohn 2015) is that a replication study should be precise, but its precision need not greatly exceed that of the original study. Specifically, the replication study should be able to detect an effect size for which the original study had insufficient power (usually 33%). If that effect size can be ruled out, the original study can be treated as uninformative, as with such low power, the result becomes more likely to have been a false positive.

This approach is based on the notion that replications should assess the evidentiary value of the original study, and that the ‘burden of proof’ shifts back to proponents of a hypothesis if their evidence is shown to be very weak. It is particularly appropriate when original studies are very imprecise. In that case, a replication that finds a much smaller effect may well still be compatible with the (wide) confidence interval of the original study, and it might be impossible to reject the original claim on that basis.

As an example, Schultze, Gerlach, and Rittich (2018, fig. 4) found an effect in three studies with an average effect size of r = -.11, 95% CI [-.22, -.01].

If we wanted to achieve high power to rule out an effect of -.01, and thus show that the true effect does not fall into their confidence interval, we would need a sample size of 108,218 participants (alpha = 5%, one-tailed test). Conversely, with the small telescopes approach, we would aim to test whether the replication effect is smaller than the effect which the original study had 33% power to detect, r = -.043 (alpha = 5%, one-tailed test). Simonsohn (2015) showed that this requires a sample 2.5 times as large as the original for 80% power. However, we deem that level of power insufficient for replications and instead suggest aiming for 95% power (given that a false negative in a replication leads to a wrong claim regarding the absence of an effect). This requires a multiple of 4.5 (rather than 2.5, see Wallrich 2025), so in this case a sample of 4.5 * 793 = 3,569 participants. If this replication then results in an estimate that is significantly smaller than the effect the original study had 33% power to detect, the small telescopes approach would suggest treating the original study as unable to provide reliable evidence for its claim.
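These numbers can be reproduced with a short script. The sketch below uses the Fisher z approximation for the power of a correlation test; it illustrates the small telescopes logic under stated assumptions rather than reproducing Simonsohn's (2015) exact code.

```python
from math import ceil, sqrt, tanh
from scipy.stats import norm

def detectable_r(n, power=0.33, alpha=0.05):
    """Correlation that a study with n participants could detect with the
    given power in a one-tailed test (Fisher z approximation)."""
    z = (norm.ppf(1 - alpha) - norm.ppf(1 - power)) / sqrt(n - 3)
    return tanh(z)

n_original = 793  # combined n of the original studies in the example
r33 = detectable_r(n_original)
print(f"effect the original had 33% power to detect: r = {r33:.3f}")  # ≈ .043
print(f"replication n for 95% power: {ceil(4.5 * n_original)}")       # 3,569
```

The benchmark effect (r = .043, ignoring sign) matches the value given in the text; the replication then tests whether its estimate is significantly smaller than this benchmark.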

6.2.2 Equivalence Testing

If statements can be made about the smallest effect size of interest (SESOI), researchers can aim to test whether the replication effect is smaller than that. Given that the direction is fixed by the original, this simply requires running a one-sided test (e.g., a t-test in the case of a two-group design) in the “lesser” direction. If the replication effect size is significantly smaller than the SESOI, the original claim is taken to be refuted in this instance by those who accept that this is really the smallest effect of interest. Lakens et al. (2018) provide a practical tutorial on equivalence testing, though they focus on cases where observations in either direction would falsify the null hypothesis.
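A minimal sketch of such a one-sided test against the SESOI for a two-group design follows; the data, the SESOI of 0.30 (in raw-score units), and the group names are all hypothetical.

```python
import numpy as np
from scipy import stats

def smaller_than_sesoi(g1, g2, sesoi):
    """One-sided t-test of H0: mean(g1) - mean(g2) >= sesoi (raw-score units).
    A small p-value indicates the observed effect is significantly smaller
    than the smallest effect size of interest."""
    n1, n2 = len(g1), len(g2)
    diff = np.mean(g1) - np.mean(g2)
    pooled_var = ((n1 - 1) * np.var(g1, ddof=1)
                  + (n2 - 1) * np.var(g2, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return stats.t.cdf((diff - sesoi) / se, df=n1 + n2 - 2)

# Hypothetical replication data with a near-zero true effect:
rng = np.random.default_rng(7)
treatment = rng.normal(0.02, 1, 400)
control = rng.normal(0.00, 1, 400)
p = smaller_than_sesoi(treatment, control, sesoi=0.30)
print(f"p (effect smaller than SESOI) = {p:.4f}")
```

A Welch correction or a standardized-effect formulation would be common alternatives to the pooled-variance version sketched here.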

6.2.3 Bayesian Approach

External knowledge can be incorporated into sample size planning (uninformative/flat priors; heterogeneity; shrinkage) using the R package BayesRepDesign (Pawel, Consonni, and Held 2023). Moreover, Micheloud and Held (2022) provide a method for incorporating an original study’s uncertainty into power calculations. With interim analyses (e.g., sequential testing), a replication study can also be stopped early to prevent wasting resources (E. J. Wagenmakers, Gronau, and Vandekerckhove 2019). However, when planning to use Bayes factors to make inferences about replication success, it is important to plan to use plausibly narrow priors. Priors that assign substantial likelihood to effects that are rarely observed (e.g., N(0,1) priors for standardized mean differences in the social sciences) may be taken to unfairly privilege the null hypothesis, which is inappropriate for a study setting out to find support for it.
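The sensitivity of Bayes factors to prior width can be illustrated analytically for a normally distributed effect estimate. This normal-normal sketch is a deliberate simplification (the effect estimate, its standard error, and the prior widths are hypothetical); BayesRepDesign and related packages implement more refined replication designs.

```python
from math import sqrt
from scipy.stats import norm

def bf01(effect_estimate, se, prior_sd):
    """Bayes factor for H0 (delta = 0) over H1 (delta ~ N(0, prior_sd^2)),
    given a normally distributed effect estimate with standard error se.
    Both marginal likelihoods are available in closed form."""
    marginal_h0 = norm.pdf(effect_estimate, loc=0, scale=se)
    marginal_h1 = norm.pdf(effect_estimate, loc=0, scale=sqrt(se**2 + prior_sd**2))
    return marginal_h0 / marginal_h1

# A small observed effect (0.05, se = 0.10) under increasingly wide priors:
for tau in (0.2, 0.5, 1.0):
    print(f"prior sd = {tau}: BF01 = {bf01(0.05, 0.10, tau):.2f}")
```

The wider the prior, the larger BF01 becomes for the same data, illustrating why implausibly wide priors can unfairly favor the null hypothesis.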

6.2.4 Meta-Analytical Estimates

If the replication study is part of a larger research programme, it is possible to include other studies in the estimate of the (minimum) effect size one wishes to detect/rule out. The target study may be part of a multistudy paper with at least one other study that includes an effect size for the hypothesis of interest. Researchers can compare the effect sizes and possibly pool them to get a more precise estimate (for a related Shiny App, see McShane and Böckenholt 2017).

Metrics such as average effect sizes, heterogeneity, or the confidence interval width are valuable estimates needed for the replication’s sample size justification. If there is a meta-analysis on the general topic, researchers can also use that to inform sample size planning, but should prioritise estimates that aim to correct for publication bias and other QRPs (for an overview see Nagy et al. 2025). They should also choose effect sizes from a set of studies that resembles the planned replication study as closely as possible. For correlational effects, researchers can check metabus.org (Bosco, Uggerslev, and Steel 2017) to identify similar studies.
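Fixed-effect inverse-variance pooling of effect sizes from a multistudy paper can be sketched in a few lines; the two effect sizes and standard errors below are hypothetical.

```python
from math import sqrt

def pool_fixed(effects, ses):
    """Fixed-effect inverse-variance weighted mean effect and its standard error."""
    weights = [1 / se**2 for se in ses]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))

# Hypothetical effect sizes (d) and standard errors from two studies in one paper:
d, se = pool_fixed([0.35, 0.25], [0.12, 0.10])
print(f"pooled d = {d:.3f}, SE = {se:.3f}")
```

The pooled estimate is more precise than either single study. When effects may genuinely vary across studies, a random-effects model, which additionally estimates between-study heterogeneity, is the more appropriate choice.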

6.2.5 Multilab Replications

Multilab replications, that is, replications that are conducted by different groups of researchers in different locations adhering to the same protocol, allow researchers to investigate heterogeneity of effects and estimate effect sizes with high precision. There are currently no standards for planning sample sizes for multilab replications. Depending on the specific goals, a power analysis needs to account for possible moderator hypotheses and the desired precision of effect size, heterogeneity estimates, or cultural variables. Note that this often requires large sample sizes for any level of the moderator (e.g., culture, profession). Usually, the different labs are required to collect data from a minimum number of participants. Each lab’s study and all analysis scripts should be preregistered to prevent local and global QRPs such as optional stopping or ad hoc exclusions of single labs.

6.3 Changes in the Methods

In the case of a direct/close replication, the replication study should closely resemble the original study. However, this is difficult for multiple reasons: First, original studies may not include sufficient detail to allow for a replication (Aguinis and Solarino 2019; Errington, Denis, et al. 2021). Second, scientific progress in the form of new methods and insights, as well as cultural changes, might require replication researchers to make changes or additions to their study. Third, obvious errors must be corrected. Below, we elaborate on a number of reasons to deviate from an original study. In the replication report, all deviations should be reported and justified exhaustively.

  • Unspecific original materials: If the original study does not specify a key element that is needed for the replication, replication researchers can reach out to the original study’s authors and ask for the missing information. If this is not possible because the authors cannot be reached or are unwilling/unable to share the materials, new materials must be created. In this case, special attention should be paid to the theory, so that the new materials exhibit both face and construct validity.
  • Deprecated materials: If a psychological study about person perception published in the 1980s used celebrities, the examples used may no longer have the same meaning today. For example, Mussweiler et al. (2000) used “a 10-year old car (1987 Opel Kadett E)” to be evaluated in German Marks. For a new study, both car and currency would have to be replaced, as a car’s age is strongly associated with its price. Like most studies, the original provides no details about the conditions that a new stimulus would have to meet. Ideally, the theoretical requirements for stimuli should be specified in primary research; where they are not, replication authors need to make their own assumptions and report them explicitly (see Simons, Shoda, and Lindsay 2017).
  • Translation: Most published original studies are in English. If the replication sample’s mother tongue is not English, translation may be necessary. Standards for translation differ strongly even between subfields. For example, when a personality scale is translated, the translated version will usually be validated and tests of invariance will be required. In social psychology, such procedures are less common, and often merely a back-translation is conducted. However, in any field, measurement invariance is required if one wants to compare effect sizes across samples, so it should be tested rather than assumed where possible.
  • Necessity of a special sample: Many large-scale replication projects made use of click workers (e.g., via MTurk) or student samples. Replicators should consider whether using such samples satisfies their needs and evaluate which platform to use (for best practices and ethical considerations, see Kapitány and Kavanagh 2023). Even if the original study used such a convenience population, changing to a different convenience population may require tweaks to maintain comparability, e.g., with regard to participant attentiveness and engagement with the paradigm.
  • Quality of methods and apparatus: Replicating old studies often faces the problem that something new has been discovered that should be taken into account. If a specific tool or method is used, there may be a more recent method that is more suitable. For example, software for eye-tracking studies from the early 2000s is now deprecated; there is new hardware and software that researchers will use. This might also apply to analysis methods, yet where possible, the results from both the original methods and state-of-the-art methods should be reported; where a choice has to be made, it is essential that invalid methods are avoided while comparability is maintained as far as possible. Finally, if the original finding’s generalizability is tested, new items or tasks that vary more or less systematically can be added to compare results for the original parts versus these extensions (though order effects have to be carefully considered, as a second manipulation might affect participants differently from a first manipulation).
  • Adding checks: Doing a replication often implies some uncertainty in the results, so it is wise to include checks that will be helpful for interpreting the results, especially if they are unexpected. For example, if there are occurrences that would make the results meaningless, it is good to have a way to measure them and incorporate that into the study. This could include positive or negative controls (items that are diagnostic of the method rather than the question of interest), manipulation checks (generally placed after the critical parts of the experiment), or attention checks. See Frank et al. (2025, chapter 12.3) for further discussion.

6.4 Piloting

If considerable resources are linked to the full execution of a replication (e.g., in a Registered Replication Report), or when new materials are used, researchers may want to consider piloting it (or parts of it) first. For multi-lab replications, researchers may want to consider a sequential study order in contrast to a simultaneous design. As Buttliere (2024) put it: “Who gets better results, 39 people doing it the first time or one person doing it 39 times?” (p. 4). Beware that piloting may not be of value if it is simply an underpowered version of the study; instead it may be used to identify flaws in the methodology or test assumptions about the distribution of values or participants’ qualitative responses. Importantly, small pilot studies should never be used to derive effect sizes for power analyses as their results are too imprecise.
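The warning against pilot-derived effect sizes can be illustrated with a short simulation (a hypothetical sketch, not part of the handbook's own analyses; the true effect of d = 0.4 and the sample sizes are illustrative assumptions): even with the true effect held fixed, pilot-sized samples yield estimates of Cohen's d that scatter far too widely to anchor a power analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
true_d = 0.4        # assumed true standardized effect (illustrative)
n_sims = 2000       # number of simulated experiments per sample size

def simulated_ds(n):
    """Simulate n_sims two-group experiments with n per group and
    return the observed Cohen's d of each."""
    a = rng.normal(true_d, 1.0, size=(n_sims, n))
    b = rng.normal(0.0, 1.0, size=(n_sims, n))
    pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / pooled_sd

pilot = simulated_ds(10)   # typical pilot size
full = simulated_ds(200)   # well-powered study

print(f"pilot (n=10/group):  sd of d estimates = {pilot.std():.2f}")
print(f"full (n=200/group):  sd of d estimates = {full.std():.2f}")
```

With n = 10 per group, individual pilot estimates routinely land anywhere from strongly negative to roughly d = 1, so a power analysis based on any single pilot would be essentially arbitrary.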

More generally, researchers should follow best practices for their replications, including piloting their study on a few participants to ensure that the instructions are clear, that the procedure works smoothly (e.g., the website loads appropriately), and that all necessary data are recorded. A debriefing survey in which pilot participants are asked about their experience, the clarity of instructions, and the clarity of any user interface can help to identify issues that could undermine the replication. See Frank et al. (2025, chapter 12.3.1) for further discussion of piloting studies.

6.4.1 Collaborating and Consulting with the Original Authors

To reduce the chance that a failure to replicate is dismissed by the original study’s authors afterwards by pointing out a flaw in the method, researchers can consult with the original authors before running the study. However, in the past, this still has not kept original authors from dismissing a replication as an inadequate test of a hypothesis (Baumeister and Vohs 2016). Note that replication researchers have even been accused of “null hacking” (Protzko 2018), although little evidence exists for this claim (Berinsky, Druckman, and Yamamoto 2021). While involving original authors can help in creating a good study when reporting is poor, ideally original studies should be reported in sufficient detail for others to replicate them without further involving the original authors. Historically, the relationship between involvement of original authors and the average replication effect size is not clear (although there have been lab effects in some cases; Powers et al. 2013). This is showcased in the following examples:

  • Powers et al. (2013) investigated the effect of video games on information processing and found larger effect sizes for active research groups.
  • Ten effects from the Open Science Collaboration (2015) were replicated in Many Labs 5 (Ebersole et al. 2020), where the original authors commented on the study protocols of the planned replications before they were conducted, and “the revised protocols produced effect sizes similar to those of the RP:P protocols (Δr = .002 or .014, depending on analytic approach).”
  • McCarthy et al. (2021) conducted a multisite replication of hostile priming in which one of the original authors was involved. Each laboratory conducted a close and a conceptual replication; the teams found no difference and recommended that “researchers should not invest more resources into trying to detect a hostile priming effect using methods like those described in Srull and Wyer (1979)”.
  • After Baumeister and Vohs (2016) criticized the methods of the failed Registered Replication Report by Hagger et al. (2016), Vohs et al. (2021) conducted another Registered Replication Report and also found a null effect.
  • After no effect of the pen-in-mouth task was found in the facial feedback Registered Replication Report by Wagenmakers et al. (2016), another multilab test, which included one of the original authors, arrived at the same results (Coles et al. 2022).
  • The Many Labs 4 project set out to test the effect of author involvement on replication success but found an overall null effect both for the studies that did and for those that did not involve the original finding’s authors (Klein et al. 2014).
  • For social priming studies’ replication success, “the strongest predictor of replication success was whether or not the replication team included at least one of the authors of the original paper” (Mac Giolla et al. 2024).

6.5 Adversarial Collaborations

Although they are not specific to replication projects, researchers have repeatedly called for adversarial collaborations (Clark et al. 2022; Cowan et al. 2020; Corcoran, Hohwy, and Friston 2023), in which groups of researchers with conflicting views jointly design and conduct a study intended to settle a debate. A related idea is that of “red teams”, where experts are invited to critique the analysis plan without becoming authors and thus without a conflict of interest in terms of desired results (2018).

6.6 Analysis

Analyses of replication results are often a compromise between, or a combination of, the original analysis and the current state of the art. Generally, replication studies should follow the original analysis plan as closely as possible. This concerns not only statistical procedures but also data processing (e.g., exclusion of outliers, transformation and computation of variables). Even when following the original analysis plan for their confirmatory analysis, researchers should still follow best practices and examine their raw data for distributional anomalies to detect whether participants might be inattentive, guessing, or speeding, and report relevant sensitivity checks where helpful. Things to check include theory-agnostic condition/manipulation checks (e.g., were participants faster in the condition focused on speed?) and the results of attention checks or control trials. Generally, it is advisable not to remove participants from the main analysis on that basis, but instead to confirm that the rates of non-compliance are acceptably low and to report robustness to the exclusion of these participants. See Ward and Meade (2023) for a comprehensive review of strategies for assessing and responding to careless responding.
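A minimal sketch of this reporting pattern follows (all data, the two-second speed cutoff, and the attention-check variable are illustrative assumptions, not prescriptions): careless responders are flagged and reported, and the key effect is computed both with and without them.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
condition = rng.integers(0, 2, n)                 # 0 = control, 1 = treatment
outcome = 0.4 * condition + rng.normal(0, 1, n)   # simulated outcome
median_rt = rng.lognormal(1.5, 0.4, n)            # median response time (s)
passed_check = rng.random(n) > 0.05               # hypothetical attention check

# Flag (but do not silently drop) likely careless responders
speeder = median_rt < 2.0                         # illustrative speed cutoff
careless = speeder | ~passed_check
print(f"non-compliance rate: {careless.mean():.1%}")

def condition_effect(mask):
    """Mean treatment-control difference among participants in `mask`."""
    return (outcome[mask & (condition == 1)].mean()
            - outcome[mask & (condition == 0)].mean())

everyone = np.ones(n, dtype=bool)
print(f"effect, all participants: {condition_effect(everyone):.3f}")
print(f"effect, compliant only:   {condition_effect(~careless):.3f}")
```

Reporting both figures lets readers verify that conclusions do not hinge on the exclusion decision.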

At times, methodological advances may suggest that the original statistical tests are not robust. In such cases, researchers may want to run both the test that the original study used and the statistical approach that is most appropriate by today’s standards (for instance, both the t-test that can be compared with the original and the mixed-effects model that is justified by the study design). Where the original data are available, or can be obtained from the original authors, researchers may also be able to update the analyses in the original study, which facilitates interpretation.
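As a self-contained sketch of reporting both the original-style and a more modern test, the following compares a pooled-variance Student t-test with Welch's unequal-variance test on toy data (Welch's test stands in for the "today's standards" analysis here to keep the example dependency-free; a mixed-effects model would follow the same report-both logic):

```python
import numpy as np

# Toy two-group data with unequal group sizes and variances
rng = np.random.default_rng(7)
a = rng.normal(0.5, 1.0, 60)
b = rng.normal(0.0, 2.0, 30)

def student_t(x, y):
    # pooled-variance t-test, as many original studies reported it
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    t = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    # unequal-variance test, the current default recommendation
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df

t_orig, df_orig = student_t(a, b)
t_new, df_new = welch_t(a, b)
print(f"original-style test: t = {t_orig:.2f}, df = {df_orig}")
print(f"modern test:         t = {t_new:.2f}, df = {df_new:.1f}")
```

Reporting both keeps the result comparable with the original paper while also giving the statistically preferable answer.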

Where original statistical analyses are fundamentally flawed, replication researchers are faced with a difficult choice. For instance, it has been convincingly demonstrated that the famous Dunning-Kruger effect (Kruger and Dunning 1999) is based on analyses strongly influenced by a statistical artifact, namely regression to the mean (Gignac and Zajenkowski 2020). In such a context, one may want to report results based on the original methods alongside more robust tests, yet one needs to be very careful to frame them in a way that “replication success” cannot be claimed in the absence of evidence for the original claim.
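The regression-to-the-mean artifact is easy to demonstrate with simulated data (a toy sketch of the statistical point, not the published analyses): even when actual performance and self-assessment are entirely independent, the classic quartile plot makes the bottom quartile appear to overestimate itself and the top quartile to underestimate itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Null world: actual performance and self-estimated percentile are
# independent, i.e., there is no genuine skill-dependent miscalibration
actual_pct = rng.uniform(0, 100, n)
estimated_pct = rng.uniform(0, 100, n)

# Classic quartile analysis: bin by actual performance, compare means
quartile = np.digitize(actual_pct, [25, 50, 75])   # 0..3
for q in range(4):
    m = quartile == q
    gap = estimated_pct[m].mean() - actual_pct[m].mean()
    print(f"quartile {q + 1}: mean self-estimate minus actual = {gap:+.1f}")
```

The bottom quartile "overestimates" by roughly +37 percentile points and the top quartile "underestimates" by roughly the same margin, purely because extreme actual scores regress toward the mean self-estimate.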

Exclusion criteria are another area where there may be tension between the original study and current best practices. Typically, it makes sense to run the analysis both ways to check for robustness, yet one analysis choice should be preregistered as the central analysis.

Naturally, original and replication results should be compared. Unstandardized values can be informative with respect to sample characteristics (e.g., overall reaction times). How to do this analytically depends on the choice of success criteria discussed in the next section.

7 Discussion

7.1 Defining and Determining Replication Success

There is no strong consensus yet on what constitutes replication success, and some approaches can be biased (Schauer and Hedges 2021) or imprecise (Patil, Peng, and Leek 2016b). As in classical null hypothesis significance testing (NHST), replication researchers face a trade-off between dichotomizing something that is not dichotomous (success vs. failure) and making a clear decision about the outcome. On the one hand, this is a question about statistical choices and their interpretation, namely how to compare original and replication effect sizes (or p-values) and how to interpret differences. On the other hand, it is a more complex question about how to interpret a mixed pattern of results, where some results are consistent across original and replication while others are not. Here, it is important for replication researchers to specify in their preregistration which effects are of primary interest and how they will aggregate results, noting that requiring multiple effects to yield the same result will reduce statistical power.
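The power cost of requiring several effects to replicate simultaneously follows directly from multiplication (assuming, for illustration, independent tests at 80% power each):

```python
# Joint power when replication "success" demands that all k independent
# key effects are individually significant (80% power each, illustrative)
per_test_power = 0.80
for k in (1, 2, 3, 5):
    print(f"{k} effect(s): joint power = {per_test_power ** k:.2f}")
```

Already with three required effects, joint power drops to about 51%, i.e., a fair replication of a true set of findings would "fail" roughly half the time under this criterion.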

Below, we present different approaches to assessing replication success as summarized by Heyard et al. (2025; see also Muradchanian et al. 2021; Röseler and Wallrich 2024; Errington, Denis, et al. 2021, Table 1).

Name Question answered Type of Reproducibility investigated
Quantified reproducibility assessment, QRA “After performing multiple measurements of an object, what is the precision of the measured quantity obtained?” Same data – same analysis; Different data – same analysis; Same data – different analysis; Different data – different analysis
Jaccard similarity coefficient “By what extent do the results of two (or more) fMRI experiments overlap?” Same data – same analysis
Sceptical 𝑝-value “To what extent are the results of a replication study in conflict with the beliefs of a sceptic of the original study?” Different data – same analysis
Modified Brinley plot “Given a pre-specified desired effect and multiple replications, what is the share of replications that, represented graphically, achieve the desired effect?” Same data – same analysis; Same data – different analysis; Different data – same analysis; Different data – different analysis
Likelihood-based approach for reproducibility “Given a theoretically interesting effect size derived from the original study, what is the evidence for or against replicating this effect?” Different data – same analysis
Bayesian mixture model for reproducibility rate “Given the results (𝑝-values) from a set of original and replication studies, what is the rate of reproducibility, and how is it related to certain aspects of the experiments?” Different data – same analysis
Unified framework for estimating the credibility of published research “For a specific published research work, what is the evidence for its credibility measured on four different dimensions: method and data transparency, analytic reproducibility, analytic robustness and effect reproducibility?” Same data – same analysis; Same data – different analysis; Different data – same analysis; Different data – different analysis
Reproducibility scale of workflow execution – Tonkaz “Given a certain original research paper with results based on computation, can the workflow to generate the results be executed and verified?” Same data – same analysis; Same data – different analysis; Different data – same analysis; Different data – different analysis
Mean relative effect size “What is the average ratio of replication study effects to original study effects?” Different data – same analysis; Same data – same analysis
Correlation between effects “Do the replication studies and the original studies produce effects that are correlated?” Different data – same analysis; Same data – same analysis
Fragility Index “Given the results of an original study were significant, what is the smallest change in the original data that is needed to deem the results non-significant? and vice-versa for original null results” – “How fragile are the original results to small changes in the underlying data?” Same data – different analysis
Externally standardized residuals “Is the original study consistent with the replication(s)?” – “Are all studies included in a meta-analysis replicable?” Different data – same analysis; Same data – different analysis
Snapshot hybrid “After replicating an original study, what is the evidence for a null, small, medium or large effect?” Different data – same analysis
Bayesian Evidence Synthesis “Given several conceptual replications with substantial diversity in data, design and methods but investigating the same theory, what is the evidence underlying a certain theory of interest?” Different data – different analysis
Design analysis “Given the results of an original study and an effect of a hypothetical replication study, what is the probability of the estimate being in the wrong direction, and what is the factor by which the magnitude of the effect is overestimated?” Different data – same analysis
Reproducibility Maps “For fMRI research, how many and which of the truly active voxels were strongly reproduced?” Same data – same analysis; Same data – different analysis
Continuously cumulating meta-analytic approach “Given subsequent replications that were performed to date, what is the current evidence for an effect?” Different data – same analysis
Correspondence test “To what extent does the effect size from the replication study differ or is equivalent to that of the original study?” Different data – same analysis
Z-curve “Do all studies combined provide credible evidence for a phenomenon?” Different data – same analysis
Cross-validation methods “To what extent can the stability of a result be trusted, and to what extent can the result be generalized?” Different data – same analysis
Network Comparison Test, NCT “Given two network structures, how similar are they to each other?” Same data – same analysis; Different data – same analysis
Leave-one-out error “Given a deep learning model, how generalizable are its results?” Different data – same analysis
Subjective reproducibility assessment “Does the replication team consider the replication as successful?” – “To what extent does the replication team trust in the reproducibility of a finding?” Different data – same analysis
I squared – 𝐼2 “Given a set of replications, to what extent is the total variation across study results due to heterogeneity?” – “How consistent are the results across replications?” Different data – same analysis; Different data – different analysis
Credibility analysis “How credible are the results of a study, in a Bayesian framework?” Different data – same analysis
Consistency of original with replications, 𝑃orig “To what extent are the replication effect sizes consistent with the effect size of an original study?” Different data – same analysis; Different data – different analysis
Proportion of population effects agreeing in direction with the original, 𝑃>0 “To what extent do the replication effect sizes agree with the sign found in the original study?” Different data – same analysis; Different data – different analysis
RepliCATS “How reliable do experts believe the claims from an original finding are?” Different data – same analysis
RepeAT – Repeatability Assessment Tool “Does the presented research align with community standards of reproducible biomedical research, using electronic health records?” Same data – same analysis; Different data – same analysis
P interval “Given the results of an original study, what is the range of 𝑝-values a replication (following the same design) would lie in with 80% probability?” Different data – same analysis
RipetaScore “Given certain trust in research, reproducibility and professionalism quality indicators, how high does a paper score?” Same data – same analysis; Different data – same analysis
Bland-Altman Plot “Do the effects estimated in several original-replication study pairs agree with each other?” – “How good is the agreement between repeated measures/studies?” Same data – same analysis; Same data – different analysis; Different data – same analysis; Different data – different analysis
Sceptical Bayes Factor “In light of the replication data, at which level of evidence can an advocate of the original study convince a sceptic?” Different data – same analysis
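To make a few of the simpler criteria concrete, the following sketch evaluates a hypothetical original-replication pair on three of them: significance of the replication effect, a prediction-interval check in the spirit of Patil, Peng, and Leek (2016b), and a z-test of the difference between effects. All summary statistics are invented for illustration, and the formulas are simplified (normal approximation throughout).

```python
import math

# Hypothetical summary statistics: standardized effects and standard errors
d_orig, se_orig = 0.45, 0.15
d_rep, se_rep = 0.20, 0.10

z_crit = 1.96  # two-sided 95%

# 1. Significance criterion: is the replication effect itself significant?
sig_rep = abs(d_rep / se_rep) > z_crit

# 2. Prediction-interval criterion: does the replication estimate fall
#    within the range expected for a replication given the original?
half_width = z_crit * math.sqrt(se_orig**2 + se_rep**2)
in_pi = (d_orig - half_width) <= d_rep <= (d_orig + half_width)

# 3. Difference criterion: z-statistic for original vs. replication effect
z_diff = (d_orig - d_rep) / math.sqrt(se_orig**2 + se_rep**2)

print(f"replication significant:        {sig_rep}")
print(f"inside 95% prediction interval: {in_pi}")
print(f"z for difference in effects:    {z_diff:.2f}")
```

Note that the three criteria can disagree: here the replication effect is significant and compatible with the original under the prediction interval, even though it is less than half the original's size.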

7.2 Interpreting Divergent Results (Replication Failures)

When replications succeed, the original claim gains further credence (as long as the methods are sound). However, when replications fail, many explanations and interpretations can be advanced, which need to be carefully considered and discussed in a report. While replication failure can highlight issues with statistical conclusion validity in the original studies (John, Loewenstein, and Prelec 2012; Nelson, Simmons, and Simonsohn 2018; Simmons, Nelson, and Simonsohn 2011), other explanations need to be considered, including issues with internal, external, and construct validity in both original and replication studies (Fabrigar, Wegener, and Petty 2020; Vazire, Schiavone, and Bottesini 2022). For example, internal validity is threatened when attrition rates differ between experimental conditions in original or replication studies, creating potential confounds in the interpretation of treatment effects (H. Zhou and Fishbach 2016). Construct validity is threatened when original or replication studies use unvalidated ad-hoc measures, fail to employ validated manipulations of the target construct, or when differences in sample characteristics between original and replication studies mean that manipulations and measures do not work as intended (Fabrigar, Wegener, and Petty 2020; Fiedler, McCaughey, and Prager 2021; Flake and Fried 2020). External validity is threatened when original findings do not generalize to the specifics of the replication study due to person and context differences between studies that moderate the effect. Thus, before making statements about the original finding’s robustness and generalizability, replication researchers need to critically discuss potential methodological shortcomings in both original studies and replication attempts that limit statistical conclusion, internal, external, and construct validity.

7.2.1 Hidden Moderator Account

One challenge for replication researchers is the identification of hidden or unknown confounds that may influence or bias the phenomenon under study. Each study has a set of potential extraneous or unknown moderator variables that is unique to it. These may seem trivial, such as the brightness of an experimental laboratory, or important, such as a cultural difference between the replicating and original studies. Yet even seemingly trivial differences could potentially change results. Often, statistical and methodological choices are made to circumvent or attenuate these issues. However, for some paradigms, these variables could be unknown even to the original researcher (Fiedler 2011). They are referred to in the literature as unknown moderators, background variables, hidden moderators, or fringe variables. While appeals to hidden moderators can always be used to reject unwelcome replication results, such variables can genuinely bias replications, which highlights that a single replication is never entirely conclusive (though it might raise enough doubts that researchers do not see the value in addressing the remaining uncertainty). It should be noted that the same argument could be applied to raise doubts about any original study, questioning whether the effect is really due to the hypothesised cause or due to some hidden moderator or background variable. Clearly, a skeptic who stops at that level would not be taken very seriously, so it is important to move conversations about replication failure beyond general suspicion of hidden moderators.

Bargh (2006) suggested that the evidence generated by empirical findings far outweighs the resources of (social) psychology to conceptualize and understand the mechanisms underlying their effects. Therefore, boundary conditions are not easily specified, which can impact both direct and conceptual replication success. A replication failure indicates that the original claim does not generalise to the setting of the replication; whether the failure also extends to the setting of the original study needs to be considered in light of theory, and might be a legitimate matter of contention.

7.3 The Role of Differences for the Interpretation of Findings

Each replication outcome should be evaluated in the light of its closeness, which is why all deviations, the reasons for them, and, if possible, their potential impact on the results should be discussed. Existing theories may help assess whether a deviation should affect the outcomes. For example, most psychological theories are agnostic towards age, so a different distribution of participants’ ages will be unproblematic in most cases. Researchers may choose to evaluate replications from both phenomenon-focused/inductive and theory-focused/deductive views. Different types of interpretations are listed in Figure 7.1, integrated from previous accounts by Borgstede and Scholz (2021) and Freese and Peterson (2017, fig. 3).

Figure 7.1. The interpretation of replication outcomes depends on the closeness of the replication and the similarity of results, as well as on the view taken (inductive vs. deductive).

Note.

  • Inductive or phenomenon-oriented views assume minimal generalizability of the original finding. Accordingly, a replication cannot cast doubt on the original finding unless it is highly similar to the original study.
  • Deductive or theory-oriented views assume maximal generality of a finding. Accordingly, different results (i.e., replication failures) cast doubt on the theory regardless of the replication type.

7.4 Comments from the Original Study’s Authors

If the replication results do not converge with the original results, replication researchers can reach out to the original study’s authors and ask for a comment that they can publish together with the replication report. A template for such a request is provided in the appendix. Note that some journals (e.g., the Journal of Comments and Replications in Economics) require such statements at the time of submission.

Part III Advanced Topics and Applications

8 Communicating and Publishing

The final step of replication research is publishing and communicating the results. Researchers should adhere to the Transparency and Openness Promotion guidelines (TOP, Grant et al. 2024) and to the reporting standards of their respective field (e.g., the JARS standards for reporting psychology replications, American Psychological Association, n.d.). For example, they should report a link to the preregistration, analysis plan, and analysis script, share all materials and data (if possible in light of ethical and legal limitations) under an open license (see also Janz and Freese 2021), and report methods and results comprehensively. Finally, in writing the report, reproduction and replication authors should be mindful of their language. Ideally, being replicated would be an honor for authors, since other researchers deem their findings important, but a failed replication could potentially harm the reputation of the original authors and increase distrust towards them among their peers. We recommend descriptive and impersonal language. When criticizing poor documentation, lack of access to data, or brevity in methods, replication authors should keep in mind the historical context of the original publication. For example, sharing data was much more difficult in the 1990s and not required in many areas until recently.

Authors often choose the journals that published the original studies for their own publication, in accordance with the pottery barn rule (Srivastava 2012). However, in our experience, many journals reject replications due to their lack of novelty. We list several options for writing and publishing the report in Table 8.1. These are non-exclusive, that is, researchers can pursue more than one of them. An overview of active journals that exclusively publish replications is given in Table 8.2.

Table 8.1. Reporting and communicating reproductions and replications.
Type Description
FORRT Replication Database This open and collaborative database contains thousands of replication findings and makes them visible. Anyone can enter results using a guided survey (https://t1p.de/fred_submit).
PubPeer Researchers can comment on the original study and say that there is a replication attempt, describe the outcome, and provide links/references/DOIs to the replication(s). Researchers checking pubpeer.com or using the browser plug-in that automatically highlights studies for which there are comments will see your comment.
Manuscript (required for Preprint and Journal Article) Manuscripts are mostly used as they are the traditional form of a research article. For judgment and decision making, there are useful examples by Feldman (2024). For reproducibility analyses, the I4R Replication Report Template (https://osf.io/j2qrx) can be used. Moreover, Röseler et al. (2025) provide general templates for reproductions and replications.
Preprint We recommend publishing a report in the form of a traditional or standardized manuscript as a preprint. This secures open access and makes the report visible, citable, and commentable. There are many preprint servers across the social sciences (e.g., PsyArXiv, SocArXiv, SportRxiv, MediArXiv, MindRxiv, EdArXiv, AfricArXiv, or MetaArXiv). In some countries, researchers have a legal right to a secondary publication of their research (green open access). Be aware that preprints are faster to publish than journal articles but are usually not peer-reviewed.
Journal article Most researchers have to “play by the rules”, that is, publish or perish (Bakker, Dijk, and Wicherts 2012; Koole and Lakens 2012). While some have argued for a pottery barn rule where journals that published the original finding have to publish respective replication attempts (e.g., Srivastava 2012), many journals are not (yet) interested in replications. Notable exceptions are listed in the appendix. This is why journals dedicated to replications have emerged (see Table 8.2). Moreover, researchers can submit their preprint to a PCI community (see https://peercommunityin.org/current-pcis/), which is a preprint peer-review service. Several journals are PCI-friendly, which means that they publish articles recommended by the respective PCI.

Many institutions and libraries recommend adding a CC-BY disclaimer to journal submissions, which gives researchers the right to use the accepted manuscript as they like, or choosing Diamond Open Access journals, which are defined by charging no fees for publishing or reading research.

Table 8.2. Active journals dedicated to reproductions and replications.
Journal name Commercial status Owners Disciplines Article types Website
Journal of Comments and Replications in Economics Non-commercial, diamond OA ZBW Economics Replications, Reproductions and comments research https://jcr-econ.org/
Replication Research Non-commercial, diamond OA Münster Center for Open Science and FORRT Multidisciplinary Reproductions, Replications, Conceptual articles https://replicationresearch.or/
Journal of Open Psychology Data Commercial, Gold OA (APCs: 450 pounds) Ubiquity Press Psychology Reproductions (only as Registered Reports) https://openpsychologydata.metajnl.com/
Journal of Robustness Reports Non-commercial, diamond OA SciPost Multidisciplinary At least two independent reproductions are required, limited to 500 words https://scipost.org/JRobustRep
Rescience C Non-commercial, diamond OA Olivia Guest, Benoît Girard, Konrad Hinsen, Nicolas P. Rougier Multidisciplinary Reproductions https://rescience.github.io/
Journal of Management Scientific Reports Commercial (subscription based) Sage Management Replications, reproductions, related methods https://smgmt.org/jomsr/
Journal of Reproducibility in Neuroscience Non-commercial, diamond OA Center of Trial and Error Neuroscience Replications, Comments, Reviews, conceptual articles https://jrn.trialanderror.org/
Rescience X Non-commercial, diamond OA Etienne B. Roesch Multidisciplinary Replications (Experiments) http://rescience.org/x
AIS Transactions on Replication Research Non-commercial, diamond OA Association for Information Systems Information Systems Exact, Methodological, Conceptual Replications https://aisel.aisnet.org/trr/

9 Field-Specific Replication Challenges: An example from MRI research

9.0.1 Introduction

While the principles of reproducibility and replication apply across scientific disciplines, certain fields face distinct methodological and practical challenges. Neuroimaging research, particularly MRI-based studies, is one example where field-specific complexities cause specific challenges for data sharing, reproducibility, and replicability. Other fields may have different specialized requirements on these topics. Generally, false-positive findings are likely driven by a combination of low statistical power, a high number of researcher degrees of freedom and statistical tests, and biased motivation towards obtaining positive (i.e., significant) results (Ioannidis 2005). Most of these factors are arguably aggravated in MRI studies, making replication research in this field particularly relevant, albeit challenging. In addition, the analyzed data and obtained findings are characterized by a three-dimensional spatial component (or four dimensions in the case of functional MRI (fMRI), which adds a time series), which further complicates the matter. In the following, we summarize the inherent peculiarities of replication research in the field of neuroimaging.

9.0.2 Researcher Degrees of Freedom

Brain imaging comes with a massive number of researcher degrees of freedom along the preprocessing and analysis pipelines. Preprocessing steps include, for example, motion correction procedures, spatial normalization, and smoothing, with additional steps necessary for some imaging modalities, such as temporal signal filtering for fMRI. For each of these steps, a multitude of parameter options and toolboxes is available. It has been shown that different preprocessing toolboxes can lead to fundamentally different results, even when aiming to harmonize all parameters (X. Zhou et al. 2022), and that different teams analyzing the same dataset can arrive at different final conclusions depending on the pipeline used (Botvinik-Nezer et al. 2020). Furthermore, a large variety of operationalizations of neurobiological targets is available. For example, cerebral gray matter structure could be investigated as voxel-wise gray matter, or as segmentation-based regional cortical surface area, thickness, or gyrification.

On the analysis side, the high number of researcher degrees of freedom is mainly a consequence of the multidimensional data structure. The central question is where in the brain to look for effects and how to define significance in the face of a large number of tests. There is an immensely high number of single data points, represented by spatial units in the obtained individual images (e.g., two-dimensional pixels or three-dimensional voxels). Analysis is often done using mass-univariate approaches, where a statistical model is calculated separately for each of these spatial units; in cerebral MRI research, for example, analyses spanning roughly 400,000 voxels are common. To avoid false-positive findings, regions of interest (ROIs) are often defined, or the analysis is restricted to a smaller region of the brain (i.e., small volume correction) to narrow down the search space, and dedicated methods to correct for multiple testing are applied (Han, Glenn, and Dawson 2019). This again results in a multitude of options, such as the anatomical vs. functional definition of a ROI based on several different atlases, and a variety of voxel-based or cluster-based inference methods to choose from. Botvinik-Nezer et al. (2020) gave the same fMRI dataset (raw and preprocessed data), along with predefined hypotheses, to 70 independent analysis teams and observed substantial variation in the obtained results, attributable to variability in the analysis pipelines (in fact, no two of the 70 teams used the same pipeline). Even when the same code and data are available, reproducing MRI analyses can be challenging (Leehr et al. 2024).
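The multiple-testing problem behind these corrections can be sketched on toy data. This minimal simulation (numbers chosen for illustration; real analyses involve hundreds of thousands of voxels and more sophisticated corrections such as cluster-based familywise error control) runs one t-test per "voxel" under the null and compares uncorrected with Bonferroni-corrected counts of significant voxels:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_voxels = 30, 5000  # toy stand-in for ~400,000 voxels

# Two groups with NO true difference at any voxel
group_a = rng.normal(0.0, 1.0, size=(n_subjects, n_voxels))
group_b = rng.normal(0.0, 1.0, size=(n_subjects, n_voxels))

# Mass-univariate approach: an independent t-test at every voxel
t_vals, p_vals = stats.ttest_ind(group_a, group_b, axis=0)

alpha = 0.05
n_uncorrected = int((p_vals < alpha).sum())            # roughly alpha * n_voxels false positives
n_bonferroni = int((p_vals < alpha / n_voxels).sum())  # conservative familywise correction

print(n_uncorrected, n_bonferroni)
```

With no true effect anywhere, the uncorrected analysis still flags on the order of 250 voxels as "significant", while the Bonferroni-corrected count drops to (near) zero, illustrating why correction choices matter so much at brain-wide scale.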

9.0.3 Sample Size Justification

The gold standard for sample size justification is a power analysis. In neuroimaging, this is complicated by the outlined mass-univariate three-dimensional data structure. Any power analysis would need to incorporate assumptions about the covariance structure of all data points, the spatial extent and distribution of statistical effects, and the method used to correct for multiple tests. While these numerous tests are not independent of one another, the extent of their spatial covariance structure is difficult to assess and depends on preprocessing steps such as image smoothing, but also on the data and the specific research question. Due to the high number of single data points, the obtained result is not a single statistical estimate with one effect size but rather a highly individual three-dimensional distribution of effect sizes around a peak localization. Simulation-based power analysis approaches have previously been suggested to address this problem. However, valid simulations require assumptions about the spatial distributions of effects (contingent on regional anatomical peculiarities and on the specific research question), which are often difficult to assess, and many of the power analysis tools that were developed have since been discontinued. To date, power analyses remain extremely rare in MRI research.
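As a much-simplified illustration of the simulation-based approach, the sketch below estimates power for a single voxel by Monte Carlo: it repeatedly simulates two groups, tests at a stringent voxel-level alpha (a crude stand-in for a multiple-testing-corrected threshold), and counts the fraction of significant runs. All parameter values are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(effect_size, n_per_group, n_sims=2000, alpha=0.001):
    """Monte Carlo power estimate for a two-sample t-test at a single voxel.
    The stringent alpha stands in for a multiple-testing-corrected threshold."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

power_small_n = simulated_power(0.2, n_per_group=25)   # typical MRI sample size
power_large_n = simulated_power(0.2, n_per_group=500)  # far larger sample

print(power_small_n, power_large_n)
```

For a small effect (d = 0.2) at a typical MRI sample size, the estimated power is close to zero; a realistic whole-brain simulation would additionally have to model the spatial covariance and extent of the effect, which is exactly where valid assumptions become hard to come by.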

Without proper power estimation, justifying sample size becomes challenging. As in other fields of research, statistical power ultimately depends on the expected effect size. Recent large-scale investigations in the domain of mental health neuroimaging suggest that the maximum underlying effect sizes are very small across various neuroimaging modalities (below 2% explained variance; Marek et al. 2022; Winter et al. 2022) and could require thousands of individuals to obtain robust and replicable statistical estimates (Marek et al. 2022). In contrast, given the labor-intensive and costly nature of MRI assessments, most MRI studies tend to have small sample sizes, making them likely underpowered (Button et al. 2013). Smaller samples may, however, be suitable for research questions where the neurobiological effect sizes are expected to be larger, such as in psychosis research, or when using highly individually tailored or within-subject designs (Lynch et al. 2024; Marek et al. 2022; Rosenberg and Finn 2022; Spisak, Bingel, and Wager 2023).
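A back-of-the-envelope calculation shows why such small effects demand such large samples. The sketch below uses the standard Fisher-z approximation for the sample size needed to detect a correlation (our simplifying assumption; brain-wide association studies use more elaborate procedures), comparing a conventional alpha with a stringent, correction-like alpha:

```python
import math
from scipy.stats import norm

def n_for_correlation(r, alpha, power=0.80):
    """Approximate N to detect a correlation r, via the Fisher z transform."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

r = math.sqrt(0.02)  # 2% explained variance corresponds to r of about 0.14

n_conventional = n_for_correlation(r, alpha=0.05)  # hundreds of participants
n_stringent = n_for_correlation(r, alpha=1e-5)     # thousands, once alpha mimics brain-wide correction

print(n_conventional, n_stringent)
```

At a conventional alpha the required N is already in the hundreds; once the alpha level approximates a brain-wide multiple-testing correction, it climbs into the thousands, consistent with the estimates of Marek et al. (2022).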

9.0.4 Criteria of Replication Success

The three-dimensional data structure requires special attention when defining replication success. In addition to the usual statistical criteria, it has to be specified where in the brain the criteria of replication success should be met. As discussed above, there is not one effect size but rather a 3D map with an effect size for each spatial unit (e.g., voxel). Goltermann and Altegoer (2025) describe a variety of potential criteria focusing on statistical significance in accordance with different spatial definitions revolving around the original finding. These include significance either at the peak voxel location (where the effect in the original study had the largest effect size), in a ROI defined in terms of spatial proximity to this peak voxel (for example, a 15mm sphere centered on the peak voxel), or in an anatomically defined region where the original effect was found (for example, anywhere in the hippocampus). Another possibility is a ROI derived directly from the original results mask, if available (i.e., the original thresholded mask). Each of these spatial definitions comes with important limitations. For example, the meaning of proximity could be judged very differently in different locations in the brain, as anatomically or functionally defined structures vary in size and distinctiveness (e.g., the small, clearly delineated amygdala versus the large, hard-to-delineate dorsolateral prefrontal cortex). Thus, it may be necessary to combine several criteria in a systematic and/or subjective manner.
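The proximity-based criterion above can be sketched in a few lines. This hypothetical helper (the coordinates and the 15mm radius are illustrative; a real assessment would also check the direction and significance of the effect) tests whether a replication's peak coordinate falls within a sphere around the original peak:

```python
import numpy as np

def peak_within_sphere(original_peak_mm, replication_peak_mm, radius_mm=15.0):
    """One possible spatial criterion: does the replication's peak coordinate
    fall within a sphere of radius_mm around the original study's peak?"""
    diff = np.asarray(replication_peak_mm, float) - np.asarray(original_peak_mm, float)
    return float(np.linalg.norm(diff)) <= radius_mm

# Hypothetical MNI coordinates (mm): an original peak and two candidate
# replication peaks
near_hit = peak_within_sphere([-24, -4, -18], [-20, -8, -14])  # ~6.9 mm apart
far_miss = peak_within_sphere([-24, -4, -18], [30, 22, 40])

print(near_hit, far_miss)  # True False
```

Note that a fixed radius treats all brain regions alike, which is precisely the limitation discussed above: 15mm may span several distinct structures near the amygdala yet remain well inside the dorsolateral prefrontal cortex.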

It should be noted that these criteria apply to voxel-based analyses. For other neuroimaging techniques, such as segmentation-based MRI analysis, diffusion tensor imaging (white matter integrity), or functional connectivity metrics, other criteria for replication success may be necessary.

9.0.5 Open Science Practices in Neuroimaging

While suggestions on open science practices and replication studies are not fundamentally different from those in other research areas, their necessity for neuroimaging studies may be even more pressing, and there are some peculiarities to consider. Due to the high number of researcher degrees of freedom, the use of automated preprocessing pipelines is highly advisable (e.g., Esteban et al. 2019), ideally in combination with containerized toolbox environments for preprocessing and analysis (Renton et al. 2024). In the face of reproducibility challenges, the transparent publication of preprocessing and analysis scripts becomes even more vital. While the publication of data is advised whenever possible, this can be difficult when sensitive patient data are included or when anonymization is difficult. For example, while this is currently a subject of debate, MRI-derived brain scans may retain fingerprint-like identifiable features, even when the face is removed from the image (Jwa, Koyejo, and Poldrack 2024; Abramian and Eklund 2019). When the publication of raw data is not possible, comprehensive statistical brain maps (i.e., the statistical results in each voxel) should be made publicly available in non-thresholded form (Taylor et al. 2023) and/or data can be published in aggregated form (e.g., summarized for one brain region). Preregistrations can and should be used to make the exploitation of researcher degrees of freedom more transparent, and multiple templates are available to facilitate preregistration in neuroimaging. To incorporate the specifics of MRI studies, Beyer et al. (2021) developed an fMRI-specific template, which can be accessed at https://doi.org/10.23668/psycharchives.5121. For replication research, preregistrations should contain a definition of replication-success criteria that takes into consideration the spatial dimension of the results.
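The case for sharing non-thresholded maps can be made concrete with a toy statistical map. In this sketch (the array size is illustrative, and 2.3 is a commonly used z-threshold), thresholding discards the estimates at most voxels, which is exactly the information a later replication or meta-analysis would need:

```python
import numpy as np

rng = np.random.default_rng(7)
stat_map = rng.normal(0.0, 1.0, size=(8, 8, 8))  # toy 3D map of voxel-wise z-values

z_threshold = 2.3
thresholded = np.where(np.abs(stat_map) >= z_threshold, stat_map, 0.0)

n_total = stat_map.size                           # every voxel carries an estimate
n_surviving = int(np.count_nonzero(thresholded))  # only a few voxels survive thresholding

print(n_total, n_surviving)
```

Sharing only the thresholded mask preserves a small fraction of the voxel-wise estimates; the non-thresholded map retains all of them, at no extra anonymization risk beyond the map itself.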
Overall, open science practices and replications are still extremely rare in neuroimaging research despite their pressing relevance. Finally, there are also unique tensions to be navigated between open science practices in neuroimaging and the ongoing climate crisis, for example the sustainability of data sharing (see Puhlmann et al. 2025 for a perspective).

Part IV Conclusion and Checklist

10 Conclusion

As replication researchers from multiple disciplines, we have discussed current standards, best practices, and open debates surrounding the planning and execution of reproductions and replications. We have also highlighted the need for field-specific guidance and debate by presenting the special case of replications with MRI data. Our recommendations are summarized in the checklist below. With decades of research waiting to be reproduced and replicated, we hope to provide a starting point for interdisciplinary discussions and to support researchers in embracing the essential and exciting work of repeating research.

10.1 Reproductions and Replications Checklist

  • Justify choice of target study and claims
  • Choose a reproduction/replication type that aligns with your aims
  • Gather and review all relevant materials
  • Reproduce before you replicate, where possible
  • Discuss all updates, changes, and extensions of the original materials (as close as possible, as updated as necessary)
  • Preregister your study and analysis plan
  • Predetermine conditions for success and failure
  • Use balanced language when describing the outcomes
  • Carefully evaluate outcomes and potential reasons for divergences
  • Report your research comprehensively and make it openly accessible

References

Abramian, David, and Anders Eklund. 2019. “Refacing: Reconstructing Anonymized Facial Features Using GANs.” In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 1104–8. IEEE. https://doi.org/10.1109/ISBI.2019.8759515.

Adler, S. J., L. Röseler, and M. K. Schöniger. 2023. “A Toolbox to Evaluate the Trustworthiness of Published Findings.” Journal of Business Research 167: 114189. https://doi.org/10.1016/j.jbusres.2023.114189.

Aguinis, H., and A. M. Solarino. 2019. “Transparency and Replicability in Qualitative Research: The Case of Interviews with Elite Informants.” Strategic Management Journal 40 (8): 1291–1315. https://doi.org/10.1002/smj.3015.

Ankel-Peters, J., A. Brodeur, A. Dreber, M. Johannesson, F. Neubauer, and J. Rose. 2025. “A Protocol for Structured Robustness Reproductions and Replicability Assessments.” Q Open, qoaf004. https://doi.org/10.1093/qopen/qoaf004.

Ankel-Peters, J., N. Fiala, and F. Neubauer. 2023. “Do Economists Replicate?” Journal of Economic Behavior & Organization 212: 219–32. https://doi.org/10.1016/j.jebo.2023.05.009.

American Psychological Association. n.d. “Journal Article Reporting Standards (JARS): Quantitative Replications Reporting Table.” https://apastyle.apa.org/jars/quant-table-6.pdf.

Bakker, M., A. van Dijk, and J. M. Wicherts. 2012. “The Rules of the Game Called Psychological Science.” Perspectives on Psychological Science 7 (6): 543–54. https://doi.org/10.1177/1745691612459060.

Bargh, J. A. 2006. “What Have We Been Priming All These Years? On the Development, Mechanisms, and Ecology of Nonconscious Social Behavior.” European Journal of Social Psychology 36 (2): 147–68. https://doi.org/10.1002/ejsp.336.

Bartoš, F., and U. Schimmack. 2022. “Z-Curve 2.0: Estimating Replication Rates and Discovery Rates.” Meta-Psychology 6. https://doi.org/10.15626/MP.2021.2720.

Baumeister, R. F., D. M. Tice, and B. J. Bushman. 2022. “A Review of Multisite Replication Projects in Social Psychology: Is It Viable to Sustain Any Confidence in Social Psychology’s Knowledge Base?” Perspectives on Psychological Science 18 (4): 912–35. https://doi.org/10.1177/17456916221121815.

Baumeister, R. F., and K. D. Vohs. 2016. “Misguided Effort with Elusive Implications.” Perspectives on Psychological Science 11 (4): 574–75. https://doi.org/10.1177/1745691616652878.

Bekkers, R. 2024. “Replication Value: A Comment and Alternative.” https://doi.org/10.31234/osf.io/uj5g7.

Bennett, E. A. 2021. “Open Science from a Qualitative, Feminist Perspective: Epistemological Dogmas and a Call for Critical Examination.” Psychology of Women Quarterly 45 (4): 448–56. https://doi.org/10.1177/03616843211036460.

Berinsky, A. J., J. N. Druckman, and T. Yamamoto. 2021. “Publication Biases in Replication Studies.” Political Analysis 29 (3): 370–84. https://doi.org/10.1017/pan.2020.34.

Berkeley Initiative for Transparency in the Social Sciences. 2020. “Guide for Advancing Computational Reproducibility in the Social Sciences.” https://bitss.github.io/ACRE/.

Beyer, F., J. Flannery, R. Gau, L. Janssen, L. Schaare, H. Hartmann, G. Nilsonne, et al. 2021. “A fMRI Pre-Registration Template.” PsychArchives. https://doi.org/10.23668/PSYCHARCHIVES.5121.

Block, J., and A. Kuckertz. 2018. “Seven Principles of Effective Replication Studies: Strengthening the Evidence Base of Management Research.” Management Review Quarterly 68 (4): 355–59. https://doi.org/10.1007/s11301-018-0149-3.

Borgstede, M., and M. Scholz. 2021. “Quantitative and Qualitative Approaches to Generalization and Replication–A Representationalist View.” Frontiers in Psychology 12: 605191. https://doi.org/10.3389/fpsyg.2021.605191.

Bosco, F. A., K. L. Uggerslev, and P. Steel. 2017. “MetaBUS as a Vehicle for Facilitating Meta-Analysis.” Human Resource Management Review 27 (1): 237–54. https://doi.org/10.1016/j.hrmr.2016.09.013.

Botvinik-Nezer, R., F. Holzmeister, C. F. Camerer, A. Dreber, J. Huber, M. Johannesson, and J. R. Rieck. 2020. “Variability in the Analysis of a Single Neuroimaging Dataset by Many Teams.” Nature 582 (7810): 84–88.

Boyce, V., B. Prystawski, A. B. Abutto, E. M. Chen, Z. Chen, H. Chiu, and M. C. Frank. 2024. “Estimating the Replicability of Psychology Experiments After an Initial Failure to Replicate,” May. https://doi.org/10.31234/osf.io/an3yb.

Brandt, M. J., H. IJzerman, A. Dijksterhuis, F. J. Farach, J. Geller, R. Giner-Sorolla, and A. Van’t Veer. 2014. “The Replication Recipe: What Makes for a Convincing Replication?” Journal of Experimental Social Psychology 50: 217–24. https://doi.org/10.1016/j.jesp.2013.10.005.

Brodeur, A., N. M. Cook, J. S. Hartley, and A. Heyes. 2024. “Do Preregistration and Preanalysis Plans Reduce p-Hacking and Publication Bias? Evidence from 15,992 Test Statistics and Suggestions for Improvement.” Journal of Political Economy Microeconomics 2 (3): 527–61. https://doi.org/10.1086/730455.

Brodeur, A., A. Dreber, F. Hoces de la Guardia, and E. Miguel. 2024. “Reproduction and Replication at Scale.” Nature Human Behaviour 8 (1): 2–3. https://doi.org/10.1038/s41562-023-01807-2.

Bryan, C. J., D. S. Yeager, and J. M. O’Brien. 2019. “Replicator Degrees of Freedom Allow Publication of Misleading Failures to Replicate.” Proceedings of the National Academy of Sciences 116 (51): 25535–45. https://doi.org/10.1073/pnas.1910951116.

Buttliere, B. 2024. “Was This Registered Report Pilot Tested? Examination of Vaidis, Sleegers, van Leeuwen, DeMarree, … & Priolo, d. (2024).” https://doi.org/10.31234/osf.io/c6r8x.

Button, K., J. Ioannidis, C. Mokrysz, et al. 2013. “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience.” Nature Reviews Neuroscience 14: 365–76. https://doi.org/10.1038/nrn3475.

Calder, B. J., L. W. Phillips, and A. M. Tybout. 1981. “Designing Research for Application.” Journal of Consumer Research 8 (2): 197–207. https://doi.org/10.1086/208856.

Carter, E. C., F. D. Schönbrodt, W. M. Gervais, and J. Hilgard. 2019. “Correcting for Bias in Psychology: A Comparison of Meta-Analytic Methods.” Advances in Methods and Practices in Psychological Science 2 (2): 115–44. https://doi.org/10.1177/2515245919847196.

Chartrand, T. L., and J. A. Bargh. 1999. “The Chameleon Effect: The Perception–Behavior Link and Social Interaction.” Journal of Personality and Social Psychology 76 (6): 893. https://doi.org/10.1037/0022-3514.76.6.893.

Clark, C. J., T. Costello, G. Mitchell, and P. E. Tetlock. 2022. “Keep Your Enemies Close: Adversarial Collaborations Will Improve Behavioral Science.” Journal of Applied Research in Memory and Cognition 11 (1): 1. https://doi.org/10.1037/mac0000004.

Clarke, B., P. Y. (K.) Lee, S. R. Schiavone, M. Rhemtulla, and S. Vazire. 2024. “The Prevalence of Direct Replication Articles in Top-Ranking Psychology Journals.” American Psychologist. https://doi.org/10.1037/amp0001385.

Cole, N. L., S. Ulpts, A. Bochynska, E. Kormann, M. Good, B. Leitner, and T. Ross-Hellauer. 2024. “Reproducibility and Replicability of Qualitative Research: An Integrative Review of Concepts, Barriers and Enablers.” https://doi.org/10.31222/osf.io/n5zkw_v1.

Coles, N. A., D. S. March, F. Marmolejo-Ramos, et al. 2022. “A Multi-Lab Test of the Facial Feedback Hypothesis by the Many Smiles Collaboration.” Nature Human Behaviour 6: 1731–42. https://doi.org/10.1038/s41562-022-01458-9.

Corcoran, A. W., J. Hohwy, and K. J. Friston. 2023. “Accelerating Scientific Progress Through Bayesian Adversarial Collaboration.” Neuron 111 (22): 3505–16. https://doi.org/10.1016/j.neuron.2023.08.027.

Cortina, J. M., T. Köhler, and L. C. Aulisi. 2023. “Current Reproducibility Practices in Management: What They Are Versus What They Could Be.” Journal of Management Scientific Reports 1 (3-4): 171–205. https://doi.org/10.1177/27550311231202696.

Cowan, N., C. Belletier, J. M. Doherty, A. J. Jaroslawska, S. Rhodes, A. Forsberg, M. Naveh-Benjamin, P. Barrouillet, V. Camos, and R. H. Logie. 2020. “How Do Scientific Views Change? Notes from an Extended Adversarial Collaboration.” Perspectives on Psychological Science 15 (4): 1011–25. https://doi.org/10.1177/1745691620906415.

DeBruine, L., and D. Lakens. 2025. Papercheck: Check Scientific Papers for Best Practices. R Package Version 0.0.0.9053. https://github.com/scienceverse/papercheck.

Dreber, A., and M. Johannesson. 2024. “A Framework for Evaluating Reproducibility and Replicability in Economics.” Economic Inquiry. https://doi.org/10.1111/ecin.13244.

Dunlap, K. 1926. “The Experimental Methods of Psychology.” In Psychologies of 1925, edited by C. Murchison, 331–51. Clark University Press. https://doi.org/10.1037/11020-022.

Ebersole, C. R., M. B. Mathur, E. Baranski, D. J. Bart-Plange, N. R. Buttrick, C. R. Chartier, and P. Szecsi. 2020. “Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability.” Advances in Methods and Practices in Psychological Science 3 (3): 309–31. https://doi.org/10.1177/2515245920958687.

Errington, T. M., A. Denis, N. Perfito, E. Iorns, and B. A. Nosek. 2021. “Challenges for Assessing Replicability in Preclinical Cancer Biology.” eLife 10: e67995. https://doi.org/10.7554/eLife.67995.

Errington, T. M., M. Mathur, C. K. Soderberg, A. Denis, N. Perfito, E. Iorns, and B. A. Nosek. 2021. “Investigating the Replicability of Preclinical Cancer Biology.” eLife 10: e71601. https://doi.org/10.7554/eLife.71601.

Esteban, O., C. J. Markiewicz, R. W. Blair, C. A. Moodie, A. I. Isik, A. Erramuzpe, and J. Gorgolewski. 2019. “fMRIPrep: A Robust Preprocessing Pipeline for Functional MRI.” Nature Methods 16 (1): 111–16.

Fabrigar, L. R., D. T. Wegener, and R. E. Petty. 2020. “A Validity-Based Framework for Understanding Replication in Psychology.” Personality and Social Psychology Review 24 (4): 316–44. https://doi.org/10.1177/1088868320931366.

Feldman, G. 2024. “Registered Report Stage 1 Manuscript Template.” https://doi.org/10.17605/OSF.IO/YQXTP.

———. 2025. “The Value of Replications Goes Beyond Replicability and Is Associated with the Value of the Research It Replicates: Commentary on Isager Et Al., 2021.” Meta-Psychology 9. https://doi.org/10.15626/MP.2024.4326.

Fiedler, K. 2011. “Voodoo Correlations Are Everywhere—Not Only in Neuroscience.” Perspectives on Psychological Science 6 (2): 163–71. https://doi.org/10.1177/1745691611400237.

Fiedler, K., L. McCaughey, and J. Prager. 2021. “Quo Vadis, Methodology? The Key Role of Manipulation Checks for Validity Control and Quality of Science.” Perspectives on Psychological Science 16 (4): 816–26. https://doi.org/10.1177/1745691620970602.

Field, S. M., R. Hoekstra, L. Bringmann, and D. van Ravenzwaaij. 2019. “When and Why to Replicate: As Easy as 1, 2, 3?” Collabra: Psychology 5 (1): 46. https://doi.org/10.1525/collabra.218.

Field, S. M., Leonhard Volz, Artem Kaznatcheev, and Noah van Dongen. 2024. “Can a Good Theory Be Built Using Bad Ingredients?” Computational Brain & Behavior 7:608–15. https://doi.org/10.1007/s42113-024-00220-w.

Flake, J. K., and E. I. Fried. 2020. “Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them.” Advances in Methods and Practices in Psychological Science 3 (4): 456–65. https://doi.org/10.1177/2515245920952393.

Francis, G. 2012. “Too Good to Be True: Publication Bias in Two Prominent Studies from Experimental Psychology.” Psychonomic Bulletin & Review 19: 151–56. https://doi.org/10.3758/s13423-012-0227-9.

Freese, J., and D. Peterson. 2017. “Replication in Social Science.” Annual Review of Sociology 43 (1): 147–65. https://doi.org/10.1146/annurev-soc-060116-053450.

Friese, M., D. D. Loschelder, K. Gieseler, J. Frankenbach, and M. Inzlicht. 2019. “Is Ego Depletion Real? An Analysis of Arguments.” Personality and Social Psychology Review 23 (2): 107–31. https://doi.org/10.1177/1088868318762183.

Gignac, G. E., and M. Zajenkowski. 2020. “The Dunning-Kruger Effect Is (Mostly) a Statistical Artefact: Valid Approaches to Testing the Hypothesis with Individual Differences Data.” Intelligence 80: 101449. https://doi.org/10.1016/j.intell.2020.101449.

Goltermann, J., and L. Altegoer. 2025. “ReFiNe-MDD: Replicability of Findings in Neuroimaging in Depression.” https://doi.org/10.17605/OSF.IO/N86Q9.

Grant, S., K. S. Corker, D. T. Mellor, S. L. K. Stewart, A. G. Cashin, M. Lagisz, and B. A. Nosek. 2024. “TOP 2025: An Update to the Transparency and Openness Promotion Guidelines.” https://doi.org/10.31222/osf.io/nmfs6.

Hagger, M. S., N. L. D. Chatzisarantis, H. Alberts, C. O. Anggono, C. Batailler, A. Birt, R. Brand, et al. 2016. “A Multilab Preregistered Replication of the Ego-Depletion Effect.” Perspectives on Psychological Science 11 (4): 546–73.

Han, H., A. L. Glenn, and K. J. Dawson. 2019. “Evaluating Alternative Correction Methods for Multiple Comparison in Functional Neuroimaging Research.” Brain Sciences 9 (8): 198. https://doi.org/10.3390/brainsci9080198.

Hardwicke, T. E., and E. J. Wagenmakers. 2023. “Reducing Bias, Increasing Transparency and Calibrating Confidence with Preregistration.” Nature Human Behaviour 7 (1): 15–26. https://doi.org/10.1038/s41562-022-01497-2.

Hawkins, R. X., E. N. Smith, C. Au, J. M. Arias, R. Catapano, E. Hermann, and M. Frank. 2018. “Improving the Replicability of Psychological Science Through Pedagogy.” Advances in Methods and Practices in Psychological Science 1 (1): 7–18. https://doi.org/10.1177/2515245917740427.

Heathers, J. 2025. “An Introduction to Forensic Metascience.” https://doi.org/10.5281/zenodo.14871843.

Heirene, R., D. LaPlante, E. Louderback, B. Keen, M. Bakker, A. Serafimovska, and S. Gainsbury. 2024. “Preregistration Specificity and Adherence: A Review of Preregistered Gambling Studies and Cross-Disciplinary Comparison.” Meta-Psychology 8. https://doi.org/10.15626/MP.2021.2909.

Held, L., S. Pawel, and C. Micheloud. 2024. “The Assessment of Replicability Using the Sum of p-Values.” Royal Society Open Science 11 (8): 240149. https://doi.org/10.1098/rsos.240149.

Henriques, S. O., N. Rzayeva, S. Pinfield, and L. Waltman. 2023. “Preprint Review Services: Disrupting the Scholarly Communication Landscape?” https://doi.org/10.31235/osf.io/8c6xm.

Heroux, Michael A., Lorena A. Barba, Manish Parashar, Victoria Stodden, and Michela Taufer. 2018. “Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences.” https://doi.org/10.2172/1481626.

Heyard, R., S. Pawel, J. Frese, B. Voelkl, H. Würbel, S. McCann, and S. Zellers. 2025. “A Scoping Review on Metrics to Quantify Reproducibility: A Multitude of Questions Leads to a Multitude of Metrics.” Royal Society Open Science 12 (7): 242076. https://doi.org/10.1098/rsos.242076.

Höffler, J. H. 2017. “ReplicationWiki: Improving Transparency in Social Sciences Research.” D-Lib Magazine 23 (3): 1. https://doi.org/10.1045/march2017-hoeffler.

Huang, F. L., and A. B. Huang. 2024. “Replication Studies Using Secondary or Non-experimental Datasets.” School Psychology Review, 1–15. https://doi.org/10.1080/2372966X.2024.2346781.

Hüffmeier, J., J. Mazei, and T. Schultze. 2016. “Reconceptualizing Replication as a Sequence of Different Studies: A Replication Typology.” Journal of Experimental Social Psychology 66: 81–92. https://doi.org/10.1016/j.jesp.2015.09.009.

Hummel, T., and J. Manner. 2024. “A Literature Review on Reproducibility Studies in Computer Science.” In Proceedings of the 16th ZEUS Workshop on Services and Their Composition (ZEUS 2024)(CEUR). Vol. 3673.

Ioannidis, J. P. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): e124.

Isager, P. M., R. C. M. van Aert, Š. Bahník, M. J. Brandt, K. A. DeSoto, R. Giner-Sorolla, I. Krueger, et al. 2023. “Deciding What to Replicate: A Decision Model for Replication Study Selection Under Resource and Knowledge Constraints.” Psychological Methods 28 (2): 438–51.

Isager, Peder Mortvedt, Anna E. van’t Veer, Daniël Lakens, et al. 2021. “Replication Value as a Function of Citation Impact and Sample Size.” MetaArXiv. https://doi.org/10.31222/osf.io/knjea.

Jacowitz, K. E., and D. Kahneman. 1995. “Measures of Anchoring in Estimation Tasks.” Personality and Social Psychology Bulletin 21 (11): 1161–66. https://doi.org/10.1177/01461672952111004.

Janz, N., and J. Freese. 2021. “Replicate Others as You Would Like to Be Replicated Yourself.” PS: Political Science & Politics 54 (2): 305–8. https://doi.org/10.1017/S1049096520000943.

Jekel, M., S. Fiedler, R. Allstadt Torras, D. Mischkowski, A. R. Dorrough, and A. Glöckner. 2020. “How to Teach Open Science Principles in the Undergraduate Curriculum—the Hagen Cumulative Science Project.” Psychology Learning & Teaching 19 (1): 91–106. https://doi.org/10.1177/1475725719868149.

John, L. K., G. Loewenstein, and D. Prelec. 2012. “Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling.” Psychological Science 23 (5): 524–32. https://doi.org/10.1177/0956797611430953.

Jwa, Anita S., Oluwasanmi Koyejo, and Russell A. Poldrack. 2024. “Demystifying the Likelihood of Reidentification in Neuroimaging Data: A Technical and Regulatory Analysis.” Imaging Neuroscience 2 (March). https://doi.org/10.1162/imag_a_00111.

Kamermans, K. L., L. Dudda, T. Daikoku, and S. Verheyen. 2025. “The Is-Ought Problem in Deciding What to Replicate: Which Motives Guide Current Replication Practices?” https://doi.org/10.31234/osf.io/6xdy2_v2.

Kapitány, R., and C. M. Kavanagh. 2023. “Best Practices and Ethical Considerations for Crowd-Sourced Data in the Behavioral Sciences.” https://doi.org/10.31219/osf.io/sn5gh.

Karhulahti, V., M. Martončik, and M. Adamkovic. 2024. “Pre-Replication in Meaningful Science.” https://doi.org/10.31234/osf.io/5gn7m.

King, G. 1995. “Replication, Replication.” PS: Political Science & Politics 28 (3): 444–52. https://doi.org/10.2307/420301.

Klein, R. A., K. A. Ratliff, M. Vianello, R. B. Adams Jr, Š. Bahník, M. J. Bernstein, and B. A. Nosek. 2014. “Investigating Variation in Replicability.” Social Psychology. https://doi.org/10.1027/1864-9335/a000178.

Köhler, T., and J. M. Cortina. 2021. “Play It Again, Sam! An Analysis of Constructive Replication in the Organizational Sciences.” Journal of Management 47 (2): 488–518. https://doi.org/10.1177/0149206319843985.

Koole, S. L., and D. Lakens. 2012. “Rewarding Replications: A Sure and Simple Way to Improve Psychological Science.” Perspectives in Psychological Science 7: 608–14. https://doi.org/10.1177/1745691612462586.

Kranz, S. 2025. “Extensive Database of Economics Studies with Available Data.” https://ejd.econ.mathematik.uni-ulm.de/.

Kruger, J., and D. Dunning. 1999. “Unskilled and Unaware of It: How Difficulties in Recognizing One’s Own Incompetence Lead to Inflated Self-Assessments.” Journal of Personality and Social Psychology 77 (6): 1121–34. https://doi.org/10.1037/0022-3514.77.6.1121.

Lakens, D. 2022. “Sample Size Justification.” Collabra: Psychology 8 (1): 33267. https://doi.org/10.1525/collabra.33267.

———. 2024. “When and How to Deviate from a Preregistration.” Collabra: Psychology 10 (1). https://doi.org/10.1525/collabra.117094.

Lakens, D., and A. J. Etz. 2017. “Too True to Be Bad: When Sets of Studies with Significant and Nonsignificant Findings Are Probably True.” Social Psychological and Personality Science 8 (8): 875–81. https://doi.org/10.1177/1948550617693058.

Lakens, D., A. M. Scheel, and P. M. Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.

Landy, J. F., M. L. Jia, I. L. Ding, D. Viganola, W. Tierney, A. Dreber, and Crowdsourcing Hypothesis Tests Collaboration. 2020. “Crowdsourcing Hypothesis Tests: Making Transparent How Design Choices Shape Research Results.” Psychological Bulletin 146 (5): 451. https://doi.org/10.1037/bul0000220.

Lash, T. L., L. J. Collin, and M. E. Van Dyke. 2018. “The Replication Crisis in Epidemiology: Snowball, Snow Job, or Winter Solstice?” Current Epidemiology Reports 5: 175–83.

LeBel, E. P., R. J. McCarthy, B. D. Earp, M. Elson, and W. Vanpaemel. 2018. “A Unified Framework to Quantify the Credibility of Scientific Findings.” Advances in Methods and Practices in Psychological Science 1 (3): 389–402. https://doi.org/10.1177/2515245918787489.

Leehr, E. J., F. R. Seeger, J. Böhnlein, B. Gathmann, T. Straube, K. Roesmann, and Lueken. 2024. “Association Between Resting-State Connectivity Patterns in the Defensive System Network and Treatment Response in Spider Phobia—a Replication Approach.” Translational Psychiatry 14 (1): 137.

Leibniz Institute for the Social Sciences. 2023. “TOP Factor: Open Data Levels of Social Science Journals.” https://topfactor.org/journals?factor=Data+Transparency.

Lynch, C. J., I. G. Elbau, T. Ng, et al. 2024. “Frontostriatal Salience Network Expansion in Individuals in Depression.” Nature 633: 624–33. https://doi.org/10.1038/s41586-024-07805-2.

Mac Giolla, Erik, Simon Karlsson, David A. Neequaye, and Magnus Bergquist. 2024. “Evaluating the Replicability of Social Priming Studies.” Meta-Psychology 8. https://doi.org/10.15626/MP.2022.3308.

Mahoney, M. J. 1977. “Publication Prejudices: An Experimental Study of Confirmatory Bias in the Peer Review System.” Cognitive Therapy and Research 1: 161–75. https://doi.org/10.1007/BF01173636.

Makel, M. C., J. A. Plucker, and B. Hegarty. 2012. “Replications in Psychology Research: How Often Do They Really Occur?” Perspectives on Psychological Science 7 (6): 537–42. https://doi.org/10.1177/1745691612460688.

Marek, S., B. Tervo-Clemmens, F. J. Calabro, D. F. Montez, B. P. Kay, A. S. Hatoum, and N. U. Dosenbach. 2022. “Reproducible Brain-Wide Association Studies Require Thousands of Individuals.” Nature 603 (7902): 654–60.

Mazei, J., J. Hüffmeier, and T. Schultze. 2025. “Specification Curve and Reproducibility Dashboards for Social Science Research: Recommendations for Implementation.” Advances in Methods and Practices in Psychological Science.

McCarthy, R., W. Gervais, B. Aczel, R. L. Al-Kire, M. Aveyard, S. Marcella Baraldo, and C. Zogmaister. 2021. “A Multi-Site Collaborative Study of the Hostile Priming Effect.” Collabra: Psychology 7 (1): 18738. https://doi.org/10.1525/collabra.18738.

McManus, K. 2024. “Replication Studies in Second Language Acquisition Research: Definitions, Issues, Resources, and Future Directions: Introduction to the Special Issue.” Studies in Second Language Acquisition 46 (5): 1299–319. https://doi.org/10.1017/S0272263124000652.

McShane, B. B., and U. Böckenholt. 2017. “Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability.” Journal of Consumer Research 43 (6): 1048–63. https://doi.org/10.1093/jcr/ucw085.

Micheloud, C., and L. Held. 2022. “Power Calculations for Replication Studies.” Statistical Science 37 (3): 369–79. https://doi.org/10.1214/21-STS828.

Miłkowski, M., W. M. Hensel, and M. Hohol. 2018. “Replicability or Reproducibility? On the Replication Crisis in Computational Neuroscience and Sharing Only Relevant Detail.” Journal of Computational Neuroscience 45 (3): 163–72. https://doi.org/10.1007/s10827-018-0702-z.

Moreau, D., and K. Wiebels. 2023. “Ten Simple Rules for Designing and Conducting Undergraduate Replication Projects.” PLOS Computational Biology 19 (3): e1010957. https://doi.org/10.1371/journal.pcbi.1010957.

Munafò, M. R., C. D. Chambers, A. M. Collins, L. Fortunato, and M. R. Macleod. 2020. “Research Culture and Reproducibility.” Trends in Cognitive Sciences 24 (2): 91–93. https://doi.org/10.1016/j.tics.2019.12.002.

Muradchanian, J., R. Hoekstra, H. Kiers, and D. van Ravenzwaaij. 2021. “How Best to Quantify Replication Success? A Simulation Study on the Comparison of Replication Success Metrics.” Royal Society Open Science 8 (5): 201697. https://doi.org/10.1098/rsos.201697.

Mussweiler, T., F. Strack, and T. Pfeiffer. 2000. “Overcoming the Inevitable Anchoring Effect: Considering the Opposite Compensates for Selective Accessibility.” Personality and Social Psychology Bulletin 26 (9): 1142–50. https://doi.org/10.1177/01461672002611010.

Nagy, Tamás, Jane Hergert, Mahmoud M. Elsherif, Lukas Wallrich, Kathleen Schmidt, Tal Waltzer, Jason W. Payne, et al. 2025. “Bestiary of Questionable Research Practices in Psychology.” Advances in Methods and Practices in Psychological Science 8 (3). https://doi.org/10.1177/25152459251348431.

Nelson, L. D., J. Simmons, and U. Simonsohn. 2018. “Psychology’s Renaissance.” Annual Review of Psychology 69 (1): 511–34. https://doi.org/10.1146/annurev-psych-122216-011836.

Nosek, B. A., and T. M. Errington. 2020. “What Is Replication?” PLoS Biology 18 (3): e3000691. https://doi.org/10.1371/journal.pbio.3000691.

Nuijten, M. B., and J. R. Polanin. 2020. “‘Statcheck’: Automatically Detect Statistical Reporting Inconsistencies to Increase Reproducibility of Meta‐analyses.” Research Synthesis Methods 11 (5): 574–79. https://doi.org/10.1002/jrsm.1408.

Nüst, D., and S. J. Eglen. 2021. “CODECHECK: An Open Science Initiative for the Independent Execution of Computations Underlying Research Articles During Peer Review to Improve Reproducibility.” F1000Research 10: 253. https://doi.org/10.12688/f1000research.51738.2.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. https://doi.org/10.1126/science.aac4716.

Orne, Martin T. 2017. “On the Social Psychology of the Psychological Experiment: With Particular Reference to Demand Characteristics and Their Implications.” In Sociological Methods, 279–99. Routledge.

Patil, P., R. D. Peng, and J. T. Leek. 2016a. “A Statistical Definition for Reproducibility and Replicability.” BioRxiv, 066803. https://doi.org/10.1101/066803.

———. 2016b. “What Should Researchers Expect When They Replicate Studies? A Statistical View of Replicability in Psychological Science.” Perspectives on Psychological Science 11 (4): 539–44. https://doi.org/10.1177/1745691616646366.

Pawel, S., G. Consonni, and L. Held. 2023. “Bayesian Approaches to Designing Replication Studies.” Psychological Methods. https://doi.org/10.1037/met0000604.

Pennington, C. R. 2023. A Student’s Guide to Open Science: Using the Replication Crisis to Reform Psychology. Open University Press.

Perry, T., R. Morris, and R. Lea. 2022. “A Decade of Replication Study in Education? A Mapping Review (2011–2020).” Educational Research and Evaluation 27 (1-2): 12–34. https://doi.org/10.1080/13803611.2021.2022315.

Pittelkow, M. M., S. M. Field, P. M. Isager, T. van’t Veer, A. E. Anderson, S. N. Cole, and D. van Ravenzwaaij. 2023. “The Process of Replication Target Selection in Psychology: What to Consider?” Royal Society Open Science 10 (2): 210586. https://doi.org/10.1098/rsos.210586.

Pittelkow, M. M., S. M. Field, and D. van Ravenzwaaij. 2025. “Thinking Beyond RVCN: Addressing the Complexity of Replication Target Selection.” https://doi.org/10.31234/osf.io/6tmyx_v2.

Pittelkow, M. M., R. Hoekstra, J. Karsten, and D. van Ravenzwaaij. 2021. “Replication Target Selection in Clinical Psychology: A Bayesian and Qualitative Reevaluation.” Clinical Psychology: Science and Practice 28 (2): 210. https://doi.org/10.1037/cps0000013.

Powers, K. L., P. J. Brooks, N. J. Aldrich, M. A. Palladino, and L. Alfieri. 2013. “Effects of Video-Game Play on Information Processing: A Meta-Analytic Investigation.” Psychonomic Bulletin & Review 20 (6): 1055–79. https://doi.org/10.3758/s13423-013-0418-z.

Pownall, M. 2022. “Is Replication Possible for Qualitative Research?” https://doi.org/10.31234/osf.io/dwxeg.

Protzko, J. 2018. “Null-Hacking, a Lurking Problem.” https://doi.org/10.31234/osf.io/9y3mp.

Puhlmann, Lars, Anna Koppold, Gesa Feld, Tina Lonsdorf, Kirsten Hilger, Susanne Vogel, and Hannes Hartmann. 2025. “There Is No Research on a Dead Planet – Fostering Ecologically Sustainable Open Science Practices in Neuroscience.” OSF Preprint. https://doi.org/10.31219/osf.io/rju75_v1.

Renton, A. I., T. T. Dao, T. Johnstone, O. Civier, R. P. Sullivan, D. J. White, P. Lyons, et al. 2024. “Neurodesk: An Accessible, Flexible and Portable Data Analysis Environment for Reproducible Neuroimaging.” Nature Methods 21 (5): 804–8. https://doi.org/10.1038/s41592-023-02145-x.

Röseler, L., M. Hein, and P. Oppong Boakye. 2025. “Standardized Reproduction and Replication Templates (StaRT).” https://doi.org/10.17605/OSF.IO/BRXTD.

Röseler, L., L. Kaiser, C. Doetsch, N. Klett, C. Seida, A. Schütz, and Y. Zhang. 2024. “The Replication Database: Documenting the Replicability of Psychological Science.” Journal of Open Psychology Data 12 (1): 8. https://doi.org/10.5334/jopd.101.

Röseler, L., Astrid Schütz, Pia A. Blank, Marieluisa Dück, Sabine Fels, Jana Kupfer, Linda Scheelje, and Christian Seida. 2021. “Evidence Against Subliminal Anchoring: Two Close, Highly Powered, Preregistered, and Failed Replication Attempts.” Journal of Experimental Social Psychology 92: 104066. https://doi.org/10.1016/j.jesp.2020.104066.

Röseler, L., and L. Wallrich. 2024. FReD: Interfaces to the FORRT Replication Database. http://forrt.org/FReD/.

Rosenberg, M. D., and E. S. Finn. 2022. “How to Establish Robust Brain–Behavior Relationships Without Thousands of Individuals.” Nature Neuroscience 25: 835–37. https://doi.org/10.1038/s41593-022-01110-9.

Schauer, J. M., and L. V. Hedges. 2021. “Reconsidering Statistical Methods for Assessing Replication.” Psychological Methods 26 (1): 127–39. https://doi.org/10.1037/met0000302.

Schimmack, U. 2012. “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles.” Psychological Methods 17 (4): 551. https://doi.org/10.1037/a0029487.

Schmidt, S. 2009. “Shall We Really Do It Again? The Powerful Concept of Replication Is Neglected in the Social Sciences.” Review of General Psychology 13 (2): 90–100. https://doi.org/10.1037/a0015108.

Schöch, C. 2023. “Repetitive Research: A Conceptual Space and Terminology of Replication, Reproduction, Revision, Reanalysis, Reinvestigation and Reuse in Digital Humanities.” International Journal of Digital Humanities 5 (2): 373–403. https://doi.org/10.1007/s42803-023-00073-y.

Schultze, T., T. M. Gerlach, and J. C. Rittich. 2018. “Some People Heed Advice Less Than Others: Agency (but Not Communion) Predicts Advice Taking.” Journal of Behavioral Decision Making 31 (3): 430–45. https://doi.org/10.1002/bdm.2065.

Simmons, J. P., L. D. Nelson, and U. Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11): 1359–66. https://doi.org/10.1177/0956797611417632.

Simons, D. J., Y. Shoda, and D. S. Lindsay. 2017. “Constraints on Generality (COG): A Proposed Addition to All Empirical Papers.” Perspectives on Psychological Science 12 (6): 1123–28. https://doi.org/10.1177/1745691617708630.

Simonsohn, U. 2015. “Small Telescopes: Detectability and the Evaluation of Replication Results.” Psychological Science 26 (5): 559–69. https://doi.org/10.1177/0956797614567341.

Simonsohn, U., J. P. Simmons, and L. D. Nelson. 2020. “Specification Curve Analysis.” Nature Human Behaviour 4: 1208–14. https://doi.org/10.1038/s41562-020-0912-z.

Soderberg, C. K., T. M. Errington, S. R. Schiavone, et al. 2021. “Initial Evidence of Research Quality of Registered Reports Compared with the Standard Publishing Model.” Nature Human Behaviour 5: 990–97. https://doi.org/10.1038/s41562-021-01142-4.

Soto, C. J. 2019. “How Replicable Are Links Between Personality Traits and Consequential Life Outcomes? The Life Outcomes of Personality Replication Project.” Psychological Science 30 (5): 711–27. https://doi.org/10.1177/0956797619831612.

Spisak, T., U. Bingel, and T. D. Wager. 2023. “Multivariate BWAS Can Be Replicable with Moderate Sample Sizes.” Nature 615: E4–7. https://doi.org/10.1038/s41586-023-05745-x.

Srivastava, Sanjay. 2012. “A Pottery Barn Rule for Scientific Journals.” The Hardest Science blog. https://thehardestscience.com/2012/09/27/a-pottery-barn-rule-for-scientific-journals.

Syed, M. 2023. “Replication or Generalizability? How Flexible Inferences Uphold Unfounded Universal Claims,” May. https://doi.org/10.31234/osf.io/znv5r.

Taylor, P. A., R. C. Reynolds, V. Calhoun, J. Gonzalez-Castillo, D. A. Handwerker, P. Bandettini, and G. Chen. 2023. “Highlight Results, Don’t Hide Them: Enhance Interpretation, Reduce Biases and Improve Reproducibility.” NeuroImage 274: 120138.

The Turing Way Community. 2025. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research. 1.2.3 ed. Zenodo. https://doi.org/10.5281/zenodo.15213042.

Tsang, E. W., and K. M. Kwan. 1999. “Replication and Theory Development in Organizational Science: A Critical Realist Perspective.” Academy of Management Review 24 (4): 759–80. https://doi.org/10.2307/259353.

Urminsky, O., and B. J. Dietvorst. 2024. “Taking the Full Measure: Integrating Replication into Research Practice to Assess Generalizability.” Journal of Consumer Research 51 (1): 157–68. https://doi.org/10.1093/jcr/ucae007.

Van Bavel, Jay J., Peter Mende-Siedlecki, William J. Brady, and Diego A. Reinero. 2016. “Contextual Sensitivity in Scientific Reproducibility.” Proceedings of the National Academy of Sciences 113 (23): 6454–59. https://doi.org/10.1073/pnas.1521897113.

Vazire, S., S. R. Schiavone, and J. G. Bottesini. 2022. “Credibility Beyond Replicability: Improving the Four Validities in Psychological Science.” Current Directions in Psychological Science 31 (2): 162–68. https://doi.org/10.1177/09637214211067779.

Voelkl, B., R. Heyard, D. Fanelli, K. E. Wever, L. Held, Z. Maniadis, and H. Würbel. 2025. “Defining Reproducibility.” https://doi.org/10.17605/OSF.IO/BR9SP.

Vohs, K. D., B. J. Schmeichel, S. Lohmann, Q. F. Gronau, A. J. Finley, S. E. Ainsworth, L. Alquist, et al. 2021. “A Multisite Preregistered Paradigmatic Test of the Ego-Depletion Effect.” Psychological Science 32 (10): 1566–81. https://doi.org/10.1177/0956797621989733.

Wagenmakers, E. J., Q. F. Gronau, and J. Vandekerckhove. 2019. “Five Bayesian Intuitions for the Stopping Rule Principle.” https://doi.org/10.31234/osf.io/5ntkd.

Wagenmakers, E.-J., T. Beek, L. Dijkhoff, Q. F. Gronau, A. Acosta, R. B. Adams, D. Albohn, et al. 2016. “Registered Replication Report: Strack, Martin, & Stepper (1988).” Perspectives on Psychological Science 11 (6): 917–28. https://doi.org/10.1177/1745691616674458.

Wallrich, Lukas. 2025. “Small Telescopes for Higher-Power Replications.” Personal blog. https://www.lukaswallrich.coffee/blog/small-telescopes-for-higher-power-replications/.

Ward, M. K., and A. W. Meade. 2023. “Dealing with Careless Responding in Survey Data: Prevention, Identification, and Recommended Best Practices.” Annual Review of Psychology 74 (1): 577–96. https://doi.org/10.1146/annurev-psych-040422-045007.

Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, and B. Mons. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 1–9. https://doi.org/10.1038/sdata.2016.18.

Willroth, E. C., and O. E. Atherton. 2024. “Best Laid Plans: A Guide to Reporting Preregistration Deviations.” Advances in Methods and Practices in Psychological Science 7 (1): 25152459231213802. https://doi.org/10.1177/25152459231213802.

Winter, N. R., R. Leenings, J. Ernsting, K. Sarink, L. Fisch, D. Emden, and T. Hahn. 2022. “Quantifying Deviations of Brain Structure and Function in Major Depressive Disorder Across Neuroimaging Modalities.” JAMA Psychiatry 79 (9): 879–88.

Yarkoni, T. 2013. “What We Can and Can’t Learn from the Many Labs Replication Project.” talyarkoni.org blog.

Zhou, Haotian, and Ayelet Fishbach. 2016. “The Pitfall of Experimenting on the Web: How Unattended Selective Attrition Leads to Surprising (yet False) Research Conclusions.” Journal of Personality and Social Psychology 111 (4): 493–504. https://doi.org/10.1037/pspa0000056.

Zhou, X., R. Wu, Y. Zeng, Z. Qi, S. Ferraro, L. Xu, and B. Becker. 2022. “Choice of Voxel-Based Morphometry Processing Pipeline Drives Variability in the Location of Neuroanatomical Brain Markers.” Communications Biology 5 (1): 913.

Zwaan, R. A., A. Etz, R. E. Lucas, and M. B. Donnellan. 2018. “Making Replication Mainstream.” Behavioral and Brain Sciences 41: e120. https://doi.org/10.1017/S0140525X17001972.

A Author Contributions

*shared first authorship
No. Author Contribution (CRediT) Affiliation
1 Röseler, Lukas* ORCID Conceptualization, Project Administration, Writing – original draft, Writing – review & editing Münster Center for Open Science, University of Münster
2 Wallrich, Lukas* ORCID Writing – original draft, Writing – review & editing Birkbeck Business School, University of London
3 Hartmann, Helena ORCID Writing – original draft, Writing – review & editing Department for Neurology and Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Hospital Essen
4 Altegoer, Luisa ORCID Writing – review & editing University of Münster, Institute for Translational Psychiatry
5 Boyce, Veronica ORCID Writing – review & editing Department of Psychology, Stanford University
6 Field, Sarahanne M. ORCID Writing – review & editing Department of Pedagogy, University of Groningen
7 Goltermann, Janik ORCID Writing – review & editing University of Münster, Institute for Translational Psychiatry
8 Hüffmeier, Joachim ORCID Writing – review & editing TU Dortmund University
9 Pennington, Charlotte R. ORCID Writing – review & editing School of Psychology, Aston University, Birmingham
10 Pittelkow, Merle-Marie ORCID Writing – review & editing Berlin Institute of Health at Charité – Universitätsmedizin Berlin
11 Silverstein, Priya ORCID Writing – review & editing Center for Neuroscience and Cell Biology, University of Coimbra; Institute for Globally Distributed Open Research and Education
12 van Ravenzwaaij, Don ORCID Writing – review & editing Department of Psychology, University of Groningen
13 Azevedo, Flavio ORCID Writing – original draft, Writing – review & editing University of Utrecht, Department of Interdisciplinary Social Science

B Potential Conflicts of Interest

A large proportion of the authors are members of FORRT, an organization dedicated to integrating open and reproducible science into higher education. LR, LW, FA, and JG are inaugural editors of the in-development journal Replication Research (https://replicationresearch.org). Besides their conviction of the value of replications, their current project’s success relies on researchers conducting reproductions and replications. LR is the managing director of an institutional open science center and a proponent of repetitive research. The authors declare that they have no further potential conflicts of interest.

C Funding

LR received funding from the University of Münster and the ‘Landesinitiative openaccess.nrw’. LW and LR received funding from ‘UK Research and Innovation’. LW, HH, and FA received funding from the ‘Nederlandse Organisatie voor Wetenschappelijk Onderzoek’. JG received funding from ‘Innovative Medizinische Forschung’ (IMF) of the medical faculty of the University of Münster (GO122301). HH was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 422744262 – TRR 289 (gefördert durch die Deutsche Forschungsgemeinschaft (DFG) – Projektnummer 422744262 – TRR 289).

D Acknowledgments

This work is an initiative from The Framework for Open and Reproducible Research Training (FORRT; https://forrt.org), and all core-team authors are active members of FORRT’s Replication Hub (https://forrt.org/replication-hub).

We thank Patrick Smela for valuable ideas on defining replication success and Abel Brodeur for suggestions about definitions and the relationship between reproducibility and replicability.

E Email templates

E.1 Asking for materials and data

Dear [name of author(s)],

We are conducting replication research using some of your research. Specifically, we [brief name of the phenomenon and study that will be replicated]. [We do this because … e.g., your research addresses a very important question.] Can you please send us the following materials to help us design a replication as close as possible to your original study?

  • [list of required materials/data/code]

Citation of original study: [add citation]

We are looking forward to your responses!

Thank you, [Your name]

E.2 Asking for comments on an experimental paradigm

Dear [name of authors],

We are planning a replication of some of your research. Specifically, we are aiming to replicate your study [study details and citation]. [we are interested in these findings because …] I’m writing to share a mock-up of the replication to get your feedback on whether this paradigm accurately captures the design of your study. Please let me know if you have any comments or concerns that you’d like to share. Here’s a link to my paradigm.

Any insights you have into details that differ from your own study would be much appre­ciated. I will be replicating your experiment on [planned recruitment sample]. [I know this is a deviation from the original population you tested, and I will note this sample decision prominently in any writeups.]

Thanks again, [Your name]

E.3 Asking for comments on replication results

Dear [name of author(s)],

We have conducted replication research using some of your research. Specifically, we [brief name of the phenomenon and study that was replicated]. In our study [description of results].

We want to provide you with an opportunity to comment on these findings. We plan to publish the replication report via [paper or publication platform, e.g., FORRT’s Replica­tion Hub], which asks replication studies to be submitted alongside comments from the authors of the original study. Your comment – if you choose to give one – will be part of the report.

  • Citation of original study: [add citation]
  • Replication study: [add link to document or attach it to the e-mail]

We are looking forward to your responses!

Thank you, [Your name]

Notes

1 From an extreme inductive perspective that stresses that there is no logical foundation in inferring future events from previous events (Hume, 1748/2016), one could even argue that it may not make a difference whether one tries to make the same observation again under the same or different circumstances.

2 On a different note, Vohs et al. (2021) published a study that differed from previous work in that it did not replicate any earlier study but was instead designed as an ideal test of the theory and an estimate of the average effect size; the authors termed this a “paradigmatic replication approach”. Given the present terminology, we do not consider it a replication.

3 For registered reports, a journal reviews only the introduction and method sections, before any data have been collected. After an initial revision and “in-principle acceptance,” data are collected and the full report is submitted. The second round of review is concerned only with the authors’ adherence to the preregistration and their success in executing the planned study.

4 Note that a two-tailed test could be applied as well. Given that the original study has a clear effect and direction, a one-tailed test gives the original authors the benefit of the doubt.

Editors

Kathryn Zeiler
Editor-in-Chief

Stephen Pinfield
Handling Editor

Editorial assessment

by Stephen Pinfield

DOI: 10.70744/MetaROR.193.1.ea

All the reviewers are agreed that this book is a useful and timely contribution to work on reproducibility/replicability. Reviewers 1 and 2 state that the book is likely to be a point of reference for people wishing to engage in replication studies in particular, and that it provides helpful ‘how-to’ guidance. Reviewers draw attention to notable strengths of the book, with Reviewer 2 regarding the first section of the book, which defines key concepts, as strong, and Reviewer 1 making very positive comments about the section on inductive vs deductive perspectives on replication success in chapter 2. However, all the reviewers also have recommendations for substantial revisions to the book. There are observations in the reviews about repetition and inconsistencies across the book, including in how key concepts are understood, which Reviewer 3 attributes to different chapters probably being written by different contributors without sufficient oversight and evening out. This issue needs addressing so that, as Reviewer 1 comments, all the chapters more clearly relate to each other. Reviewer 3, who is the most critical of the work, suggests significant reorganisation of the book, with detailed comments about specific sections. These should be considered by the authors. Reviewer 1 pays careful attention to claims made in the book which would benefit from review and revision. Reviewer 2 draws attention to three major topics given, they believe, insufficient attention: data standardisation, constraints on generality, and incentives and career implications associated with reproducibility.  Reviewer 3 suggests that the audience of the book is not clear, meaning that there is a lack of focus at times in the topics covered and the depth to which they are treated. All the reviewers make substantial comments which the authors are recommended to consider in detail in making revisions to the book.

Competing interests: None.

Peer review 1

Hannah Fraser

DOI: 10.70744/MetaROR.193.1.rv1

I was pleased to be asked to review this handbook. I think there’s a lot of value in this resource for anyone considering conducting a replication/repetitive study or anyone interested in interpreting and contextualizing the results of replications. I think the authors did a great job at presenting the nuance of what information a replication study can provide, and this is something that is generally lacking. Overall, I think that the book would benefit from the sections being more clearly linked together, and there being more balance in which sections are described briefly vs in depth. I also think it would be worth discussing constraints on generality (https://journals.sagepub.com/doi/10.1177/1745691617708630) in this book. It ties into the hidden moderators work and the discussion of deductive vs inductive perspectives on replication.

I have provided a series of specific suggestions below that I hope will help the authors develop this resource into something that becomes the main resource for people considering replication studies. The comments below are given by section, with the relevant heading from the book followed by comments related to that section.

Summary

I suggest that ‘repeatedly testing’ could be ‘retesting’ or just ‘testing’. I think it’s pretty rare that the same study is tested multiple times (outside of a manylabs study design).

If the scope is limited to social science disciplines, I think it’s worth including that in the title.

Background

I suggest rephrasing this section. “Repeatability is the cornerstone of many sciences: A majority of the scientific progress rests on the successful accumulation of evidence for claims through reproduction and replications to establish robust discoveries. Reproductions and replications, that is repeated testing of a hypothesis with the same (reproduction) or different (replication) data, are necessary.” I would disagree that the majority of the scientific progress has involved explicit replication or reproduction, which is implied here. I think it would be fairer to say something like “Although these are often considered the conditions of research, without attempting to reproduce or replicate results, it’s unclear whether the conditions are being met”.

“Cumulative science without repetition is costly” this needs another clause. Something about trying to build on unreliable research resulting in research waste.

Figure 1.1 I suggest excluding the partial 2025 year because the stumpy little bar doesn’t provide interpretable information.

“Moreover, much of the guidance on replications is being developed actively” what does ‘developed actively’ mean? It sounds like this is a bad thing… or is it just the narrow field thing that’s the bad bit?

Reproduction and replication

“Reproduction and replication should always be considered together and if possible, reproduction should come before replication. This is because, at the early stages of research, reproduction is much more cost efficient; first confirming whether the findings are reproducible can clarify whether a replication is worthwhile. Furthermore, if the research procedure consists of “moving away” from a specific finding in terms of changing the analysis code, materials, and dataset to test its generalizability or boundary conditions, a numerical reproduction (using the same data and same code) is the closest possible repetition of a finding and a useful foundation for further steps. We discuss multiple cases to illustrate the relationship between reproduction and replication in Table 2.1 (Note that a similar distinction is made by The Turing Way Community (2025) but uses a less specific terminology for reproductions.)” I think this section needs to move a bit later – fully introduce the concepts of reproduction and replication before making suggestions for implementation.

Table 2.1 I don’t agree with the possible interpretations of reproducible/replicable research here. I think it’s a bit of a stretch to suggest that something towards the ‘direct’ end of the replication spectrum constitutes a test of generalisability. It can also pick up whether the original result was reliable or driven by sampling bias, questionable research practices, mistakes, or fraud. I suggest replacing these with a quick example rather than interpretation. Maybe you could pose a theoretical original finding and then give an example of what the findings would look like for each row of the table instead.

2.3 types of replication

“We heavily rely on the typology provided by Hüffmeier et al. (2016) where different types of replications are defined by the closeness or similarity between original and replication study.” I think this needs to be linked to the following sentence with something like “determining what constitutes closeness or similarity is context specific and relies on a detailed understanding of the original study.” and then go into the next sentence

The section on inductive vs deductive perspectives on replication success is fantastic. I’d love to see this spoken about more widely.

Figure 2.1 doesn’t sit exactly right with the discussion of inductive vs deductive perspectives, especially in terms of the last row, ‘contextual variables’. I think it would be worth adding a little note saying something along the lines of “from a deductive perspective, not all of these variables should be considered for a given study, depending on the underlying theory.”

I love the discussion of the results from Landy et al., but I wonder if it’s worth drawing in findings from many-analyst style studies here, which have become quite widespread and show a similar thing but just due to data-analytic changes.

3 choosing the target study

“Individual findings can be assessed through forensic meta-science tests (for an overview, see Heathers, 2025)” I think it would be worth giving a really quick line here about the kinds of things forensic meta-science tests detect

In this section “If the original paper reports multiple studies for the same phenomenon, researchers should check the proportion of significant studies and whether all of them confirm the hypothesis.” I would love to see the authors explore the counter narrative here. What if the authors present a mixture of results? How should this be interpreted? Is it evidence of integrous reporting and/or does it suggest that the hypothesis is context dependent? Also, I would interpret things differently if the effects are all in the same direction but not all statistically significant, as opposed to some effects being strongly negative and others strongly positive.

“If a study has only been replicated by the original authors, it can be indicative of nobody else being interested in the phenomenon (i.e., low replication value) or nobody else being able to provide evidence for it (i.e., high uncertainty). For example, it is possible that reports of failed replications are held back by reviewers due to an aversion to null findings, replications, or findings criticizing their own work.” This feels like a number of interesting reasons squashed into a paragraph. I suggest discussing each of these in their own paragraphs.

“As replications can also be used to probe a phenomenon’s generalizability, a lack of variety in study designs can motivate a replication attempt. If there is reason to assume that a phenomenon is highly dependent on context (e.g., works only for graduate students, with English-speaking people, when people are incentivized, for the chosen stimuli, …), it can be replicated and extended in other contexts. More generally, when background factors are introduced to a study (e.g., there was a positive correlation in study X but researchers suspect it to vanish under condition M), the original finding needs to be replicated in a part of the new study for the argument to work. An added benefit of this is to help avoid later claims of ‘hidden moderators’ in original studies; an argument which has been used previously to refute the validity of replication study results (Zwaan et al., 2018).”  This is an interesting and important point but I don’t think it belongs in the section on uncertainty – it feels like it disrupts the flow of the section.

3.4 (potential) researcher bias

“Certain studies are more likely to be chosen for replication than others” I would like to see more information about what types of studies are more likely to be chosen than others. Or is that what is discussed below? The references are different so I’m not certain.

4. Planning and Conducting Reproductions and Replications

Table 4.1. I think this table belongs in the Reproduction and replication section instead of Table 2.1 – or as some kind of combined table. I think that the descriptions in the goals column could be made a little shorter to improve the display (start by removing ‘fear of the’).

4.1 Post Publication Conversations.

Does this belong here? It almost feels like it belongs up in section 3, which mostly speaks to things to consider when selecting research to replicate. But section 3 would need to be reformulated as “selecting studies to replicate”.

4.2 Reproduction before Replication

“In addition, we recommend testing the robustness of the original finding by making small alterations to the data processing and analyses procedure (robustness reproductions)….” This section wants its own paragraph.

I think there should be a discussion here about what ‘the same’ means, or a reference to where this is discussed later.

5.4 Preregistration

I would remove this: “It should be noted that a preregistered analysis plan or analysis script is much easier to create with access to data and reproductions are impossible with unavailable data, preregistration cannot exclude the risk of authors having already looked at the data, yet making fraudulent claims regarding data access in a preregistration is evidently academic misconduct. How much weight readers and reviewers will give to a preregistration based on data that could have been accessed already will differ, but generating it is a way to keep ourselves accountable and produce robust reproductions.” It’s a little confused, and I don’t think concerns about other people potentially fraudulently creating preregistrations, or how other people perceive them, should factor into the choice to develop one for a reproduction/replication. Preregistration is vital in this context because you really need to stick to the original research plan, so people should do them regardless of how others perceive them.

Also, given the difficulty of developing a useful preregistration for a conceptual replication without looking at the code and data, it might be worth suggesting developing the reproduction protocol on a randomly re-ordered version of the data so that model assumptions and distribution of data can be viewed without providing any directional information about whether tests confirm or disconfirm the original study
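
The blinding procedure suggested above can be sketched roughly as follows (a minimal illustration, not from the handbook itself; the data structure and the `outcome` column name are hypothetical, and a real blinding procedure would also need to handle linked variables and metadata):

```python
import random

def blind_outcome(rows, outcome_key="outcome", seed=42):
    """Return a copy of the data with the outcome column randomly
    re-ordered, so distributions and model assumptions can be inspected
    without revealing the direction of any association."""
    rng = random.Random(seed)
    shuffled = [row[outcome_key] for row in rows]
    rng.shuffle(shuffled)
    return [
        {**row, outcome_key: value}
        for row, value in zip(rows, shuffled)
    ]

# Hypothetical example: the predictor-outcome pairing is destroyed,
# but the marginal distribution of the outcome is preserved.
data = [{"predictor": i, "outcome": 2 * i} for i in range(10)]
blinded = blind_outcome(data)
```

The replicator can then write and preregister the analysis protocol against `blinded`, and only run it on the intact data once the protocol is frozen.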

5.5 Deviations

I don’t think this bit belongs here “Note that some journals’ publishing reproductions require adherence to special requirements such a Registered Report format (e.g., Journal of Open Psychology Data) or including a minimum of two independent reproductions (e.g., Journal of Robustness Reports).” Because it doesn’t relate to deviations.

6.1 Preregistration of registered replication reports

“A preregistration without an analysis plan provides no safeguard against p-hacking (Brodeur, Cook, et al., 2024). Beware that these criteria can be structured sequentially. For example, if there is a manipulation check, it can be defined that it has to work for the replicability to actually be evaluated. Boyce et al. (2024) also found that repeating unsuccessful replications did not change the outcomes unless obvious weaknesses were fixed” This feels like incomplete advice about how to implement preregistration. I believe that what is needed here is just to say that you need to do a preregistration, and provide links to resources that help with that – it’s too complicated and context-specific to describe here.

“Finally, replication researchers need to deal with deviations from their preregistration in a transparent way. In principle, there is nothing wrong with deviating from what one had planned but most importantly, all changes should be listed, discussed, and it should be made transparent how the changes affected the results (for recommendations on changes and documentation, see Lakens (2024), and Willroth & Atherton (2024). If changes are noticed during the data collection, many platforms also allow the upload of amendments with preserved version history.” In the section about reproducibility, deviations get their own section header but here they’re in the preregistration section. It should be consistent in presentation. I prefer deviations with preregistration but either would be fine.

6.2.1 small telescope approach.

I haven’t heard of this approach before, and I feel like I don’t understand it very well after reading it here. It seems like a very complex thing to calculate… and then there’s something about just applying a rule of thumb based on Simonsohn (2015). That doesn’t seem right. Could the authors have another go at the explanation?

6.3 Changes in the Methods

I would like to see this section tied back to the deductive/inductive reasoning bit. For example, discussing how the different changes relate to the theory underlying the test.

6.4 Piloting

The two paragraphs on piloting don’t feel like they speak to each other. It would be worth rewriting to link them more

6.4.1 Collaborating and Consulting with the Original Authors

This is all really interesting but there are too many examples listed. I recommend just keeping the most interesting one as an example and citing the others

6.5. Adversarial Collaborations

As it is written, I don’t think this section belongs in the book. The authors should consider expanding it and relating it to repetitive studies or removing it

7. Discussion

Table 7.1. I think this needs more unpacking and explanation. There’s a lot of information here and it’s all really important. I would recommend breaking it down into sections. For example:

7.1 Operationalising replication success: same data, same analysis.

Description of the broad categories of approaches, what they mean and their pros and cons.

Table of the specific versions people have used, how they relate to the categories you’ve specified, and references to articles that describe/have used them.

 

7.2 Operationalising replication success: same data, new analysis.

Description of the broad categories of approaches, what they mean and their pros and cons.

Table of the specific versions people have used, how they relate to the categories you’ve specified, and references to articles that describe/have used them.

 

7.3 Operationalising replication success: new data, same analysis.

Description of the broad categories of approaches, what they mean and their pros and cons.

Table of the specific versions people have used, how they relate to the categories you’ve specified, and references to articles that describe/have used them.

 

7.4 Operationalising replication success: new data, new analysis.

Description of the broad categories of approaches, what they mean and their pros and cons.

Table of the specific versions people have used, how they relate to the categories you’ve specified, and references to articles that describe/have used them.

 

7.5 general guidance on how to choose an appropriate operationalisation for your purpose and interpretation

 

7.2 interpreting divergent results.

I think this section needs to link explicitly with the section on how to operationalise replication success, because the interpretation of divergent results should be different depending on which operationalisation you’re using. This should also be linked back to the deductive/inductive argument from earlier in the book.

 

7.2.1 Hidden moderators

Could you describe more about how these hidden moderators lead to bias in replication studies? Bias towards what kind of results, or towards the selection of which kinds of studies?

Figure 7.1 – I love this figure and I would like to see it linked in with more parts of the text

 

Part III Advanced Topics and Applications

8: communicating and publishing. I think this needs to be unpacked a little more, perhaps separating publishing in a traditional sense from communicating in another way, and making sure to mention that you can do either or both.

 

9: field specific replication challenges

It feels a little incongruous to do a deep dive into neuroimaging here given how briefly topics are treated in the rest of the book. I suggest paring this back to be in balance with the rest of the book, and adding another section about identifying the challenges that are specific to your field/topic of research, because all fields have different challenges.

 

10.1 Reproductions and Replications Checklist

I think that a checklist is maybe not the right format for this – checklists should be for more concrete things, and these are more high-level recommendations. I agree with them completely, though. Great recommendations.

Competing interests: None.

Peer review 2

Vittorio Iacovella

DOI: 10.70744/MetaROR.193.1.rv2

This is a complete and useful text.
It attempts to connect the practical task of repeating previously conducted studies with broader theoretical and scientific implications. It successfully augments standard “how-to” checklists and discussions regarding repetition attempts by introducing a rigorous theoretical framework to characterize these contributions. It achieves this by first fine-tuning the terminology (repetition, reproduction, replication), characterizing these tasks as distinct phases, and then establishing the scientific dynamic between them.
The discussion is precise and well-referenced. Every step of the repetition task is thoroughly explained, including steps that are commonly underestimated, such as the choice of the target study. The sections are self-contained, allowing casual readers to focus on individual topics. They are also enriched with many visual elements (e.g., figures, tables) that appear ready to be extracted for use in presentations and classes.

In my opinion, the strongest section is the first one, concerning the foundations, where repetitions are characterized as a fundamental scientific task. This is particularly relevant to the argument that Open Science practices are not merely bureaucratic tasks to fulfill for funding agencies, but rather high-level, scientifically and epistemologically meaningful operations of collaborative (in space and in time), modular, and progressive knowledge construction.

I appreciated the effort to uncover the “Pandora’s box” of replications in MRI. Such an effort, however, may require some fine-tuning and substantial integration from both theoretical and practical standpoints.

Finally, I would have appreciated the mention of at least three generic topics which, to my knowledge, were not sufficiently addressed:

  1. Data Standardization: an ongoing discussion with references to data standardization would be beneficial. “FAIR-ification” reflects the philosophy behind this handbook: while it may sound like a practical task, finalizing it requires a clear understanding of the scientific implications of data planning, acquisition, and dissemination. Furthermore, FAIR-ification solves several problems mentioned in the manuscript, especially those related to the human-in-the-loop presence (e.g., those described in Section 5).
  2. Constraints on Generality: An assessment of Constraints on Generality (briefly mentioned by referencing Simons et al., 2017, in Section II.6.3) is missing, particularly regarding the fact that participants are often recruited from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. This could be linked to the excellent discussion on contextual sensitivity (Section I.2.3). This represents an opportunity to present this problem to larger, potentially unaware audiences and to link it to actual scientific arguments related to generalizability.
  3. Incentives and Career Implications: Within the text, there is a commendable attempt (Section III.8 – Communicating and Publishing) to outline where to publish reproduction and replication efforts. However, the motivation for researchers to prioritize these tasks – given current academic incentive structures – is somewhat overlooked. The key reference to Koole and Lakens (2012) regarding rewards is effectively relegated to a single cell in Table 8.1. A more explicit discussion on the career implications (both risks and benefits) of engaging in replication work would be highly valuable, particularly for Early Career Researchers.

In addition to this general note, I enclose a document with further section-specific comments and a quick list of typos and/or rendering problems.

Overall, I enjoyed the read and remain available for any further discussion regarding my comments.

 

Additional detailed feedback:

Conventions
  • I will use the following convention to associate a comment with the related text: <part>.<section>-p<page>-[…]related text[…], e.g.:
    II.2.1-p12-[…]studies using the same data[…]
  • and the following convention to associate a comment with a related figure or table: {Figure, Table} <section>.<number>, e.g.:
    Figure 1.1

Section – specific comments

I.1 Background

Figure 1.1

In this first section, the authors provide an introduction to repeatability, including a plot showing the evolution of the “amount of replication studies” per “year” over the last 70 years. It is not completely clear whether “replication” is used here as a generic label collating both reproductions and replications.

In the text immediately preceding the figure (I.1-p9), the authors make an interesting point regarding the prevalence of replications (or reproductions/replications), presenting a 1% “rate” and citing Perry, T., Morris, R., & Lea, R. (2022). However, upon reading the reference, it is difficult to extract such information, and the overall rate of replication appears to be at least 10 times lower. It might be sufficient to compare the absolute number of replications per year (from the figure) – which amounts to a few hundred – to the overall scientific production, even within a single field.

Moreover, there appears to be a turning point in the volume of “replications” around 2010. It could be useful to either cite the replication/reproducibility crisis here (it is cited in the following section regardless) or add a vertical dashed line to the plot at year = 2010 (though this may be excessive).

I.2 Understanding replications and reproductions

In Section I.2, the authors provide a review of the meanings of the terms “replication” and “reproduction.” The review, particularly regarding replication, is extensive, structured, and easy to read. The distinction between direct and conceptual replication is crucial, yet very often overlooked.

As far as I understand, the text flows as follows: practical considerations, a general reference framework, a focus on replications, and a focus on reproductions. What stands out is that a significant amount of space has been devoted to replications, including a very interesting table that categorizes replication types based on what was replicated. In contrast, much less space is given to reproductions, which are essentially limited to two sentences. While the necessary details are eventually provided in Table 4.1, it might be useful to move this classification system within this section to create a better balance. Presenting the subcategories (numerical vs. robustness) in a table format earlier in the text would offer a more consistent reference framework for readers trying to distinguish between different repetition phases.

Table 2.1

The authors have assembled a useful table (which I can already picture being cited in presentations!) to navigate the findings space. I consider the introduction of potential sequentiality between reproducible and replicable findings – “before attempting a replication, let’s first test reproducibility” – to be a commendable guideline. However, I am not entirely sure about introducing the term “generalizable” here, as its usage is debated (see, for example, experimentology.io, Section 3: Replication). It might be sufficient to mention Section II.4, where available materials and actions are related to study purposes, including all terms mentioned here.

II.3 Choosing the Target Study

In Section II.3, the authors provide a framework to justify the reproduction/replication of previous studies: since it is not possible to reproduce/replicate everything, a system must be developed to decide what is worth repeating. I am not entirely sure if the subsections “Uncertainty > Theoretical contribution” and “Uncertainty > Availability of reproductions and replications” were intended to be enumerated subsections or simply parts of the “Uncertainty” discussion.

I would also add a few words regarding the construction of “replication packages,” an expression I found both concise and meaningful. It may be sufficient to briefly mention that the topic will be expanded upon in Section 5.1.

II.4 Planning and conducting Reproductions and Replications

In this section, the authors introduce both a taxonomy of reproductions/replications according to materials, actions, and purposes, and a sequential flow of operations one could perform. I find the diagram in Figure 4.1 interesting and concise. On the other hand, while the attempt in Figure 4.2 is ambitious, I find the diagram too cluttered. I am not entirely sure how best to restructure the information, but perhaps: rotating the whole decision tree to exploit the vertical dimension of the book page; introducing a vertical axis related to “repetition complexity”; introducing horizontal dotted lines corresponding to “repetition areas”; and transforming the numbered boxes into “repetition area” boxes that act as captions for the current phase.

II.5 Execution of Reproductions

In this section, the authors describe practical caveats when attempting to reproduce, outlining specific solutions for well-known problems, including gathering material and discussing potential deviations from original results.

I like the explicit mention of preregistrations as an open science practice “in action,” which turns out to be useful for open science practices generally. However, it is not completely clear to me whether the authors are referring to preregistering the reproduction (execution stage) or targeting preregistered studies for reproduction (planning stage). Perhaps a few preregistration concepts could be enumerated in the introduction to Section II.5 and distributed over the subsections. For example, preregistrations can be viewed as time-stamped, structured documents; within this structure, the authors of the preregistration should clearly identify, among other things, independent and dependent variables. This might be crucial for identifying claims (II.5.3).

II.6 Execution of Replications

In this section, the authors provide an extensive overview of potential aspects to consider when dealing with replications.

Quick comment on II.6.1: I appreciate mentioning preregistration/Registered Reports (RRs) as a generic trigger for openness. I would be cautious, however, about explicitly stating “A rejection due to the results is not possible at this point”, both because potential outcomes may vary according to journal policies, and to avoid inadvertently characterizing RRs solely as a way to secure a publication (I completely understand this was not the actual point of the section).

II.6.2.1: It may be interesting to expand the point related to “Small Telescope” effects by explicitly relating it to the “Estimating the Reproducibility of Psychological Science” study. The Open Science Collaboration’s paper is referenced many times throughout the manuscript and is also one of the focuses of Simonsohn (2015).

II.7 Discussion

In this section, the authors include a table from another study which serves as a quick but thorough reference linking human-readable conceptual questions with the categories of repetition introduced in Section I.2.1. The table is an excerpt from Heyard et al. (2025). While I appreciate that the excerpt is concise and focuses on the relationship between the repetition question and the approach, I would suggest producing a few different tables according to specific criteria. I am unsure whether a separation based simply on the type of reproducibility investigated is too basic. There could be another dimension along which several methods can be grouped; I do not know if such an effort is worthwhile, but it might at least improve the readability of the table.

The remaining part of the section is an interesting discussion on how to consider replication failures. I really like Figure 7.1. I would suggest expanding the figure explanation with a reference to the concepts introduced in Section I.2.1 (e.g., generalizability).

III.9 Field-Specific Replication Challenges: An example from MRI research

In this section, the authors provide an overview of the common problems researchers face when dealing with the replication of fMRI studies. Specifically (Section III.9.”0″.3), they attempt a quick review of problems in estimating power. While an extensive literature review falls outside the scope of the handbook, I believe there is room for improvement:

  • Mention could be made of the extensive work by Jeanette Mumford, starting from the classic 2012 paper: A power calculation guide for fMRI studies. In a handbook that does not limit references to peer-reviewed papers, I think it would also be worth citing Mumford’s YouTube channel, “mumfordbrainstats,” which serves as a quick but rigorous reference for practical solutions.
  • I would spend at least one sentence citing the impact of Eklund et al. (2016) (Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates) on the entire field (including implications for popular science, sensational titles, etc.).
  • Further contributions worth including are
    • Durnez et al. (2016) (Power and sample size calculations for fMRI studies based on the prevalence of active peaks), which switches the focus from voxels to clusters, and
    • Ellis et al. (2020) (Facilitating open-science with realistic fMRI simulation: validation and application), regarding simulation-based approaches.
  • I would also expand (by 1-2 sentences) the comment on Botvinik-Nezer et al. (2020), integrating the classic “70 teams, 70 different pipelines” observation, which is entirely accurate, by connecting it with the popularization of automated preprocessing pipelines (cited in 9.”0″.5).

Minor comments (typos, choice of words, rendering, etc.)

Overall
  • It could be useful to include a few lines at the beginning of each section to introduce the readers to the topic. This could be of help especially for casual readers / people interested in specific sections.
  • Throughout the text, the authors use paper, article or handbook. While on one hand this makes the text more varied, on the other hand it may generate confusion. I would suggest using “this handbook” and stick to that expression.
  • There are many references to webpages. A timestamp of when each URL was checked should be included, to avoid potential future misinterpretations.
Table of contents

I do not know if it’s a problem with my pdf reader but the Table of Contents is rendered badly on the first page. Section 5: “Execution of reproductions” looks incomplete. Several sections (5.1 … 5.3) are not rendered on the final pdf document.

Section III.9

I do not know if it’s a problem with my pdf reader but the section numbering includes a 0, which is at least not homogeneous with respect to the rest of the text. I would avoid using zeros for section numbering, in any case.

Competing interests: None.

Peer review 3

Wolf Vanpaemel

DOI: 10.70744/MetaROR.193.1.rv3

The Handbook for Reproduction and Replication Studies has a laudable goal. Although replication is well established in theory, it is relatively new in practice, so having a centralized collection of recommended practices—both what to follow and what to avoid—is valuable. However, I find the idea stronger than the current execution. This is perhaps unsurprising, given that the version I reviewed (Preliminary Version 0.1) is explicitly labeled as preliminary throughout.

Assuming the authors intend to revise and expand the handbook, I offer this review in the spirit of highlighting areas where improvements or additions would be beneficial.

Organisation

As a handbook, it is not surprising to find little original insight or tools. Rather, its value could lie in how well it brings together, organizes, and explains existing ones. In its current form, the handbook falls short in this respect, with an organization that seems suboptimal. I suggest the following changes:

  • Chapter 1 (Background) currently consists of only about a page. Unless it is substantially expanded, it would be better integrated as a Section within Chapter 2 (Understanding Replications and Reproductions).
  • For consistency with the book title and the Section 2.1 title, and to reflect the typical chronological order, consider using Reproductions and Replications rather than Replications and Reproductions in the title of Chapter 2.
  • Similarly, Sections 2.3 (Replication) and 2.4 (Reproduction) could be reordered, since reproduction typically precedes replication chronologically.
  • Rename Part II to explicitly include Reproduction, not just Replication.
  • With the exception of its final sentence, it is unclear how Section 3.4 (Researcher Bias) fits within the topic of Chapter 3 (Choosing the Target Study).
  • Chapter 4’s title suggests it addresses planning and conducting replication and reproduction studies, but the content does not fully support this, as it seems to only concern planning.
  • Sections 5.2 (Contacting Authors) and 5.3 (Identification of Claims) are part of a chapter on reproduction, but are also relevant for replication research, not only for reproduction studies. Readers focused solely on replication may skip Chapter 5 and miss these sections. Consider moving these sections to Chapters 3 and/or 4 (which concern both reproductions and replications), or adding cross-references to these sections in Chapter 6 (which concerns replication). Similarly, Section 7.4 (Comments from the Original Study’s Authors) is in a chapter on Replication, but might be missed by readers just looking for help on reproduction studies.
  • The distinction between Sections 5.5 (Deviations) and 5.6 (Analysis) is unclear, as deviations typically arise as part of the analysis. Perhaps they could be integrated.
  • Chapter 7, which focuses on determining replication success, seems integral to the Execution of Replications, which is the exact topic of Chapter 6. It seems that Chapter 7 could be merged with Chapter 6. Further, Table 7.1 also concerns reproduction success, not just replication success, so it is ill-placed in a section of replication success. The part of the table that addresses reproduction success could be moved into a new table in the new section on Defining Reproduction Success I propose below. (If Table 7.1 is not split in two, at least change the rightmost column heading from “Reproducibility” to “Repetition.”)
  • Chapter 8 (Communication) is placed in the “Advanced Topics” part, yet reporting is a fundamental academic skill. Further, most of the meat is just a long table, which seems a better fit in the Appendix. The (little) content that is there would fit better in new Sections 5.8 (Reporting a Reproduction Study) and 6.7 (Reporting a Replication Study). If such sections were added, it is worth noting that much of the content in Chapter 5 is actually about reporting, so it could be moved to this new Section 5.8.
  • In Chapter 10, consider moving the checklist to the appendix and discussing it in the text, so that Chapter 10 functions purely as a conclusion. Further, a single chapter of just 2 pages does not need its own Part (i.e., Part IV).

Several of the conceptual inconsistencies discussed below could also be resolved through improved organization, but I discuss these below.

Content

The intended target audience is not entirely clear. Is the handbook aimed at readers who are new to empirical research in general, or specifically at those new to replication and reproduction studies? If the former, important topics are missing, such as writing guidance, ethical considerations, literature review strategies, project management, and effective visualization, to name just a few. If the latter, some included material is unnecessary, as it is not specific to replication or reproduction. For instance, Section 6.1 contains little replication-specific content and may be already familiar to readers if they are expected to have a background in empirical research.

Chapter 7 (Discussion) misses important content on reproduction research. Given the handbook’s dual focus, I expected parallel sections to 7.1 (i.e., Defining Reproduction Success) and 7.2/7.3 (i.e., Interpreting Reproduction Failures) for reproduction. Alternatively, if Chapter 7 is integrated into Chapter 6, as I suggested above, this issue may be resolved more naturally, and Chapter 5 could then be expanded with sections on reproduction success and failure.

Another severely underdeveloped aspect is data integrity checking. Even when full reproducibility is not possible due to a lack of raw data, partial checks can still be conducted. While tools like statcheck and papercheck are mentioned (p. 27), these represent only a small subset of available approaches when there is no access to raw data. This type of “reproduction light” investigation deserves more than just five lines, especially since sharing raw data, which is needed to do a full-blown reproduction, is still not the norm. The handbook should clarify this and reference additional resources, such as Heathers (2025). Note that Heathers (2025) is mentioned in the handbook, but only in relation to selecting studies (p. 18), not in the context of doing a light-weight reproduction study (p. 27).
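
As a concrete illustration of such a “reproduction light” check, consider recomputing a reported p-value from its test statistic, which is the spirit of what tools like statcheck automate. This minimal sketch (not statcheck itself, which handles many test types; here only a two-tailed z-test, using the standard library) shows the idea:

```python
import math

def recomputed_p_from_z(z, two_tailed=True):
    """Recompute the p-value implied by a reported z statistic,
    via the normal-distribution tail probability P(|Z| > z)."""
    p_one = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * p_one if two_tailed else p_one

def consistent(reported_p, z, tolerance=0.0005):
    """Flag a reported p-value that does not match its test statistic."""
    return abs(recomputed_p_from_z(z) - reported_p) <= tolerance

# Hypothetical reported result: z = 1.96, p = .05 (two-tailed).
print(consistent(0.05, 1.96))   # True: recomputed p is ~.0500
print(consistent(0.01, 1.96))   # False: recomputed p is ~.05, not .01
```

Checks like this require only the reported statistics, not the raw data, which is why they deserve more space as a partial-integrity option when full reproduction is impossible.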

I was most underwhelmed by the lack of hands-on guidance, which is precisely what readers of a handbook would expect. For example, how should tools like Altmetric be used in practice? Now it is only name-dropped, but what steps could one take in practice? Or, more fundamentally, how should researchers navigate the many practical decisions involved in doing a replication study? One important issue is what I refer to as “the same and the same are not the same” (Vanpaemel, 2026), which addresses the ambiguity in determining what it means to do “the same” when doing a replication study. There are many aspects to this question, but an important one relates to the fact that a replicator could either make the same choices as the original authors or follow the same decision strategy. As an example, consider a case where original authors include covariates based on a statistically significant correlation with an outcome variable. Does doing the same involve including the same covariates as the original authors, or does it involve applying the same selection rule to their own data? This choice vs. strategy problem poses a central, practical dilemma for all replicators who do not just talk about replication but actually get their hands dirty and do it, yet it is neither discussed nor acknowledged.
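
The choice-vs-strategy dilemma in the covariate example above can be made concrete with a small sketch (entirely hypothetical: the variable names, data, and the large-sample significance cutoff |r| > 1.96/√n are my own illustrative assumptions, not from the handbook):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def covariates_by_strategy(outcome, candidates):
    """Apply the original authors' *selection rule* to the replication
    data: keep covariates significantly correlated with the outcome
    (large-sample approximation: |r| > 1.96 / sqrt(n))."""
    cutoff = 1.96 / math.sqrt(len(outcome))
    return [name for name, values in candidates.items()
            if abs(pearson_r(values, outcome)) > cutoff]

# "Same choices": simply reuse the original authors' covariate list.
original_choice = ["age", "education"]

# "Same strategy": rerun the rule on the (hypothetical) replication data;
# the resulting list may differ from the original authors' choice.
outcome = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]
candidates = {
    "age":   [1, 2, 3, 4, 5, 6, 7, 8],   # strongly correlated
    "noise": [5, 1, 4, 2, 5, 1, 4, 2],   # essentially unrelated
}
strategy_choice = covariates_by_strategy(outcome, candidates)
```

Here the two routes diverge: reusing `original_choice` keeps both covariates, while the strategy applied to the new data may retain a different set, which is exactly the ambiguity a handbook could help replicators resolve in advance.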

More generally, the handbook is at its best when it discusses (real-life) examples of actual replication research, and the good or bad practices associated with those cases. Any additions in this regard will surely increase the usefulness of the handbook.

Conceptual clarity

The different types and goals (and sorts of conclusions) of replications and reproductions are discussed at several places in the book. This is not bad per se, but they do not seem to be well aligned across these discussions. The types and associated goals are presented as an overview at three places: Table 4.1 (both reproduction and replication), Figure 4.1 (mostly reproduction), and Figure 4.2 (replication). A first problem is that these figures and the table are not internally consistent.

  • The reproduction types of Table 4.1 don’t match the reproduction types of Figure 4.1. In particular, Figure 4.1 but not Table 4.1 mentions a numerical reproduction, a recreate reproduction, and statcheck. Table 4.1 but not Figure 4.1 mentions a computational reproduction (which might just be the numerical reproduction?), a recoding reproduction (which might just be the recreate reproduction?), and a multiverse analysis.
  • The same discrepancy, but this time about replication, can be found by comparing Table 4.1 and Figure 4.2. In particular, Figure 4.2 but not Table 4.1 mentions two types of conceptual replications. Table 4.1, but not Figure 4.2, mentions a close replication with extension. Both the table and the figure include exact replication, but use it to refer to different types (the exact replication of Figure 4.2 is the internal replication of Table 4.1). Both the table and the figure include constructive replication, but Table 4.1 treats it as identical to a conceptual replication, whereas Figure 4.2 treats them as distinct.
  • A similar conceptual sloppiness relates to the goals of a replication. Replicability, excluding alternative explanations, and testing boundary conditions are all mentioned as replication goals in Figure 4.2 but not in Table 4.1. Figure 4.2 mentions the validity of an effect as a replication goal, whereas Table 4.1 mentions the validity of materials, variables, design, and theory, but not of an effect.

Secondly, the sloppiness is not restricted to Figures 4.1, 4.2, and Table 4.1. Frustratingly, the types and goals discussed in the overview of Table 4.1 and Figures 4.1 and 4.2 are not well-aligned with what is written elsewhere in the book, such as in Section 2.3 (Types of replication), Section 2.4 (Types of reproduction), and Chapter 7.

  • Section 2.4 mentions two types of reproduction (numerical and robustness), while Table 4.1 mentions three (or perhaps four; it is not clear what the exact status of a multiverse analysis is according to the authors). Section 2.4 should at least be made consistent with Chapter 4, but ideally, Section 2.4 should just be merged with Chapter 4.
  • Similarly, in several places, certain aspects are mentioned as being the goal of (a type of) replication but are not mentioned at all in Table 4.1 and Figure 4.2: credibility (for a close replication; p. 13); the correctness of the hypothesis (p. 16); corroboration (p. 20); the veracity of the original claim (p. 20); the robustness of an existing finding (p. 15; p. 54); the validity of a claim (p. 15). Why are credibility, correctness, corroboration, and veracity absent from Figure 4.2 and Table 4.1 (and is robustness only mentioned in connection with reproduction)? Moreover, even if Section 2.3 (Types of replication) were to align perfectly with the content of Chapter 4, it seems wasteful to simply repeat it. I recommend trying to integrate Section 2.3 into Chapter 4.
  • A “multiverse analysis” is only mentioned in Table 4.1 but is not discussed further in the book.
  • Chapter 7 again (for the third time!) discusses different replication types and their different goals and interpretations, and yet again, a different spin is put on it. For example, in Table 7.1, the “Question answered” for the significance criterion talks about replication when the type includes “same data – different analysis”, a situation which in Table 4.1 is considered a robustness reproduction, not a replication.
  • Figure 7.1 yet again discusses different replication goals. The figure suggests that a close replication has something to say about robustness, which might be true, but this is at odds with Table 4.1, where robustness is related to reproduction and a close replication is concerned with generalizability.
  • A similar internal inconsistency relates to Figure 4.2 and Section 7.2. Both discuss explanations for failed replications, but do so very differently.

As an aside, I do not see how checking the adherence to the preregistration is part of a reproduction (p. 27), so this section might need to be moved (or removed, if not part of a replication or reproduction study).

Overall, it seems that different people have authored different chapters (which is totally fine), but did not read each other’s work (which is less fine). It should not be put on the reader to integrate all the different perspectives. Given how the types, goals, and interpretations are so dispersed across the handbook, it is very hard to get your head around it. While these differences may be reconcilable, I would expect the authors to do the work of streamlining and organizing these concepts, rather than leaving the burden on the reader.

Notes

I used generative AI solely to improve the clarity of my writing.

Competing interests: Some authors of the handbook are Associate Editors of the Methodology & Research Practice section at Collabra, where I currently serve as Senior Editor, and another author is the Editor-in-Chief of Collabra. Because maintaining positive working relationships is important to make the journal run, I might have been positively biased towards their work. However, this has not prevented me from providing a critical review. I have authored a book for a similar audience with similar aims (though more focused on reproducibility checks without access to raw data), which will be available soon on earsguide.io: Vanpaemel (2026), Evaluating and replicating studies: A practical guide.

Author response

DOI: 10.70744/MetaROR.193.1.ar

We thank the MetaROR editor and the three reviewers, Hannah Fraser, Vittorio Iacovella, and Wolf Vanpaemel, for their careful, constructive, and detailed engagement with Version 0.1 of the Handbook for Reproduction and Replication Studies. Reviewing an entire book at this level of detail is no easy task, and the feedback you provided will help us make this resource better.

We are encouraged that the reviewers see the handbook as a timely and valuable resource for researchers conducting, interpreting, and communicating reproduction and replication studies, and are grateful that they recognise the strengths of the conceptual foundations, the discussion of inductive and deductive perspectives, and the practical ambition of the project.

Reading the reviews together, we see a key area to focus our revisions on: the handbook currently reads in places as chapters developed in parallel without enough cross-talk between them, which reflects the collaborative and distributed history of the project. This shows up as inconsistencies in core concepts across chapters, uneven depth of treatment, and a lack of clarity on what is assumed about the reader. Addressing this will be the central focus of the revision.

To that end, we will focus on the following key areas:

  • Strengthening the overall organisation and editorial coherence of the handbook, including clearer signposting, reduced repetition, better cross-referencing, and a more uniform tone that addresses newcomers to the field of repeatability research.
  • Harmonising the conceptual framework across chapters, figures, and tables, especially in relation to the types, goals, and interpretations of reproduction, replication, robustness, and generalisability.
  • Making the handbook more practically actionable by adding some worked examples, clearer decision guidance, and more detailed explanations of concepts and tools that are currently introduced too briefly.

We will also expand several substantive areas highlighted by the reviewers, including constraints on generality, data standardisation and FAIR practices, incentives and career implications, data-integrity checks and lightweight reproduction approaches, the interpretation of reproduction and replication success or failure, and the practical question of what it means to repeat “the same” study. Finally, we will address the many valuable section-specific comments concerning wording, supporting references, figures, tables, rendering issues, and consistency of terminology.

We will work through these revisions over the coming months. To make the process transparent, we will track changes and revision priorities on GitHub and add the repository link here: https://github.com/forrtproject/replication_handbook. Any interested colleagues are welcome to contribute to the handbook there, to make this an impactful community resource before publication in book form towards the end of 2026.
