From misconduct to reform: Understanding perceptions of those who commit & call out questionable research practices

Savannah C. Lewis¹, Alexa M. Tullett²

¹ Psychology Department, The University of Alabama, Tuscaloosa, Alabama, USA
² Department of Psychology, University of Alabama, Tuscaloosa, Alabama, USA

Originally published on December 12, 2025 at:

https://doi.org/10.71240/lcyc.359952

Abstract

Scientific integrity depends on ethical and transparent research practices, yet surveys reveal that many researchers engage in questionable research practices (QRPs), ranging from minor issues (e.g., unclear preregistration) to major misconduct (e.g., data fabrication). This study investigates how people perceive researchers who commit QRPs (investigators) compared to those who report them (inspectors). Participants (N ≈ 566) will read three hypothetical vignettes describing an investigator engaging in a QRP and an inspector who reports it. Participants will then evaluate both researchers on trustworthiness and likeability. They will also rate the perceived trueness of the original finding, and compare both researchers on a range of positive attributes. Thus, the overall design is a 2 (role: investigator vs. inspector) x 3 (QRP severity: minor vs. moderate vs. major) fully within-subjects design. Multilevel models will test whether perceptions vary by role and QRP severity. This research will deepen our understanding of how accountability in science is socially evaluated, and how the severity of misconduct shapes views of both those who commit and those who call out QRPs.

Full text

Introduction

The replication crisis in psychology has highlighted shortcomings in common research practices that undermine the knowledge that is generated. These questionable research practices (QRPs), such as manipulating analyses to achieve statistical significance (p-hacking), hypothesizing after results are known (HARKing), prematurely examining data, relying on small sample sizes, and providing insufficient transparency, have been identified as key contributors to low replicability among scientific findings (Ioannidis, 2022; Nosek et al., 2018; Chambers, 2017; Sijtsma, 2016; Simmons et al., 2011).

The growing attention to the replication crisis has also sparked a range of perspectives on the roles of researchers. While some may view those who highlight scientific shortcomings as individuals who uphold scientific integrity, others might see these individuals as disloyal or damaging to the reputation of the field. Similarly, researchers engaging in QRPs may be judged as unethical or, alternatively, as individuals navigating systemic pressures. Understanding how these figures are perceived in relation to each other can provide insight into how trust in psychological science is shaped.

Importantly, not all QRPs are regarded equally, which may influence how researchers are judged. For example, making up data or committing fraud is often seen as a high-severity offense and can lead to serious consequences, such as losing one’s job or academic degree. In contrast, QRPs like overgeneralizing findings or using small sample sizes tend to result in much lighter (or sometimes no) consequences, such as a commentary published about the paper or, if noticed during the review process, a request to revise the writing or collect more data.

While the consequences for QRPs vary and are often inconsistent, researchers are now beginning to rank the prevalence and perceived severity of these practices systematically (Larsson et al., 2023; Bottesini et al., 2022). Overall, Larsson et al. (2023) found that the QRPs researchers perceived as more severe tend to occur less frequently, whereas more common practices are viewed as less severe. Notably, HARKing was among the ten QRPs most frequently reported by researchers, whereas p-hacking, which was rated as highly severe, was among the ten least frequently reported.

Meanwhile, Bottesini et al. (2022) examined participants’ perceptions of HARKing, selective reporting, p-hacking, and data fraud. They found that HARKing, selective reporting, and p-hacking were rated unacceptable roughly 68–69% of the time (68.7%, 69.2%, and 68.3%, respectively), whereas data fraud was deemed unacceptable by 81.3% of participants. These results suggest that p-hacking and HARKing were perceived as more similar in severity in Bottesini et al.’s sample than in Larsson et al.’s (2023) sample, which in turn suggests that the prevalence of a QRP may influence how it is perceived.

Broadly, the literature suggests that QRPs are more common than many would expect. Swift et al. (2022) report frequencies of 65% for faculty and 50% for students. Other studies are even more pessimistic, with more than 90% (90.3%, 94%, and 96%) of their samples admitting to engaging in at least one QRP (Artino et al., 2019; Isbell et al., 2022; Larsson et al., 2023).

QRPs undermine the trust that science seeks to build with the public, with policymakers, and with future generations of researchers (Wingen et al., 2020; Anvari & Lakens, 2019). When published findings are biased, manipulated, or even fraudulent, researchers risk wasting time and resources exploring theories that may have little support. Conversely, researchers may also spend significant effort correcting exaggerated or incorrect claims. That is why it is essential to ensure our field produces reliable, credible research while weighing the need to correct misinformation or fraudulent research.

Despite growing recognition of how QRPs threaten the credibility of science, people do not always revise their beliefs when confronted with evidence that a result may be unreliable. Research on misinformation demonstrates that belief updating is difficult, even when corrections are clear and well-supported. False or misleading information can continue to influence memory and judgment long after it has been corrected because familiarity and source credibility bias how new information is integrated (Pennycook & Rand, 2021). Translating this insight to scientific contexts raises a critical question: when individuals learn that a researcher engaged in a questionable practice, do they adjust their belief in the validity of the original finding, or do they view the researcher’s behavior as unrelated to the result itself? In the current study, it is predicted that participants will recognize the connection between the original finding and the identified QRP, such that perceived trueness of the finding will decrease as QRP severity increases.

In an effort to minimize QRPs, the field has increasingly adopted responsible research practices (RRPs), which promote transparency, replicability, and robustness in research findings (Schooler, 2014; Uhlmann et al., 2019; Anderson & Maxwell, 2017; Nosek et al., 2015; Munafò et al., 2017; Miguel et al., 2014; Frankenhuis & Nettle, 2018; LeBel et al., 2017). While these practices do not completely eliminate QRPs, they do introduce barriers that make it more difficult, or less advantageous, to commit them. However, these barriers lose much of their efficacy unless members of the scientific community actively monitor and examine each other’s work.

With the shift toward RRPs, scientists who take on this responsibility of monitoring, investigating, and “calling out” QRPs often gain access to data or materials that allow them to detect and report misconduct in ways that mirror the role of internal whistleblowers in traditional workplaces. The International Anti-Corruption Academy highlights four key characteristics of whistleblowing: it typically involves 1) wrongdoings connected to the workplace, 2) ethical, legal, or safety violations, 3) a decision to report, and 4) a concern for public interest (Scaturro, 2018). Taking these four components into consideration, this paper defines a scientific inspector as an individual who raises concerns about discrepancies, misconduct, or actions that compromise the integrity of scientific research, and reports these concerns to an appropriate authority, either externally or internally.

Although inspectors’ work can increase the quality of research that informs policy, treatments, funding, and future work (Fanelli, 2018; Wingen et al., 2020; Yong, 2017), they also receive negative labels, like “data police” or “vigilantes.” This may be because they are sometimes perceived as disloyal or disruptive rather than as individuals upholding scientific standards (Cheliatsidou et al., 2023). The literature reveals that these individuals, across disciplines, often face professional retaliation, social isolation, and emotional distress as a result of their disclosures. According to self-reports compiled by Lubalin and Matheson (1999), over 60% of scientific whistleblowers experienced at least one negative consequence, including being pressured to withdraw their allegation, ostracized by colleagues, threatened with lawsuits, or subjected to reductions in research support. Approximately 10% reported significant career consequences, such as being fired or losing critical funding. Notably, however, fewer than 18% of those who experienced the most severe career impacts stated they would be unwilling to report misconduct again. These findings underscore both the personal cost and the moral conviction associated with whistleblowing.

Lubalin et al. (1995) demonstrated that, while whistleblowers tend to experience more immediate professional retaliation, individuals who are accused (and eventually exonerated) suffer worse long-term personal outcomes, such as poor mental or physical health. Together, these findings highlight that both roles endure negative consequences, but of different kinds: the whistleblower faces more social consequences, while the accused researcher may experience more personal consequences. Lubalin et al. (1995) also suggest that whistleblowers are viewed less positively than their accused counterparts while the investigation or reporting process is under way. These findings bear directly on how a whistleblower will be perceived socially (i.e., in terms of integrity or likeability) during the process of blowing the whistle. This issue is particularly pressing for the growing number of early-career scientists, who may lack the institutional power or support to navigate the potential fallout of whistleblowing.

Much of the existing literature has focused on the act of whistleblowing itself, examining individuals’ intentions to report misconduct (Abraham et al., 2023), the frequency of disclosures (Artino et al., 2019), barriers to reporting and strategies for overcoming them (Devine & Reaves, 2016), and general attitudes that discourage or enable whistleblowers (Cheliatsidou et al., 2023). However, research specifically on scientific whistleblowing remains limited and does not adequately explore the factors that influence how whistleblowers and wrongdoers are perceived by laypeople. This study addresses that gap by directly comparing perceptions of investigators and inspectors and further examines the moderating role of severity of the QRPs.

Although trustworthiness and likeability are often positively related, prior work suggests they can diverge in contexts involving norm enforcement. For example, Monin et al. (2008) found that “moral rebels” (i.e., those who refuse to go along with morally questionable tasks) are judged as principled and trustworthy but also tend to be disliked. Similarly, Cheliatsidou et al. (2023) report that whistleblowers are recognized as upholding ethical standards yet are often seen as disloyal or disruptive. These findings suggest that inspectors may be evaluated as more trustworthy because of their integrity, but less likeable because their actions generate social tension. Therefore, this study predicts that inspectors will be rated higher on integrity than investigators, but that this advantage will be reduced (or even reversed) for ratings of likeability. These trends will shift in favor of the inspectors as QRP severity increases. In other words, as QRPs become more severe, ratings of the integrity and likeability of inspectors relative to investigators should increase. In this framing, integrity reflects judgments about a researcher’s reliability and honesty, whereas likeability reflects social ease and loyalty.

By investigating how integrity and likeability vary based on role and QRP severity, this study expands existing knowledge in two key ways. First, it introduces a novel framework for examining scientific whistleblowing outside of industrial contexts. Second, it explores how the nature of the offense interacts with perceived roles to shape social judgments in science. In doing so, this work contributes to a more nuanced understanding of how ethical accountability is recognized and rewarded, or not, by the broader public. Lastly, this project was modeled on Ebersole et al. (2016), and their scale of attributes is therefore included for an exploratory analysis. As the scientific community continues to emphasize transparency and responsible research practices, understanding the perceptions of scientific investigators and inspectors becomes crucial to fostering an environment where ethical accountability is supported rather than punished.

Method

The current study

The present study employs a 2 (role: investigator vs. inspector) × 3 (QRP severity: minor vs. moderate vs. major) fully within-subjects design. All participants will read three vignettes, one for each level of severity, that describe an investigator committing a QRP, and an inspector reporting it. For each vignette, perceptions of the integrity, likeability, and positive attributes of both researchers will be assessed as well as the perceived trueness of the original finding.

Research Question 1: How is integrity influenced by role and QRP severity?

Hypothesis 1a: Inspectors will be rated higher in integrity than investigators.
Hypothesis 1b: The relative integrity of inspectors versus investigators will increase as QRP severity increases.

Research Question 2: How is likeability influenced by role and QRP severity?

Hypothesis 2a: Investigators will be rated as more likeable than inspectors.
Hypothesis 2b: At low QRP severity, investigators will be rated as more likeable than inspectors, but at high QRP severity, inspectors will be rated as more likeable than investigators.

Research Question 3: Is the perceived trueness of the finding influenced by QRP severity?

Hypothesis 3: Perceived trueness will decrease as QRP severity increases.

Participants

Participants will be recruited through the University of Alabama participant pool and compensated with course credit. Participants will be excluded if they 1) do not consent, 2) complete the study in less than seven minutes, 3) exhibit no variability in their responses (e.g., selecting all 5s or all 3s) across all trust and likeability items, or 4) fail to complete the trust and likeability ratings for at least one vignette.
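The exclusion rules above can be sketched as a simple screening function. This is a minimal illustration rather than the project's analysis code: the `Response` record and its field names are hypothetical, and rule 4 is implemented under one possible reading (a participant is dropped when no vignette has a complete set of ratings).

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_DURATION_MINUTES = 7  # completion-time cutoff stated in the exclusion criteria


@dataclass
class Response:
    """One participant's record; all field names are hypothetical."""
    consented: bool
    duration_minutes: float
    # One sequence per vignette holding that vignette's trust and likeability
    # ratings (1-5); None marks an item the participant skipped.
    vignette_ratings: Sequence[Sequence[Optional[int]]]


def is_excluded(r: Response) -> bool:
    """Return True if any of the four exclusion rules applies."""
    # Rule 1: no consent.
    if not r.consented:
        return True
    # Rule 2: finished in less than seven minutes.
    if r.duration_minutes < MIN_DURATION_MINUTES:
        return True
    answered = [x for v in r.vignette_ratings for x in v if x is not None]
    # Rule 3: no variability (e.g., all 5s or all 3s) across all rated items.
    if answered and len(set(answered)) == 1:
        return True
    # Rule 4 (assumed reading): exclude when no vignette has complete ratings.
    has_complete_vignette = any(
        len(v) > 0 and all(x is not None for x in v) for v in r.vignette_ratings
    )
    return not has_complete_vignette
```

Applying `is_excluded` to every record before analysis would keep the screening step explicit and reproducible; a stricter reading of rule 4 (excluding anyone with at least one incomplete vignette) would only change the last two statements.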

A priori power analyses were conducted using a multilevel modeling framework that accounted for clustering of repeated vignette ratings within participants. The design effect was calculated as a function of cluster size (6 ratings per participant) and varying intraclass correlation coefficients (ICCs). The ICC reflects the proportion of variance in responses that is attributable to differences between participants. When a person answers similarly across vignettes, the ICC is higher: their responses are more alike, so each additional rating provides less unique information and more participants are needed. When responses differ more across vignettes, the ICC is lower, which lowers the required number of participants because each rating provides more unique information. Required sample sizes ranged from 324 participants at ICC = .20 to 566 participants at ICC = .50 (see Table 1). Assuming a conservative ICC of .50, the required sample size was estimated at 566 participants to achieve 80% power for detecting a small effect size (d = .24, f² ≈ .014) with Bonferroni-corrected α = 0.001 for five planned comparisons. Accordingly, the recruitment target is set at 566 usable participants. Data collection will continue until this threshold is reached, or until March 2026, whichever occurs first.

**Table 1.** A priori power analysis
Effect size	ICC	Total long-format rows	Participants needed
d = .24, f² ≈ .014	0.20	1938	324
d = .24, f² ≈ .014	0.35	2665	445
d = .24, f² ≈ .014	0.40	2907	485
d = .24, f² ≈ .014	0.45	3150	525
d = .24, f² ≈ .014	0.50	3392	566

Note. Required sample size estimates were calculated using a multilevel design-effect approach with cluster size = 6 ratings per participant, α = .001 (Bonferroni-corrected), and 80% power to detect a small effect size (d = .24, f² ≈ .014).
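The design-effect inflation behind Table 1 can be illustrated with a short sketch. One value here is an assumption and is not reported in the text: the unclustered requirement of roughly 969 observations was back-computed from the table rows, so the snippet shows how the inflation step works rather than how the underlying power calculation was run.

```python
import math

RATINGS_PER_PARTICIPANT = 6  # cluster size m reported in the power analysis
# Assumption: required number of observations before design-effect inflation.
# Back-computed from Table 1; the text does not report this value.
NAIVE_OBSERVATIONS = 969.1


def design_effect(icc: float, m: int = RATINGS_PER_PARTICIPANT) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc


def required_sample(icc: float) -> tuple[int, int]:
    """Return (total long-format rows, participants needed) for a given ICC."""
    rows = NAIVE_OBSERVATIONS * design_effect(icc)
    participants = math.ceil(rows / RATINGS_PER_PARTICIPANT)
    return round(rows), participants


for icc in (0.20, 0.35, 0.40, 0.45, 0.50):
    rows, n = required_sample(icc)
    print(f"ICC = {icc:.2f}: {rows} rows, {n} participants")
```

Under the assumed base value, the loop reproduces the rows and participant counts listed in Table 1, which makes the link between a higher ICC and a larger recruitment target concrete.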

Materials

All materials, our initial proposal, and our pre-registration can be found on the project page at (https://osf.io/dvpt8/). Ethics approval, data, analysis code, and codebook will be added once available. Participants will be guided to a survey link created on the open-source software Formr (Arslan et al., 2020: https://diss-ss-ss.formr.org).

Vignettes

Each participant will read three vignettes. The vignettes were created using scenarios from Ebersole and colleagues (2016) as a guide. Ebersole and colleagues used a similar study design to compare replicators and original researchers. The current project’s vignettes adopt this same structure, but instead of a replication, the second researcher conducts a robustness check or review and then reports their concerns to the original researcher’s institution. All vignettes begin with a sentence about Researcher X (the scientific investigator) and continue with a sentence about the QRP they engaged in. The final part describes Researcher Y (the scientific inspector) writing a blog post that provides evidence for their suspicions of Researcher X, together with a statement on the implications of the QRP. The selective reporting vignette illustrates this structure:

Researcher X conducted a study and found some interesting results. Researcher Y reviewed the data and realized that Researcher X only reported the findings that confirmed their hypotheses, and failed to mention other findings that challenged their hypotheses. Researcher Y writes a blog post presenting evidence on how Researcher X cherry-picked the findings they reported to make their results seem stronger. Researcher Y emphasizes that this lack of transparency misleads the scientific community and undermines the integrity of science.

Vignettes vary by severity of QRP. There is one scenario at each level: overreaching abstract (minor), selective reporting (moderate), and data fabrication (major). Severity was informed by a combination of theoretical and empirical sources. Kolstoe’s (2023) spectrum of QRPs conceptualizes QRPs as a continuum that runs from minor error through misunderstanding, sloppiness, incompetence, falsification, and fabrication, and ends with criminal misconduct. Accordingly, an overreaching abstract was classified as minor, as it represents misunderstanding or sloppiness.

To determine the moderate QRP, evidence from Larsson et al. (2023) was used. They reported a mean severity of 4.3 (on a 5-point scale) for selective reporting, cherry picking, and selective analytical choices, which suggested that these QRPs were moderately severe. The final level of QRP drew on both Bottesini et al. (2022) and Kolstoe (2023). Bottesini et al. (2022) found that data fraud was deemed unacceptable by 81.3% of participants. These results led to placing selective reporting in the moderate category and data fraud in the major category. Consistent with Kolstoe’s spectrum of questionable research practices, the present vignette also places data fraud at the more serious, falsified, or fraudulent end of the continuum.

Integrity

To assess the perceived integrity of each researcher (investigator and inspector), the integrity section from the trust in scientists scale developed by Cologna and colleagues (2025) will be utilized. This scale is grounded in theoretical and empirical work conceptualizing trust as a multidimensional construct composed of four key dimensions: competence, benevolence, integrity, and openness. The original full-length scale demonstrated high internal consistency across diverse cultural contexts, with Cronbach’s alpha = .93 and McDonald’s omega = .95, and the 12 items were found to load reliably onto four stable factors reflecting the theoretical dimensions.

While the original authors noted some limitations regarding cross-country measurement invariance, the scale showed strong psychometric properties overall. Our shortened version will focus only on the integrity aspect of trust to reduce participant burden. In the current study, participants will rate both Researcher X and Researcher Y on the following items using a 5-point Likert scale (e.g., 1 = very unethical, 5 = very ethical).

Integrity: “How ethical or unethical is Researcher X/Y?”
Integrity: “How sincere or insincere is Researcher X/Y?”
Integrity: “How honest or dishonest is Researcher X/Y to be transparent?”

These three items will be averaged into a composite integrity score.

Likeability

To assess likeability, the Likeability Scale (Reysen, 2005) was adapted and reduced to two items. Responses will be given using a 5-point Likert scale (1 = very unlikeable/unapproachable, 5 = very likeable/approachable) to the items “How approachable is Researcher X/Y” and “How likeable is Researcher X/Y?” These will be averaged into one likeability composite.
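The two composites described above are simple item means. The sketch below shows one way to compute them; the example ratings are invented, and averaging only the answered items when some are missing is an assumption, since the text does not specify how partial responses are handled.

```python
from statistics import mean
from typing import Optional, Sequence


def composite(items: Sequence[Optional[float]]) -> Optional[float]:
    """Average the answered items; return None if every item is missing.

    Assumption: missing items are skipped rather than imputed.
    """
    answered = [x for x in items if x is not None]
    return mean(answered) if answered else None


# Hypothetical ratings of one researcher on one vignette (5-point scales).
integrity_items = [4, 5, 4]   # ethical, sincere, honest
likeability_items = [2, 3]    # approachable, likeable

integrity_score = composite(integrity_items)
likeability_score = composite(likeability_items)
```

Each participant would contribute six composites per measure (two researchers across three vignettes), which matches the cluster size of six ratings used in the power analysis.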

Positive Attributes

To assess the perceived positive attributes of the researchers, a shortened version of the nine comparison questions from Ebersole et al. (2016) will be administered. These comparisons aim to capture general perceptions of researchers from a layperson perspective. Participants will choose between the investigator and the inspector on favorable professional and social attributes for these nine items from Ebersole et al. (2016):

Which researcher is smarter: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is a better researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is more creative: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher would you rather be: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher should you be: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is more like the most celebrated researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is more like the typical researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is more likely to keep a job: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Which researcher is more likely to get a job: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?

Trueness of the Finding

Perceived trueness of the original finding will be measured with a single item adapted from Ebersole et al. (2016). Participants will use a 5-point Likert scale (1 = very confident the findings are false, 3 = could go either way, 5 = very confident the findings are true) when responding to this item: “Based on what you know about the actions of Researcher X, how confident are you that their reported findings are true?”

Attention

An attention check will be used to assess whether participants paid attention during the experiment by requiring them to select a specific response to the question (i.e., “Select the option ‘somewhat disagree’ if you are paying attention.”). An exploratory analysis will test whether excluding participants who fail the attention check changes the results.

Procedure

To begin, participants will complete the informed consent form. Next, they will be shown the three vignettes in randomized order. After each vignette, participants will respond to items assessing integrity and likeability for Researcher X (investigator) followed by Researcher Y (inspector). Once they have completed those ratings for both the investigator and inspector, they will be prompted to select a researcher for each of a list of positive attributes and to rate the perceived trueness of the investigator’s findings. It is important to note that participants never see the terms “whistleblower,” “investigator,” or “inspector”; the scenarios use only the labels “Reported Researcher (X)” and “Reporting Researcher (Y).” At a random point between vignettes, participants will answer the attention check question.

Finally, participants will respond to basic demographic items such as age, gender, sexual orientation, religious affiliation, race, and political orientation. The whole procedure will be conducted online and will take about 20 minutes.

Analysis Plan

All research questions will be tested using three-level multilevel models to account for repeated observations, where individual ratings (Level 1) are nested within vignettes (Level 2), which are nested within participants (Level 3). In this context, a cluster refers to a grouping of observations that are more similar to each other than to observations in other groups due to shared sources of variance. Specifically, participant clusters consist of all ratings provided by a single participant, capturing individual tendencies such as general positivity or negativity biases, whereas vignette clusters consist of all ratings for a given vignette, capturing variance due to vignette-specific effects.

To control for participants who tend to rate everything similarly across questions, participant-level random intercepts will account for a person’s general positivity/negativity biases, reducing the chance that the results are driven by a halo bias. Random intercepts will be specified for both participants and vignettes, and random slopes will be added for within-participant predictors where possible. A null model, in which only the intercept is allowed to vary, will be run first to assess the variance accounted for by each cluster.

Model 0: Y_ijk = γ₀₀₀ + ν_0k + μ_0jk + ε_ijk.

Here, Y_ijk represents the DV rating for observation i from participant j within vignette cluster k; γ₀₀₀ is the grand mean rating; ν_0k is the random intercept for vignette (Level 2); μ_0jk is the random intercept for participant (Level 3); and ε_ijk is the rating-level residual error (Level 1). Intraclass correlations (ICCs) will be computed to partition variance across participants and vignettes.

Fixed effects will include Role (inspector = 1, investigator = –1; Level 2), QRP Severity (continuous; Level 1), and their interaction. If the full model fails to converge or is singular, alternative models with reduced random slopes will be compared via AIC/BIC and likelihood-ratio tests. The simplest model that converges and retains substantively meaningful variance components will be reported. All models will be estimated with restricted maximum likelihood (REML) for variance components and maximum likelihood (ML) for fixed-effect comparisons. Fixed effects will be interpreted using adjusted degrees of freedom (via lmerTest). Planned contrasts will be conducted with emmeans, with Bonferroni-adjusted alpha (α = 0.001) for the five primary tests of researcher differences within severity levels.

Intraclass correlations (ICCs). For planning purposes, this project assumes an intraclass correlation (ICC) of 0.50 (i.e., moderate between-person variance). The design effect is calculated as 1 + (m − 1) × ICC, with m = 6 observations per participant per DV; this design effect inflates naive sample-size requirements and is used in our analytic power calculations. For each DV, however, ICCs will be reported based on final model variance components, calculated as the proportion of variance attributable to participant clustering and, where applicable, vignette clustering (participant ICC = σ²_μ / (σ²_μ + σ²_ν + σ²_ε); vignette ICC = σ²_ν / (σ²_μ + σ²_ν + σ²_ε)). These values will be compared to the assumed ICC = 0.50 used in the a priori power analysis.
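The two ICC formulas above amount to dividing each cluster variance by the total variance. A minimal sketch follows, assuming the three variance components have already been extracted from a fitted model; the function name and the example values are illustrative only.

```python
def icc_from_variances(
    var_participant: float, var_vignette: float, var_residual: float
) -> dict[str, float]:
    """Partition total variance into participant and vignette ICCs.

    participant ICC = s2_mu / (s2_mu + s2_nu + s2_eps)
    vignette ICC    = s2_nu / (s2_mu + s2_nu + s2_eps)
    """
    total = var_participant + var_vignette + var_residual
    if total <= 0:
        raise ValueError("Total variance must be positive.")
    return {
        "participant": var_participant / total,
        "vignette": var_vignette / total,
    }


ASSUMED_ICC = 0.50  # planning value used in the a priori power analysis

# Hypothetical variance components, for illustration only.
iccs = icc_from_variances(var_participant=0.5, var_vignette=0.1, var_residual=0.4)
# A negative difference would mean the planning value was conservative.
difference_from_plan = iccs["participant"] - ASSUMED_ICC
```

Comparing the observed participant ICC with the planning value in this way shows whether the achieved sample was powered as intended.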

Validation and robustness checks. To ensure constructs are empirically distinct, the factor structure of the integrity and likeability measures will first be examined using exploratory and/or confirmatory factor analysis, reporting factor loadings and interfactor correlations. This will help assess the potential influence of halo effects (i.e., global positive/negative evaluations). Correlations among composites and internal consistency will also be inspected, with a minimum threshold for Cronbach’s alpha of α = .70. If any construct falls below this threshold, models will be tested at the item level (e.g., for the individual integrity or likeability items) to evaluate whether effects are robust across items rather than driven by a single measure.
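The internal-consistency check can be illustrated with the standard formula for Cronbach's alpha, α = k / (k − 1) × (1 − Σ item variances / variance of the summed score). The sketch below is a generic implementation with made-up ratings, not the project's analysis script.

```python
from statistics import variance
from typing import Sequence

ALPHA_THRESHOLD = 0.70  # minimum acceptable alpha stated in the analysis plan


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for k items, each a sequence of scores over respondents."""
    k = len(items)
    if k < 2:
        raise ValueError("At least two items are required.")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("All items need the same number of respondents.")
    # Summed score per respondent across the k items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical ratings from five respondents on three integrity items.
example_items = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [1, 3, 2, 5, 4],
]
alpha = cronbach_alpha(example_items)
meets_threshold = alpha >= ALPHA_THRESHOLD
```

If `meets_threshold` were false for a composite, the plan above would fall back to item-level models for that construct.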

As an additional robustness check, models will be re-run with available demographic covariates (e.g., gender, political orientation, religiosity) to test whether results hold across subgroups. While no formal hypothesis specifies moderating effects of demographics, reporting these analyses will clarify whether findings are consistent across participant characteristics.

Because the project design allows for the possibility of a crossover interaction at low severity levels, the Role × Severity interactions will be examined and plotted to test whether, at the lowest QRP severity level, inspectors are rated less favorably than investigators.

RQ1 Integrity and RQ2 Likeability

To examine research questions 1 (integrity) and 2 (likeability), a step-up modeling approach will be used to systematically test how each variable contributes to the model. Planned contrasts will compare inspectors and investigators at each severity level. If the integrity composite fails factor or reliability checks, items will be modeled separately with the same MLM structure. Given the possibility that integrity and likeability may need to be split, hypotheses will be considered fully supported if effects generalize across items and partially supported if effects hold only for some items.

Model 1: Integrity/Likeability_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + ν_0k + μ_0jk + ε_ijk

Model 2: Integrity/Likeability_ijk = γ₀₀₀ + γ₀₁₀ (Severity_k) + ν_0k + μ_0jk + ε_ijk

Model 3: Integrity/Likeability_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₀₁₀ (Severity_k) + ν_0k + μ_0jk + ε_ijk

Model 4: Integrity/Likeability_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₀₁₀ (Severity_k) + γ₁₁₀(Researcher_j × Severity_k) + ν_0k + μ_0jk + ε_ijk

In Model 4, γ₁₀₀ represents the between-vignette effect of researcher type (inspector vs. investigator), γ₀₁₀ represents the within-participant effect of severity, and γ₁₁₀ represents the cross-level interaction of researcher × severity. Random intercepts ν_0k capture variance attributable to vignettes (Level 2) and μ_0jk capture variance attributable to participants (Level 3).
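Because Role is effect-coded (inspector = 1, investigator = –1), the fixed part of Model 4 implies that the predicted inspector-minus-investigator gap at a given severity equals 2 × (γ₁₀₀ + γ₁₁₀ × Severity). The sketch below works through this arithmetic; the coefficient values and the 1/2/3 severity coding are hypothetical, chosen only to illustrate the pattern predicted for likeability.

```python
from typing import Optional

INSPECTOR, INVESTIGATOR = 1, -1  # effect coding stated in the analysis plan


def predicted_rating(g000: float, g100: float, g010: float, g110: float,
                     role: int, severity: float) -> float:
    """Fixed-effects prediction from Model 4 (random effects set to zero)."""
    return g000 + g100 * role + g010 * severity + g110 * role * severity


def role_gap(g100: float, g110: float, severity: float) -> float:
    """Predicted inspector-minus-investigator difference at one severity level."""
    return 2 * (g100 + g110 * severity)


def crossover_severity(g100: float, g110: float) -> Optional[float]:
    """Severity at which the role gap is zero; None if there is no interaction."""
    if g110 == 0:
        return None
    return -g100 / g110


# Hypothetical likeability coefficients; severity coded 1, 2, 3 (assumption).
gaps = {s: role_gap(g100=-0.5, g110=0.25, severity=s) for s in (1, 2, 3)}
flip_point = crossover_severity(g100=-0.5, g110=0.25)
```

With these illustrative values, the gap favors investigators at the lowest severity, vanishes at the middle level, and favors inspectors at the highest level, which is the crossover shape described in Hypothesis 2b.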

RQ3: Trueness of the Finding

To examine research question 3, the main effect of QRP severity on perceived trueness will be assessed, with planned contrasts comparing the severity levels.

Model FT1: Trueness_ijk = γ₀₀₀ + γ₀₁₀ (Severity_k) + ν_0k + μ_0j + ε_ijk

**Table 2.** Multilevel Model Analysis Plan
Research Question	Hypothesis	Sampling plan	Analysis Plan	Rationale for sensitivity decisions	Interpretation given different outcomes	Theory that could be shown wrong by the outcomes
R1: How is integrity influenced by role and QRP severity?	H1a: Scientific inspectors will be rated higher in Integrity than scientific investigators	Participants will be recruited through the University of Alabama participant pool and compensated with course credit. Participants will be excluded if they: 1) do not consent or 2) complete the study in less than seven minutes or 3) exhibit no variability in their responses (e.g., selecting all 5s or all 3s) across all trust and likeability items, and 4) fail to complete the trust and likeability ratings for at least one vignettes. Participants will also be recruited through prolific and BeSample if funding opportunities arise.	Model 1: Integrity_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + ν_0k + μ_0jk + ε_ijk Random intercepts for participants (Level 3) and vignettes (Level 2); random slopes for within-participant predictors where possible.	The study was powered based on the smallest effect size of interest, defined as differences in ethical perceptions of researcher roles, drawn from prior literature. Assuming a conservative ICC of.50, the required sample size was estimated at 566 participants to achieve 80% power for detecting a small effect size (d =.24, f² ≈.014) with Bonferroni-corrected α = 0.001 for five planned comparisons.	Main effect of role: higher integrity ratings for inspector vs. investigator	If there is no effect, then it stands to question theories pertaining to the role of QRPs on laypersons perceptions.
R1: How is integrity influenced by role and QRP severity?	H1b: Integrity gap widens with QRP severity		Model 4: Integrity_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₀₁₀ (Severity_k) + γ₁₁₀(Researcher_j × Severity_k) + ν_0k + μ_0jk + ε_ijk		Significant interaction: integrity difference between roles (inspector—investigator) increases with severity
R2: How is likeability influenced by role and QRP severity?	H2a: Scientific investigators will be rated as more likeable than scientific inspectors		Model 1: Likeability_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + ν_0k + μ_0jk + ε_ijk. Random intercepts for participants (Level 3) and vignettes (Level 2); random slopes for within-participant predictors where possible.		Main effect of role: higher likeability for investigator vs. inspector	If there is no effect on likeability, this would call into question theories on social perceptions of whistleblowers as well as theories pertaining to perceptions of “tattletales.”
	H2b: Likeability gap narrows (or even reverses) with the increase of QRP severity		Model 4: Likeability_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₀₁₀ (Severity_k) + γ₁₁₀(Researcher_j × Severity_k) + ν_0k + μ_0jk + ε_ijk		Significant interaction: Likeability difference between roles (inspector—investigator) decreases with severity.
R3: Is the perceived trueness of the finding influenced by QRP severity?	H3: Perceived trueness will decrease as QRP severity increases.		Model FT1: Trueness_ijk = γ₀₀₀ + γ₀₁₀ (Severity_k) + ν_0k + μ_0j + ε_ijk. Random intercepts for participants.		Main effect of severity: trueness decreases as severity increases	If there is no effect on perceived trueness, this would call into question theories on the ability to update scientific beliefs when faced with misinformation.
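
The exclusion rules in the sampling plan of Table 2 can be sketched as a small screening function. This is only an illustration in Python: the `Response` fields and the reason labels are hypothetical and do not correspond to any actual survey export. The fourth criterion is read here as "retain a participant only if the ratings for at least one vignette are complete", which is an assumption about the intended meaning.

```python
from dataclasses import dataclass

MIN_DURATION_MINUTES = 7.0  # threshold stated in the sampling plan


@dataclass
class Response:
    # Hypothetical field names; the real data file may be organised differently.
    consented: bool
    duration_minutes: float
    # One list of trust/likeability ratings per vignette; None marks a skipped item.
    ratings_by_vignette: list


def exclusion_reason(r):
    """Return the first exclusion criterion a response meets, or None to retain it."""
    if not r.consented:
        return "no consent"
    if r.duration_minutes < MIN_DURATION_MINUTES:
        return "completed too quickly"
    # Assumed reading of criterion 4: at least one vignette must be fully rated.
    complete = [v for v in r.ratings_by_vignette
                if v and all(x is not None for x in v)]
    if not complete:
        return "no complete vignette"
    # Criterion 3: identical answers on every answered trust/likeability item.
    answered = [x for v in r.ratings_by_vignette for x in v if x is not None]
    if len(set(answered)) == 1:
        return "no response variability"
    return None
```

Returning a reason rather than a boolean makes it straightforward to report how many participants were excluded under each criterion.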

Exploratory Analysis

Positive Attributes. To examine the positive attributes scores, an exploratory analysis will be conducted to view overall perceptions of researcher role. Participant descriptives will be calculated for each vignette and aggregated across vignettes. When examining researchers’ attributes for each individual vignette, the frequency with which each researcher role was chosen will be reported for each of the nine attribute questions. This will be reported for each of the three vignettes. When aggregating across vignettes, effects will be assessed using a step-up modeling approach for perceived researcher attributes. Planned contrasts will compare inspectors and investigators at each severity level.
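
The per-vignette frequency counts described above could be tabulated along the following lines. This is a minimal sketch in Python; the record layout and the four example records are made up purely for illustration and are not study data.

```python
from collections import Counter

# Hypothetical long-format records: (vignette, attribute, chosen role).
choices = [
    ("minor", "smarter", "inspector"),
    ("minor", "smarter", "investigator"),
    ("minor", "smarter", "inspector"),
    ("major", "smarter", "inspector"),
]


def role_frequencies(records):
    """Count how often each role was chosen per (vignette, attribute) pair."""
    table = {}
    for vignette, attribute, role in records:
        table.setdefault((vignette, attribute), Counter())[role] += 1
    return table


def aggregate_over_vignettes(records):
    """Collapse the vignette dimension: counts per attribute across all vignettes."""
    table = {}
    for _vignette, attribute, role in records:
        table.setdefault(attribute, Counter())[role] += 1
    return table


per_vignette = role_frequencies(choices)
overall = aggregate_over_vignettes(choices)
```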

Model A1: Attributes_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₁₀₁ (Smarter) + γ₁₀₂ (Ethical) + γ₁₀₃ (Better Researcher) + γ₁₀₄ (Creative) + γ₁₀₅ (Rather Be) + γ₁₀₆ (Should Be) + γ₁₀₇ (Celebrated) + γ₁₀₈ (Typical) + γ₁₀₉ (Job Security) + γ₁₁₀ (Job Outcome) + ν_0k + μ_0jk + ε_ijk

Model A2: Attributes_ijk = γ₀₀₀ + γ₁₀₁ (Smarter) + γ₁₀₂ (Ethical) + γ₁₀₃ (Better Researcher) + γ₁₀₄ (Creative) + γ₁₀₅ (Rather Be) + γ₁₀₆ (Should Be) + γ₁₀₇ (Celebrated) + γ₁₀₈ (Typical) + γ₁₀₉ (Job Security) + γ₁₁₀ (Job Outcome) + γ₀₁₀ (Severity_k) + ν_0k + μ_0jk + ε_ijk

Model A3: Attributes_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₁₀₁ (Smarter) + γ₁₀₂ (Ethical) + γ₁₀₃ (Better Researcher) + γ₁₀₄ (Creative) + γ₁₀₅ (Rather Be) + γ₁₀₆ (Should Be) + γ₁₀₇ (Celebrated) + γ₁₀₈ (Typical) + γ₁₀₉ (Job Security) + γ₁₁₀ (Job Outcome) + γ₀₁₀ (Severity_k) + ν_0k + μ_0jk + ε_ijk

Model A4: Attributes_ijk = γ₀₀₀ + γ₁₀₀ (Researcher_j) + γ₁₀₁ (Smarter) + γ₁₀₂ (Ethical) + γ₁₀₃ (Better Researcher) + γ₁₀₄ (Creative) + γ₁₀₅ (Rather Be) + γ₁₀₆ (Should Be) + γ₁₀₇ (Celebrated) + γ₁₀₈ (Typical) + γ₁₀₉ (Job Security) + γ₁₁₀ (Job Outcome) + γ₀₁₀ (Severity_k) + γ₁₁₁(Researcher_j × Severity_k) + ν_0k + μ_0jk + ε_ijk

In this model, γ₁₀₀ represents the within-participant effect of researcher type (inspector vs. investigator), γ₀₁₀ represents the vignette-level effect of severity, and γ₁₁₁ represents the cross-level interaction of researcher × severity. Random intercepts ν_0k capture variance attributable to vignettes (Level 2), and μ_0jk capture variance attributable to participants (Level 3).

Supporting Information:

Supporting information is available here: https://osf.io/vhmzg

References

Abraham, J., Mangapul, C. J., Amaniputri, D. N., Manurung, R. H., & Ispurwanto, W. (2023). Intention to whistleblow: Perception of reporting skill mediates the predicting role of class consciousness and perceived probability of revenge (No. 12:1566). F1000Research. https://doi.org/10.12688/f1000research.142265.1

Anderson, S. F., & Maxwell, S. E. (2017). Addressing the “Replication Crisis”: Using Original Studies to Design Replication Studies with Appropriate Statistical Power. Multivariate Behavioral Research, 52(3), 305–324. https://doi.org/10.1080/00273171.2017.1289361

Anvari, F., & Lakens, D. (2018). The replicability crisis and public trust in psychological science. Comprehensive Results in Social Psychology, 3(3), 266–286. https://doi.org/10.1080/23743603.2019.1684822

Arslan, R. C., Walther, M. P., & Tata, C. S. (2020). formr: A study framework allowing for automated feedback generation and complex longitudinal experience-sampling studies using R. Behavior Research Methods, 52(1), 376–387. https://doi.org/10.3758/s13428-019-01236-y

Artino, A. R., Driessen, E. W., & Maggio, L. A. (2019). Ethical Shades of Gray: International Frequency of Scientific Misconduct and Questionable Research Practices in Health Professions Education. Academic Medicine, 94(1), 76–84. https://doi.org/10.1097/ACM.0000000000002412

Bottesini, J. G., Rhemtulla, M., & Vazire, S. (2022). What do participants think of our research practices? An examination of behavioural psychology participants’ preferences. Royal Society Open Science, 9(4), 200048. https://doi.org/10.1098/rsos.200048

Chambers, C. (2017). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton University Press.

Cheliatsidou, A., Sariannidis, N., Garefalakis, A., Passas, I., & Spinthiropoulos, K. (2023). Exploring Attitudes towards Whistleblowing in Relation to Sustainable Municipalities. Administrative Sciences, 13(9), 199. https://doi.org/10.3390/admsci13090199

Cologna, V., Mede, N. G., Berger, S., Besley, J., Brick, C., Joubert, M., Maibach, E. W., Mihelj, S., Oreskes, N., Schäfer, M. S., van der Linden, S., Abdul Aziz, N. I., Abdulsalam, S., Shamsi, N. A., Aczel, B., Adinugroho, I., Alabrese, E., Aldoh, A., Alfano, M., … Zwaan, R. A. (2025). Trust in scientists and their role in society across 68 countries. Nature Human Behaviour, 1–18. https://doi.org/10.1038/s41562-024-02090-5

Ebersole, C. R., Axt, J. R., & Nosek, B. A. (2016). Scientists’ Reputations Are Based on Getting It Right, Not Being Right. PLOS Biology, 14(5), e1002460. https://doi.org/10.1371/journal.pbio.1002460

Fanelli, D. (2018). Is science really facing a reproducibility crisis, and do we need it to? Proceedings of the National Academy of Sciences, 115(11), 2628–2631. https://doi.org/10.1073/pnas.1708272114

Frankenhuis, W. E., & Nettle, D. (2018). Open Science Is Liberating and Can Foster Creativity. Perspectives on Psychological Science, 13(4), 439–447. https://doi.org/10.1177/1745691618767878

Ioannidis, J. P. A. (2022). Correction: Why Most Published Research Findings Are False. PLOS Medicine, 19(8), e1004085. https://doi.org/10.1371/journal.pmed.1004085

Kolstoe, S. (2023). Defining the Spectrum of Questionable Research Practices (QRPs). UK Research Integrity Office. https://doi.org/10.37672/UKRIO.2023.02.QRPs

Isbell, D. R., Brown, D., Chen, M., Derrick, D. J., Ghanem, R., Arvizu, M. N. G., Schnur, E., Zhang, M., & Plonsky, L. (2022). Misconduct and Questionable Research Practices: The Ethics of Quantitative Data Handling and Reporting in Applied Linguistics. The Modern Language Journal, 106(1), 172–195. https://doi.org/10.1111/modl.12760

Larsson, T., Plonsky, L., Sterling, S., Kytö, M., Yaw, K., & Wood, M. (2023). On the frequency, prevalence, and perceived severity of questionable research practices. Research Methods in Applied Linguistics, 2(3), 100064. https://doi.org/10.1016/j.rmal.2023.100064

LeBel, E. P., Campbell, L., & Loving, T. J. (2017). Benefits of open and high-powered research outweigh costs. Journal of Personality and Social Psychology, 113(2), 230–243. https://doi.org/10.1037/pspi0000049

Lubalin, J., Ardini, M.-A. E., & Matheson, J. (1995). Consequences of whistleblowing for the whistleblower in misconduct in science cases: Final report. Research Triangle Institute.

Lubalin, J. S., & Matheson, J. L. (1999). The fallout: What happens to whistleblowers and those accused but exonerated of scientific misconduct? Science and Engineering Ethics, 5(2), 229–250. https://doi.org/10.1007/s11948-999-0014-9

Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., Glennerster, R., Green, D. P., Humphreys, M., Imbens, G., Laitin, D., Madon, T., Nelson, L., Nosek, B. A., Petersen, M., Sedlmayr, R., Simmons, J. P., Simonsohn, U., & Van Der Laan, M. (2014). Promoting Transparency in Social Science Research. Science, 343(6166), 30–31. https://doi.org/10.1126/science.1245317

Monin, B., Sawyer, P. J., & Marquez, M. J. (2008). The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology, 95(1), 76–93. https://doi.org/10.1037/0022-3514.95.1.76

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie Du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021. https://doi.org/10.1038/s41562-016-0021

National Academy of Sciences, National Academy of Engineering, & Institute of Medicine. (2009). On Being a Scientist: A Guide to Responsible Conduct in Research: Third Edition. National Academies Press. https://doi.org/10.17226/12192

Nicholls, A. R., Fairs, L. R. W., Toner, J., Jones, L., Mantis, C., Barkoukis, V., Perry, J. L., Micle, A. V., Theodorou, N. C., Shakhverdieva, S., Stoicescu, M., Vesic, M. V., Dikic, N., Andjelkovic, M., Grimau, E. G., Amigo, J. A., & Schomöller, A. (2021). Snitches Get Stitches and End Up in Ditches: A Systematic Review of the Factors Associated with Whistleblowing Intentions. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.631538

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374

Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114

Pennycook, G., & Rand, D. G. (2021). The Psychology of Fake News. Trends in Cognitive Sciences, 25(5), 388–402. https://doi.org/10.1016/j.tics.2021.02.007

Reysen, S. (2005). Construction of a new scale: The Reysen Likability Scale. Social Behavior and Personality, 33(2), 201–208.

Schooler, J. W. (2014). Metascience could rescue the ‘replication crisis.’ Nature, 515(7525), 9–9. https://doi.org/10.1038/515009a

Sijtsma, K. (2016). Playing with Data—Or How to Discourage Questionable Research Practices and Stimulate Researchers to Do Things Right. Psychometrika, 81(1), 1–15. https://doi.org/10.1007/s11336-015-9446-0

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Sterling, S., Yaw, K., Plonsky, L., Larsson, T., & Kytö, M. (2025). Investigating researcher perceptions of Questionable Research Practices. Journal of Second Language Studies. https://doi.org/10.1075/jsls.00048.ste

Swift, J. K., Christopherson, C. D., Bird, M. O., Zöld, A., & Goode, J. (2022). Questionable research practices among faculty and students in APA-accredited clinical and counseling psychology doctoral programs. Training and Education in Professional Psychology, 16(3), 299–305. https://doi.org/10.1037/tep0000322

Uhlmann, E. L., Ebersole, C. R., Chartier, C. R., Errington, T. M., Kidwell, M. C., Lai, C. K., McCarthy, R. J., Riegelman, A., Silberzahn, R., & Nosek, B. A. (2019). Scientific Utopia III: Crowdsourcing Science. Perspectives on Psychological Science, 14(5), 711–733. https://doi.org/10.1177/1745691619850561

Wingen, T., Berkessel, J. B., & Englich, B. (2020). No Replication, No Trust? How Low Replicability Influences Trust in Psychology. Social Psychological and Personality Science, 11(4), 454–463. https://doi.org/10.1177/1948550619877412

Yong, E. (2017, April 5). How the GOP Could Use Science’s Reform Movement Against It. The Atlantic. https://www.theatlantic.com/science/archive/2017/04/reproducibility-science-open-judoflip/521952/

Appendix A Questionnaires

Response options varied by item, but all items used a 5-point Likert scale (e.g., 1 = very unethical, 2 = somewhat unethical, 3 = neither ethical nor unethical, 4 = somewhat ethical, 5 = very ethical).

Integrity (T) - Cologna et al., 2025
1. How ethical or unethical is Researcher X?
2. How ethical or unethical is Researcher Y?
3. How honest or dishonest is Researcher X?
4. How honest or dishonest is Researcher Y?
5. How sincere or insincere is Researcher X?
6. How sincere or insincere is Researcher Y?
Likeable (L) - Reysen, 2005
1. How likeable or unlikeable is Researcher X?
2. How likeable or unlikeable is Researcher Y?
Approachable (L) - Reysen, 2005
1. How approachable or unapproachable is Researcher X?
2. How approachable or unapproachable is Researcher Y?
The original effects (trueness) - Ebersole et al., 2016
1. Based on what you know about the actions of Researcher X, how confident are you that their reported findings are true?
Positive Attributes - Ebersole et al., 2016
1. Which researcher is smarter: Reported Researcher or Reporting Researcher?
2. Which researcher is a better researcher: Reported Researcher or Reporting Researcher?
3. Which researcher is more creative: Reported Researcher or Reporting Researcher?
4. Which researcher would you rather be: Reported Researcher or Reporting Researcher?
5. Which researcher should you be: Reported Researcher or Reporting Researcher?
6. Which researcher is more like the most celebrated researcher: Reported Researcher or Reporting Researcher?
7. Which researcher is more like the typical researcher: Reported Researcher or Reporting Researcher?
8. Which researcher is more likely to keep a job: Reported Researcher or Reporting Researcher?
9. Which researcher is more likely to get a job: Reported Researcher or Reporting Researcher?

Appendix B Vignettes

Minor Severity QRP (Overreaching Abstracts)

Researcher X conducted a study and found some interesting results. They submitted the findings to a journal for publication, but the title and abstract implied the study had much broader implications than the study truly did.

Researcher Y, after reading the abstract, writes a blog post highlighting how Researcher X’s misrepresentations can mislead readers about the scope and findings of the research. Researcher Y emphasizes that having imprecise titles and abstracts leads to research being inaccurately represented.

Moderate Severity QRP (Selective reporting)

Researcher Y writes a blog post presenting evidence on how Researcher X cherry-picked the findings they reported to make their results seem stronger. Researcher Y emphasizes that this lack of transparency misleads the scientific community and undermines the integrity of science.

Major Severity QRP (Data Fabrication)

Researcher X conducted a study and found some interesting results. Upon reviewing the data, Researcher Y noticed that Researcher X made up large sections of the dataset.

Researcher Y writes a blog post highlighting evidence of Researcher X’s fabricated data as a prime example of how this kind of fraud not only tarnishes the credibility of the research but can also lead to severe career repercussions for collaborators. Researcher Y emphasizes that data fabrication has devastating consequences for scientific credibility and public trust in science.

Declarations

Ethics

This study has been approved by the University of Alabama IRB (IRB #25-11-9199).

Competing Interests

The authors declare that no conflicts of interest exist.

Funding

Savannah Lewis received an honorarium for the submission of this research plan from the Lifecycle Journal. The honorarium awarded played no role in the study design, data collection, analysis, decision to publish, or submission preparation.

Author Contributions

SCL: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Visualization, Writing – original draft, Writing – review & editing. AMT: Conceptualization, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing.

Editors

Kathryn Zeiler
Editor-in-Chief

Kathryn Zeiler
Handling Editor

Editorial assessment

by Kathryn Zeiler

DOI: 10.70744/MetaROR.314.1.ea

The research plan proposes a vignette-based study examining how participants evaluate the trustworthiness and likeability of both “inspectors,” researchers who blow the whistle on colleagues who employ questionable research practices (QRPs), and “investigators,” those who commit QRPs. All three reviewers agree that the topic is timely and relevant to ongoing debates about research integrity and the social dynamics of scientific accountability, and that vignettes offer a reasonable way to investigate these questions. Reviewers diverge, however, on how close the current proposal is to being ready for data collection. Two reviewers characterize the submission as well-designed and offer largely minor suggestions, while the third identifies substantial conceptual and methodological concerns that would require significant revision. Common themes across all three reports include the need for a sharper conceptual framing of QRPs (including a clearer definition, better integration with the broader replication and whistleblowing literatures, and clarification of whose trust is being studied), concerns about the participant pool, and reservations about the use of a single vignette, which raises generalizability concerns.

The most consequential issue raised concerns the analysis plan: one reviewer argues that the proposed three-level nested model fails to reflect the data-generating process. Additional shared or complementary points include accessibility of open materials, the reliability of short measurement scales (especially the two-item likeability measure), the rationale for collecting sensitive demographic variables, and the clarity of role labels (i.e., “inspector” vs. “investigator”). Taken together, the reviews suggest that the project addresses a worthwhile and underexplored question and has a workable core design but that revisions to the theoretical framing, sampling strategy, measurement choices, and the statistical model would meaningfully strengthen the proposal before it proceeds to data collection.

Competing interests: None.

Peer review 1

Jelte Wicherts

DOI: 10.70744/MetaROR.314.1.rv1

In this Stage 1 RR submission, the authors present the plans of an interesting study of the effect of QRPs on the assessments of the integrity and likeability of whistleblowers and researchers and trust in findings. This is a well-designed study with relevance to the debate on questionable research practices and the means to lower them in research practice. The authors present a review of studies on QRP use and whistleblowing and provide a good rationale for the study, which uses vignettes as a straightforward way to study these thorny issues in a relatively safe manner.

The study is well designed and well suited to test the proposed hypotheses, and I offer only minor points to improve it. These minor points are given by page.

Page 5. “undermine the knowledge that is generated” should read “undermine the credibility of results”

In the introduction, it is not immediately clear whose trust is at play (researchers, the lay public, methodologists?)

The literature on QRPs is quite broad and covers many different fields and practices, some of which are not directly related to results but might involve issues like plagiarism or authorship disputes. It should be clear which types of practices are relevant here and who assesses their impact on the trustworthiness (or replicability perhaps) of results.

Page 6. The percentages appear to be based on self-report surveys with a host of methodological challenges including response biases and sampling and the use of varying questions.

“While these practices….commit them” I did not understand this sentence.

Page 7. Perhaps it should be mentioned that the barriers to whistleblowing also include many institutional factors, including journals and academic institutions often being unwilling to investigate potential misconduct.

Page 8. The use of “how” suggests that the study tries to dive into the social and psychological mechanisms of the relations of interest. Although the authors do provide some ideas about these, the hypotheses themselves are quite descriptive and do not dive very deep into social or psychological theories.

Page 9. Please provide more information on the characteristics of the planned sample (undergraduates?). Why was such a sample chosen (e.g., they are experienced and interested readers of research)?

What is the basis of the ICC values chosen here? Is there any similar research that speaks to these possible values?

Page 11. Why were only two items chosen for likeability? A 2-item scale is unlikely to be very reliable and perhaps an existing longer scale would be better.

Page 12. Perhaps it is better to use a measure of trueness that has more answer options (e.g., a 0-100 slider).

Page 13. What is the reason for asking (sensitive) questions on age, gender, sexual orientation, religion, and political orientation? This only serves to heighten identifiability risk and it is unclear how they feature in exploratory analyses.

Page 14. It is perhaps better to use more items for the scales and to register a plan to discard poorly functioning items to heighten their reliabilities.

Competing interests: None.

Peer review 2

Robin Brooker

DOI: 10.70744/MetaROR.314.1.rv2

This is a highly interesting, timely, and novel study that makes a tangible and significant contribution to the field. It presents an investigation into whistleblowing in science, an area that remains relatively underexplored and underdeveloped.

Introduction:

There are other key and recent estimates of QRP engagement that should at least be cited: Gopalakrishna et al. (2022) and Schneider et al. (2024).
It might be useful to separate the introduction into particular subsections to add coherence, flow and structure. For example, separate discussions of QRPs from whistleblowing.

Method:

The reliance on a university participant pool raises concerns about sampling bias, as such pools typically overrepresent students. This may constrain the external validity of the findings and should be acknowledged as a limitation. To address this limitation, the authors should consider diversifying recruitment beyond the university participant pool. At a minimum, they should provide a full demographic breakdown of the sample and explicitly acknowledge the likely overrepresentation of students as a limitation affecting external validity.
I commend you on your a priori power analysis – detailed and comprehensive.
Although the authors provide a theoretical rationale for classifying the vignettes as minor, moderate, and major QRPs, this does not establish that participants perceived the scenarios as differing in severity as intended. Because only one scenario appears to be used at each severity level, differences between conditions may reflect idiosyncratic features of the specific vignettes rather than perceived QRP severity. I would therefore recommend including a manipulation check assessing perceived severity, or at minimum acknowledging the absence of such a check as a limitation.
Table 2 is comprehensive and provides a useful overview of the study materials. However, it contains a large amount of information and is therefore somewhat difficult to follow. The authors might consider simplifying the table, splitting it into multiple tables, or adding clearer visual structure to improve readability.

References:

Gopalakrishna G, ter Riet G, Vink G, Stoop I, Wicherts JM, Bouter LM (2022) Prevalence of questionable research practices, research misconduct and their potential explanatory factors: A survey among academic researchers in The Netherlands. PLoS ONE 17(2): e0263023. https://doi.org/10.1371/journal.pone.0263023
Schneider JW, Allum N, Andersen JP, Petersen MB, Madsen EB, Mejlgaard N, et al. (2024) Is something rotten in the state of Denmark? Cross-national evidence for widespread involvement but not systematic use of questionable research practices across all fields of research. PLoS ONE 19(8): e0304342. https://doi.org/10.1371/journal.pone.0304342

Competing interests: None.

Peer review 3

Tamás Nagy

DOI: 10.70744/MetaROR.314.1.rv3

Summary and overall evaluation

This Stage 1 registered report addresses an interesting and potentially important question: how individuals evaluate researchers who report questionable research practices (QRPs). The topic has clear relevance to ongoing discussions about research integrity and the social dynamics of scientific accountability.

However, in its current form, the manuscript is underdeveloped and contains several substantial conceptual and methodological issues. The theoretical framing lacks precision and depth, the design raises concerns about validity and generalizability, and the analysis plan includes a major misspecification. While the core idea has merit, the study as currently proposed lacks ambition and would require significant revision to meet the standards of a strong registered report.

General comments

Open materials: I was unable to access the materials via the provided links. This should be corrected, as transparency is a core requirement for registered reports.

Introduction

The introduction would benefit from substantial revision in terms of both content and structure.

First, the broader context of QRPs and the replication crisis is underdeveloped. A stronger motivation should situate the study within current evidence on replicability. For example, recent large-scale efforts such as the SCORE project suggest that only about half of findings replicate (see Sutherland et al., 2026). It would also be worth noting that these estimates are based on top-tier journals and may represent an upper bound on replicability.

Second, the manuscript lacks a clear and formal definition of QRPs. This is not a minor issue: how QRPs are defined (e.g., whether intentionality is required) directly determines what participants are evaluating. Providing a theoretically grounded definition (see Nagy et al., 2025) would substantially improve clarity.

Relatedly, the opening paragraph conflates conceptually distinct issues. It mixes QRPs (e.g., p-hacking, HARKing, optional stopping) with broader methodological limitations (e.g., small sample sizes) and more general features of the research process (e.g., analytic flexibility). These are not equivalent categories, and failing to distinguish them weakens the conceptual foundation of the study.

The discussion of whistleblowing is also somewhat misaligned with the research question. The “hero vs. traitor” dilemma is not specific to scientific misconduct, yet the manuscript treats it largely in isolation. It would strengthen the theoretical grounding to integrate findings from the broader whistleblowing literature (see e.g., Brotzeller et al., 2025). At the same time, the current text places relatively heavy emphasis on the personal costs of whistleblowing, which seems less relevant than attitudes toward whistleblowers per se.

The treatment of QRP prevalence also needs refinement. While the manuscript refers to the “10 most frequently reported QRPs,” this framing is misleading, as existing surveys typically include only a limited and ad hoc subset of practices (Lakens, 2022). This limitation should be acknowledged (see Nagy et al., 2025).

Stylistically, the introduction would benefit from improved coherence. Some paragraphs are only loosely connected, and the overall argument lacks a clear progression. Adding structure (e.g., subheadings) and stronger transitions would improve readability.

Finally, there is a likely typo or logical error in the hypotheses section: “inspectors will be rated as higher on integrity than inspectors”.

Method

There are several issues in the design and measurement that should be addressed.

First, the terminology used for the two roles is potentially confusing. For non-native speakers in particular, inspector and investigator can be semantically close. It would be preferable to use more clearly distinguishable labels (e.g., “accused researcher” vs. “reporting researcher”).

Second, the rationale for the key moderation hypothesis (Hypothesis 2b) is not sufficiently developed in the introduction. The predicted crossover interaction in likeability judgments requires stronger theoretical justification.

Third, the structure of the method section could be improved. It would be more conventional to describe the procedure before introducing the materials.

Fourth, the vignette design raises concerns. The QRPs, their descriptions, and their intended severity levels should be presented in a table for transparency. More importantly, having only a single vignette per (QRP × severity) condition is a serious limitation. Any observed effects may be driven by idiosyncratic features of specific vignettes rather than the intended manipulation. Including multiple vignettes per condition would allow for more reliable operationalization and better generalization beyond specific scenarios.

Fifth, in the “Positive Attributes” measure, it would be preferable to use more clearly differentiated labels for the two researchers to avoid confusion.

Finally, the attention check is not specified as an exclusion criterion in the participants section. This decision should be preregistered, as post hoc flexibility in exclusions can bias results.

Analysis plan

The analysis plan contains a major error and should be substantially revised.

The authors propose a three-level nested model (ratings within vignettes within participants), but this does not reflect the data-generating process. In the design, each participant rates multiple vignettes, and each vignette is rated by multiple participants. Thus, vignettes are not nested within participants; they are crossed with participants. Treating vignettes as a Level 2 unit implicitly assumes either (a) that each vignette belongs to a single participant, or (b) that vignette effects are participant-specific—both of which are incorrect.

This mis-specification is problematic because it typically leads to underestimated standard errors and inflated Type I error rates by ignoring uncertainty associated with the stimulus sample. Conceptually, the design involves repeated measures within participants and repeated use of stimuli across participants, and should therefore be modeled using crossed random effects.

A more appropriate random-intercept specification (in lme4 syntax) would be:

(1 | participant) + (1 | vignette)

Moreover, the inclusion of a separate “measurement” level is not well justified. Unless there are multiple repeated measurements of the same outcome per vignette within participant, there is no clear need for this additional level. If integrity, likeability, and related constructs are distinct outcomes, they would typically be analyzed in separate models (or in a multivariate framework), rather than incorporated via an extra hierarchical level.

The section discussing intraclass correlations (ICCs) is also unnecessary in this context. ICCs can be derived from the variance components of the fitted model and are primarily relevant for planning purposes (e.g., power analysis), not for specifying the model itself.

Finally, I would strongly recommend that the authors test their analysis plan—either using pilot data or simulated data—before revising this section. Given the current issues, this step would likely reveal additional problems and help ensure that the model aligns with the design.
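As a rough illustration of the kind of simulation check recommended here, the following minimal Python sketch simulates a null world (no true severity effect) with one vignette per condition and analyzes it with a by-participant paired t-test that ignores vignette variability. Everything in it is an assumption chosen for illustration: the function name, the variance values, the sample size, and the use of a paired t-test as a simplified stand-in for a model without a vignette random effect. It is not the authors' planned model, and a real check would fit the crossed-effects specification to simulated data in the authors' own software.

```python
import math
import random
import statistics


def false_positive_rate(vignette_sd, n_participants=100, n_sims=500, seed=1):
    """Share of simulated null studies in which a paired t-test that ignores
    vignette variability flags a 'severity' difference as significant.

    All numeric defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    # Approximate two-sided 5% critical value for df = 99,
    # so this threshold assumes the default n_participants = 100.
    t_crit = 1.984
    hits = 0
    for _ in range(n_sims):
        # One vignette per condition; each carries its own idiosyncratic
        # offset, but there is NO true difference between conditions.
        v_a = rng.gauss(0.0, vignette_sd)
        v_b = rng.gauss(0.0, vignette_sd)
        diffs = []
        for _ in range(n_participants):
            p = rng.gauss(0.0, 1.0)  # participant random intercept
            y_a = p + v_a + rng.gauss(0.0, 1.0)  # rating in condition A
            y_b = p + v_b + rng.gauss(0.0, 1.0)  # rating in condition B
            diffs.append(y_a - y_b)
        se = statistics.stdev(diffs) / math.sqrt(n_participants)
        t = statistics.mean(diffs) / se
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sims


# Baseline: vignettes contribute no variance.
rate_no_vignette_variance = false_positive_rate(vignette_sd=0.0)
# Vignettes differ idiosyncratically, but the analysis ignores this.
rate_with_vignette_variance = false_positive_rate(vignette_sd=0.5)
print(rate_no_vignette_variance, rate_with_vignette_variance)
```

Under these assumed values, the first rate should stay close to the nominal 5% level, whereas the second should be well above it, because idiosyncratic vignette differences are mistaken for a condition effect. With a single vignette per condition the two cannot be separated at all, which is why this check bears on both the stimulus-sampling concern and the model specification.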

Final assessment

In its current form, the manuscript is not yet suitable as a Stage 1 registered report. The topic is worthwhile, but the theoretical framing lacks clarity, the design has important limitations (particularly regarding stimulus sampling), and the analysis plan contains a critical error. Substantial revision is required to bring the proposal to a level where it can provide a rigorous and informative test of the research questions.

References

Brotzeller, F., van Houwelingen, G., Gollwitzer, M., & Fischer, M. (2025). Motive attributions shape judgments of whistleblowers’ moral characters. Personality & Social Psychology Bulletin. https://doi.org/10.1177/01461672251340111

Lakens, D. (2022). Improving Your Statistical Inferences. https://doi.org/10.5281/zenodo.6409077

Nagy, T., Hergert, J., Elsherif, M. M., Wallrich, L., Schmidt, K., Waltzer, T., Payne, J. W., Gjoneska, B., Seetahul, Y., Wang, Y. A., Scharfenberg, D., Tyson, G., Yang, Y.-F., Skvortsova, A., Alarie, S., Graves, K., Sotola, L. K., Moreau, D., & Rubínová, E. (2025). Bestiary of questionable research practices in psychology. Advances in Methods and Practices in Psychological Science, 8(3). https://doi.org/10.1177/25152459251348431

Sutherland, M. E., Smith, H., & Bray, N. (Eds.). (2026). Reliable research in the social and behavioural sciences. Springer Nature. https://www.nature.com/collections/idajfifcfg

Competing interests: None.
