Introduction
The replication crisis in psychology has highlighted shortcomings in common research practices that undermine the knowledge that is generated. These questionable research practices (QRPs), such as manipulating analyses to achieve statistical significance (p-hacking), hypothesizing after results are known (HARKing), prematurely examining data, using small samples, and providing insufficient transparency, have been identified as key contributors to low replicability among scientific findings (Ioannidis, 2022; Nosek et al., 2018; Chambers, 2017; Sijtsma, 2016; Simmons et al., 2011).
The growing attention to the replication crisis has also sparked a range of perspectives on the roles of researchers. While some may view those who highlight scientific shortcomings as individuals who uphold scientific integrity, others might see them as disloyal or damaging to the reputation of the field. Similarly, researchers engaging in QRPs may be judged as unethical or, alternatively, as individuals navigating systemic pressures. Understanding how these figures are perceived in relation to each other can provide insight into how trust in psychological science is shaped.
Importantly, not all QRPs are regarded equally, which may influence how researchers are judged. For example, making up data or committing fraud is often seen as a high-severity offense and can lead to serious consequences, such as losing one’s job or academic degree. In contrast, QRPs like overgeneralizing findings or using small sample sizes tend to result in much lighter (or sometimes no) consequences, such as a commentary published about the paper or, if noticed during the review process, a request to revise the writing or collect more data.
While the consequences for QRPs vary and are often inconsistent, researchers are now beginning to rank the prevalence and perceived severity of these practices systematically (Larsson et al., 2023; Bottesini et al., 2022). Overall, Larsson et al. (2023) found that the QRPs researchers perceived as more severe tended to occur less frequently, whereas more common practices were viewed as less severe. Notably, HARKing was among the ten most frequently reported QRPs, whereas p-hacking, which was rated as highly severe, was among the ten least frequently reported.
Meanwhile, Bottesini et al. (2022) examined participants’ perceptions of HARKing, selective reporting, p-hacking, and data fraud. They found that HARKing, selective reporting, and p-hacking were rated unacceptable roughly 68–69% of the time (68.7%, 69.2%, and 68.3%, respectively), whereas data fraud was deemed unacceptable by 81.3% of participants. These results suggest that p-hacking and HARKing were perceived as more similar in severity in Bottesini et al.’s sample than in Larsson et al.’s (2023) sample, which in turn suggests that the prevalence of a QRP may influence how it is perceived.
Broadly, the literature suggests that QRPs are more common than many would expect. Swift et al. (2022) report frequencies of 65% for faculty and 50% for students. Other studies are even more pessimistic, with more than 90% of their samples (90.3%, 94%, and 96%) admitting to engaging in at least one QRP (Artino et al., 2019; Isbell et al., 2022; Larsson et al., 2023).
QRPs undermine the trust that science seeks to build with the public, with policymakers, and with future generations of researchers (Wingen et al., 2020; Anvari & Lakens, 2019). When published findings are biased, manipulated, or even fraudulent, researchers risk wasting time and resources exploring theories that may have little support. Conversely, researchers may also spend significant effort correcting exaggerated or incorrect claims. That is why it is essential to ensure our field produces reliable, credible research while weighing the need to correct misinformation or fraudulent research.
Despite growing recognition of how QRPs threaten the credibility of science, people do not always revise their beliefs when confronted with evidence that a result may be unreliable. Research on misinformation demonstrates that belief updating is difficult, even when corrections are clear and well-supported. False or misleading information can continue to influence memory and judgment long after it has been corrected because familiarity and source credibility bias how new information is integrated (Pennycook & Rand, 2021). Translating this insight to scientific contexts raises a critical question: when individuals learn that a researcher engaged in a questionable practice, do they adjust their belief in the validity of the original finding, or do they view the researcher’s behavior as unrelated to the result itself? In the current study, it is predicted that participants will recognize the connection between the original finding and the identified QRP, such that perceived trueness of the finding will decrease as QRP severity increases.
In an effort to minimize QRPs, the field has increasingly adopted responsible research practices (RRPs), which promote transparency, replicability, and robustness in research findings (Schooler, 2014; Uhlmann et al., 2019; Anderson & Maxwell, 2017; Nosek et al., 2015; Munafò et al., 2017; Miguel et al., 2014; Frankenhuis & Nettle, 2018; LeBel et al., 2017). While these practices do not completely eliminate QRPs, they do introduce barriers that make it more difficult, or less advantageous, to commit them. However, these barriers lose much of their efficacy unless members of the scientific community actively monitor and examine each other’s work.
With the shift toward RRPs, scientists who take on this responsibility of monitoring, investigating, and “calling out” QRPs often gain access to data or materials that allow them to detect and report misconduct in ways that mirror the role of internal whistleblowers in traditional workplaces. The International Anti-Corruption Academy highlights four key characteristics of whistleblowing: it typically involves 1) wrongdoings connected to the workplace, 2) ethical, legal, or safety violations, 3) a decision to report, and 4) a concern for public interest (Scaturro, 2018). Taking these four components into consideration, this paper defines a scientific inspector as an individual who raises concerns about discrepancies, misconduct, or actions that compromise the integrity of scientific research, and reports these concerns to an appropriate authority, either externally or internally.
Although inspectors’ work can increase the quality of research that informs policy, treatments, funding, and future work (Fanelli, 2018; Wingen et al., 2020; Yong, 2017), they also receive negative labels, like “data police” or “vigilantes.” This may be because they are sometimes perceived as disloyal or disruptive rather than as individuals upholding scientific standards (Cheliatsidou et al., 2023). The literature reveals that these individuals, across disciplines, often face professional retaliation, social isolation, and emotional distress as a result of their disclosures. According to self-reports compiled by Lubalin and Matheson (1999), over 60% of scientific whistleblowers experienced at least one negative consequence, including being pressured to withdraw their allegation, ostracized by colleagues, threatened with lawsuits, or subjected to reductions in research support. Approximately 10% reported significant career consequences, such as being fired or losing critical funding. Notably, however, fewer than 18% of those who experienced the most severe career impacts stated they would be unwilling to report misconduct again. These findings underscore both the personal cost and the moral conviction associated with whistleblowing.
Lubalin et al. (1995) demonstrated that, while whistleblowers tend to experience more immediate professional retaliation, individuals who are accused (and eventually exonerated) suffer worse long-term personal outcomes, such as poor mental or physical health. Together, these findings highlight that both roles endure negative consequences, though of different kinds: the whistleblower faces more social consequences, while the accused researcher may experience more personal consequences. Lubalin et al. (1995) also suggest that whistleblowers are viewed less positively than their accused counterparts while the investigation or reporting process is under way. These findings matter for how a whistleblower will be perceived socially (i.e., in terms of integrity or likeability) during the process of blowing the whistle. This issue is particularly pressing for the growing number of early-career scientists, who may lack the institutional power or support to navigate the potential fallout of whistleblowing.
Much of the existing literature has focused on the act of whistleblowing itself, examining individuals’ intentions to report misconduct (Abraham et al., 2023), the frequency of disclosures (Artino et al., 2019), barriers to reporting and strategies for overcoming them (Devine & Reaves, 2016), and general attitudes that discourage or enable whistleblowers (Cheliatsidou et al., 2023). However, research specifically on scientific whistleblowing remains limited and does not adequately explore the factors that influence how whistleblowers and wrongdoers are perceived by laypeople. This study addresses that gap by directly comparing perceptions of investigators and inspectors and by examining the moderating role of QRP severity.
Although trustworthiness and likeability are often positively related, prior work suggests they can diverge in contexts involving norm enforcement. For example, Monin et al. (2008) found that “moral rebels” (i.e., those who refuse to go along with morally questionable tasks) are judged as principled and trustworthy but also tend to be disliked. Similarly, Cheliatsidou et al. (2023) report that whistleblowers are recognized as upholding ethical standards yet are often seen as disloyal or disruptive. These findings suggest that inspectors may be evaluated as more trustworthy because of their integrity, but less likeable because their actions generate social tension. Therefore, this study predicts that inspectors will be rated higher on integrity than investigators, but that this advantage will be reduced (or even reversed) for ratings of likeability. These trends will shift in favor of the inspectors as QRP severity increases. In other words, as QRPs become more severe, ratings of the integrity and likeability of inspectors relative to investigators should increase. In this way, integrity reflects judgments about a researcher’s reliability and honesty, whereas likeability reflects social ease and loyalty.
By investigating how integrity and likeability vary based on role and QRP severity, this study expands existing knowledge in two key ways. First, it introduces a novel framework for examining scientific whistleblowing outside of industrial contexts. Second, it explores how the nature of the offense interacts with perceived roles to shape social judgments in science. In doing so, this work contributes to a more nuanced understanding of how ethical accountability is recognized and rewarded, or not, by the broader public. Lastly, this project was modeled on Ebersole et al. (2016), and their scale of attributes is therefore included for an exploratory analysis. As the scientific community continues to emphasize transparency and responsible research practices, understanding the perceptions of scientific investigators and inspectors becomes crucial to fostering an environment where ethical accountability is supported rather than punished.
Method
The current study
The present study employs a 2 (role: investigator vs. inspector) × 3 (QRP severity: minor vs. moderate vs. major) fully within-subjects design. All participants will read three vignettes, one for each level of severity, that describe an investigator committing a QRP, and an inspector reporting it. For each vignette, perceptions of the integrity, likeability, and positive attributes of both researchers will be assessed as well as the perceived trueness of the original finding.
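Research Question 1: How is integrity influenced by role and QRP severity?
- Hypothesis 1a: Inspectors will be rated higher in integrity than investigators.
- Hypothesis 1b: The integrity gap between inspectors and investigators will widen as QRP severity increases.
Research Question 2: How is likeability influenced by role and QRP severity?
- Hypothesis 2a: Investigators will be rated as more likeable than inspectors.
- Hypothesis 2b: At low QRP severity, investigators will be rated as more likeable than inspectors, but at high QRP severity, inspectors will be rated as more likeable than investigators.
Research Question 3: Is the perceived trueness of the finding influenced by QRP severity?
- Hypothesis 3: Perceived trueness will decrease as QRP severity increases.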
Participants
Participants will be recruited through the University of Alabama participant pool and compensated with course credit. Participants will be excluded if they (1) do not consent, (2) complete the study in less than seven minutes, (3) exhibit no variability in their responses (e.g., selecting all 5s or all 3s) across all trust and likeability items, or (4) fail to complete the trust and likeability ratings for at least one vignette.
A priori power analyses were conducted using a multilevel modeling framework that accounted for clustering of repeated vignette ratings within participants. The design effect was calculated as a function of cluster size (6 ratings per participant) and varying intraclass correlation coefficients (ICCs). The ICC reflects how similar ratings from the same individual are to one another. When a person answers similarly across vignettes, the ICC is higher, meaning each additional rating provides less unique information and more participants are needed. When responses differ more across vignettes, the ICC is lower and each rating provides more unique information, which lowers the required number of participants. Required sample sizes ranged from 324 participants at ICC = .20 to 566 participants at ICC = .50 (see Table 1). Assuming a conservative ICC of .50, the required sample size was estimated at 566 participants to achieve 80% power for detecting a small effect size (d = .24, f² ≈ .014) with Bonferroni-corrected α = 0.001 for five planned comparisons. Accordingly, the recruitment target is set at 566 usable participants. Data collection will continue until this threshold is reached or until March 2026, whichever occurs first.
Table 1. A priori power analysis
| Effect size | ICC | Total Long Rows | Participants Needed |
| --- | --- | --- | --- |
| d = .24, f² ≈ .014 | 0.20 | 1938 | 324 |
| d = .24, f² ≈ .014 | 0.35 | 2665 | 445 |
| d = .24, f² ≈ .014 | 0.40 | 2907 | 485 |
| d = .24, f² ≈ .014 | 0.45 | 3150 | 525 |
| d = .24, f² ≈ .014 | 0.50 | 3392 | 566 |
Note. Required sample size estimates were calculated using a multilevel design-effect approach with cluster size = 6 ratings per participant, α = .001 (Bonferroni-corrected), and 80% power to detect a small effect size (d = .24, f² ≈ .014).
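The design-effect adjustment behind Table 1 can be sketched roughly as follows. This is a minimal illustration, assuming the standard formula 1 + (m − 1) × ICC with m = 6 ratings per participant; the function names and the unadjusted requirement of 1,000 observations are hypothetical placeholders and are not values taken from the power analysis itself.

```python
import math


def design_effect(cluster_size: int, icc: float) -> float:
    """Design effect for clustered ratings: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def participants_needed(naive_n: float, cluster_size: int, icc: float) -> int:
    """Inflate an unadjusted number of observations and convert it to participants.

    naive_n is the number of independent observations a conventional power
    analysis would require (a hypothetical input in this sketch).
    """
    total_rows = naive_n * design_effect(cluster_size, icc)
    # Each participant contributes cluster_size rows; round up to whole people.
    return math.ceil(total_rows / cluster_size)


if __name__ == "__main__":
    # ICC values from Table 1; m = 6 ratings per participant.
    for icc in (0.20, 0.35, 0.40, 0.45, 0.50):
        print(f"ICC = {icc:.2f}: design effect = {design_effect(6, icc):.2f}")
    # Hypothetical unadjusted requirement of 1,000 observations at ICC = .50.
    print(participants_needed(1000, 6, 0.50))
```

The sketch shows why the required sample grows with the ICC: the more alike a participant's six ratings are, the larger the multiplier applied to the unadjusted requirement.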
Materials
All materials, our initial proposal, and our pre-registration can be found on the project page at https://osf.io/dvpt8/. Ethics approval, data, analysis code, and codebook will be added once available. Participants will be guided to a survey link created with the open-source software formr (Arslan et al., 2020; https://diss-ss-ss.formr.org).
Vignettes
Each participant will read three vignettes. The vignettes were created using scenarios from Ebersole and colleagues (2016) as a guide. Ebersole and colleagues used a similar study design to compare replicators and original researchers. The current project’s vignettes adopt this same structure, but instead of conducting a replication, the second researcher conducts a robustness check or review and then reports their concerns to the original researcher’s institution. All vignettes begin with a sentence about Researcher X (the scientific investigator) and continue with a sentence about the QRP they engaged in. The remaining sentences describe Researcher Y (the scientific inspector) writing a blog post that provides evidence for their suspicions about Researcher X, along with a statement on the implications of the QRP. The selective reporting vignette illustrates this structure:
Researcher X conducted a study and found some interesting results. Researcher Y reviewed the data and realized that Researcher X only reported the findings that confirmed their hypotheses, and failed to mention other findings that challenged their hypotheses. Researcher Y writes a blog post presenting evidence on how Researcher X cherry-picked the findings they reported to make their results seem stronger. Researcher Y emphasizes that this lack of transparency misleads the scientific community and undermines the integrity of science.
Vignettes vary by severity of QRP. There is one scenario at each level: overreaching abstract (minor), selective reporting (moderate), and data fabrication (major). Severity was informed by a combination of theoretical and empirical sources. Kolstoe’s (2023) spectrum conceptualizes QRPs as a continuum that runs from minor error through misunderstanding, sloppiness, incompetence, falsification, and fabrication to criminal misconduct. Accordingly, the overreaching abstract was classified as minor, as it represents misunderstanding or sloppiness.
To determine the moderate QRP, evidence from Larsson et al. (2023) was used. They reported a mean severity of 4.3 (on a 5-point scale) for selective reporting, cherry-picking, and selective analytical choices, which suggested these QRPs were moderately severe. The last level of QRP drew on both Bottesini et al. (2022) and Kolstoe (2023). Bottesini et al. (2022) found that data fraud was deemed unacceptable by 81.3% of participants. These results led to placing selective reporting in the moderate category and data fraud in the major category. Consistent with Kolstoe’s spectrum of questionable research practices, the present vignette also places data fraud on the more serious (falsified or fraudulent) end of the continuum.
Integrity
To assess the perceived integrity of each researcher (investigator and inspector), the integrity section from the trust in scientists scale developed by Cologna and colleagues (2025) will be utilized. This scale is grounded in theoretical and empirical work conceptualizing trust as a multidimensional construct composed of four key dimensions: competence, benevolence, integrity, and openness. The original full-length scale demonstrated high internal consistency across diverse cultural contexts, with Cronbach’s alpha = .93 and McDonald’s omega = .95, and the 12 items were found to load reliably onto four stable factors reflecting the theoretical dimensions.
While the original authors noted some limitations regarding cross-country measurement invariance, the scale showed strong psychometric properties overall. Our shortened version will focus only on the integrity aspect of trust to reduce participant burden. In the current study, participants will rate both Researcher X and Researcher Y on the following items using a 5-point Likert scale (e.g., 1 = very unethical, 5 = very ethical).
- Integrity: “How ethical or unethical is Researcher X/Y?”
- Integrity: “How sincere or insincere is Researcher X/Y?”
- Integrity: “How honest or dishonest is Researcher X/Y to be transparent?”
These three items will be averaged into a composite integrity score.
Likeability
To assess likeability, the Likeability Scale (Reysen, 2005) was adapted and reduced to two items. Responses will be given using a 5-point Likert scale (1 = very unlikeable/unapproachable, 5 = very likeable/approachable) to the items “How approachable is Researcher X/Y” and “How likeable is Researcher X/Y?” These will be averaged into one likeability composite.
Positive Attributes
To assess the perceived positive attributes of the researchers, a shortened version of the comparison questions from Ebersole et al. (2016) will be administered. These comparisons aim to capture general perceptions of researchers from a layperson perspective. Participants will be asked to choose between the inspector and the investigator on favorable professional and social attributes for these nine items from Ebersole et al. (2016):
- Which researcher is smarter: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is a better researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is more creative: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher would you rather be: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher should you be: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is more like the most celebrated researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is more like the typical researcher: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is more likely to keep a job: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
- Which researcher is more likely to get a job: Reported Researcher, Neither Reported Researcher nor Reporting Researcher, or Reporting Researcher?
Trueness of the Finding
Perceived trueness of the original finding will be measured with a single item adapted from Ebersole et al. (2016). Participants will use a 5-point Likert scale (1 = very confident the findings are false, 3 = could go either way, 5 = very confident the findings are true) when responding to this item: “Based on what you know about the actions of Researcher X, how confident are you that their reported findings are true?”
Attention
An attention check will be used to assess whether participants paid attention during the experiment by requiring them to select a specific response (i.e., “Select the option ‘somewhat disagree’ if you are paying attention.”). An exploratory analysis will test whether excluding participants who fail the attention check affects the results.
Procedure
To begin, participants will complete the informed consent form. Next, they will be shown the three vignettes in randomized order. After each vignette, participants will respond to items assessing integrity and likeability for Researcher X (investigator) followed by Researcher Y (inspector). Once they have completed those ratings for both the investigator and the inspector, they will be prompted to select a researcher on a list of positive attributes and to rate the perceived trueness of the investigator’s findings. It is important to note that participants never see the terms “whistleblower,” “investigator,” or “inspector”; the scenarios use only the labels “Reported Researcher (X)” and “Reporting Researcher (Y).” At a random point between vignettes, participants will answer the attention check question.
Finally, participants will respond to basic demographic items such as age, gender, sexual orientation, religious affiliation, race, and political orientation. The whole procedure will be conducted online and will take about 20 minutes.
Analysis Plan
All research questions will be tested using three-level multilevel models to account for repeated observations, where individual ratings (Level 1) are nested within vignettes (Level 2), which are nested within participants (Level 3). In this context, a cluster refers to a grouping of observations that are more similar to each other than to observations in other groups due to shared sources of variance. Specifically, participant clusters consist of all ratings provided by a single participant, capturing individual tendencies such as general positivity or negativity biases, whereas vignette clusters consist of all ratings for a given vignette, capturing variance due to vignette-specific effects.
To control for participants who tend to rate everything similarly across questions, participant-level random intercepts will account for a person’s general positivity/negativity biases, reducing the chance that the results are driven by a halo bias. Random intercepts will be specified for both participants and vignettes, and random slopes will be added for within-participant predictors where possible. A null model with varying intercepts will be run first to assess the variance accounted for by each cluster.
Model 0: Yijk = γ000 + ν0k + μ0jk + εijk.
Here, Yijk represents rating i from participant j on vignette k for a given DV; γ000 is the grand mean rating; ν0k is the random intercept for vignette (Level 2); μ0jk is the random intercept for participant (Level 3); and εijk is the trial-level residual error (Level 1). Intraclass correlations (ICCs) will be computed to partition variance across participants and vignettes.
Fixed effects will include Role (inspector = 1, investigator = –1; Level 2), QRP Severity (continuous; Level 1), and their interaction. If the full model fails to converge or is singular, alternative models with reduced random slopes will be compared via AIC/BIC and likelihood-ratio tests. The simplest model that converges and retains substantively meaningful variance components will be reported. All models will be estimated with restricted maximum likelihood (REML) for variance components and maximum likelihood (ML) for fixed-effect comparisons. Fixed effects will be interpreted using adjusted degrees of freedom (via lmerTest). Planned contrasts will be conducted with emmeans, with Bonferroni-adjusted alpha (α = 0.001) for the five primary tests of researcher differences within severity levels.
Intraclass correlations (ICCs). For planning purposes, this project assumes an intraclass correlation (ICC) of 0.50 (i.e., moderate between-person variance). The design effect is calculated as 1 + (m − 1) × ICC, with m = 6 observations per participant per DV; this design effect inflates naive sample-size requirements and is used in our analytic power calculations. For each DV, however, ICCs will be reported based on final model variance components, calculated as the proportion of variance attributable to participant clustering and, where applicable, vignette clustering (participant ICC = σ²μ / (σ²μ + σ²ν + σ²ε); vignette ICC = σ²ν / (σ²μ + σ²ν + σ²ε)). These values will be compared with the assumed ICC = 0.50 used in the a priori power analysis.
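The variance partition described above can be illustrated with a short sketch. It assumes the three variance components (participant, vignette, residual) have already been extracted from a fitted model; the function name and the example variances are hypothetical and serve only to show the arithmetic.

```python
from typing import Dict


def variance_partition(
    var_participant: float, var_vignette: float, var_residual: float
) -> Dict[str, float]:
    """Return participant and vignette ICCs as shares of the total variance."""
    components = (var_participant, var_vignette, var_residual)
    if any(v < 0 for v in components):
        raise ValueError("variance components cannot be negative")
    total = sum(components)
    if total == 0:
        raise ValueError("total variance is zero")
    return {
        "participant_icc": var_participant / total,
        "vignette_icc": var_vignette / total,
    }


if __name__ == "__main__":
    # Hypothetical variance components, not estimates from this study.
    iccs = variance_partition(var_participant=0.50, var_vignette=0.10, var_residual=0.40)
    assumed_icc = 0.50  # value assumed in the a priori power analysis
    for name, value in iccs.items():
        print(f"{name} = {value:.2f}")
    print("participant ICC exceeds planning value:", iccs["participant_icc"] > assumed_icc)
```

If the observed participant ICC turned out larger than the planning value, the achieved power would be lower than planned, which is why the comparison with the assumed ICC is reported.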
Validation and robustness checks. To ensure the constructs are empirically distinct, the factor structure of the integrity and likeability measures will first be examined using exploratory and/or confirmatory factor analysis, with factor loadings and interfactor correlations reported. This will help assess the potential influence of halo effects (i.e., global positive/negative evaluations). Correlations among composites and internal consistency will also be inspected, with the minimum threshold for Cronbach’s alpha set at α = .70. If any construct falls below this threshold, models will be tested at the item level (e.g., competence, integrity, likeability) to evaluate whether effects are robust across items rather than driven by a single measure.
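As a rough illustration of the internal-consistency check, the sketch below computes Cronbach's alpha and the averaged composite for a set of items using only the Python standard library. The ratings are made-up values for three integrity items, and the function names are hypothetical; the actual analysis may rely on dedicated psychometric software instead.

```python
from statistics import mean, variance
from typing import List, Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for k items, each a sequence with one score per respondent."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    if len({len(scores) for scores in items}) != 1:
        raise ValueError("every item needs one score per respondent")
    # Total score per respondent across the k items.
    totals = [sum(row) for row in zip(*items)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def composite_scores(items: Sequence[Sequence[float]]) -> List[float]:
    """Average the items into one composite score per respondent."""
    return [mean(row) for row in zip(*items)]


if __name__ == "__main__":
    # Hypothetical 5-point ratings from five respondents on three integrity items.
    ethical = [4, 5, 3, 2, 4]
    sincere = [4, 4, 3, 2, 5]
    honest = [5, 5, 2, 2, 4]
    alpha = cronbach_alpha([ethical, sincere, honest])
    print(f"alpha = {alpha:.2f}, meets .70 threshold: {alpha >= 0.70}")
    print(composite_scores([ethical, sincere, honest]))
```

A composite whose alpha falls below .70 would, under the plan above, be analyzed item by item rather than as an average.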
As an additional robustness check, models will be re-run with available demographic covariates (e.g., gender, political orientation, religiosity) to test whether results hold across subgroups. While no formal hypothesis specifies moderating effects of demographics, reporting these analyses will clarify whether findings are consistent across participant characteristics.
Because the study design allows for a possible crossover interaction at low severity levels, the Role × Severity interactions will be examined and plotted to test whether, at the lowest QRP severity level, inspectors are rated less favorably than investigators.
RQ1 Integrity and RQ2 Likeability
To examine research questions 1 (integrity) and 2 (likeability), a step-up modeling approach will be used to systematically test how each variable contributes to the model. Planned contrasts will compare inspectors and investigators at each severity level. If the integrity composite fails factor or reliability checks, items will be modeled separately with the same MLM structure. Given the possibility that integrity and likeability may need to be split, hypotheses will be considered fully supported if effects generalize across items and partially supported if effects hold only for some items.
Model 1: Integrity/Likeabilityijk = γ000 + γ100 (Researcherj) + ν0k + μ0jk + εijk
Model 2: Integrity/Likeabilityijk = γ000 + γ010 (Severityk) + ν0k + μ0jk + εijk
Model 3: Integrity/Likeabilityijk = γ000 + γ100 (Researcherj) + γ010 (Severityk) + ν0k + μ0jk + εijk
Model 4: Integrity/Likeabilityijk = γ000 + γ100 (Researcherj) + γ010 (Severityk) + γ110(Researcherj × Severityk) + ν0k + μ0jk + εijk
In this model, γ100 represents the between-vignette effect of researcher type (inspector vs. investigator), γ010 represents the within-participant effect of severity, and γ110 represents the cross-level interaction of researcher × severity. Random intercepts ν0k capture variance attributable to vignettes (Level 2) and μ0jk capture variance attributable to participants (Level 3).
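To make the interpretation of the Model 4 fixed effects concrete, the sketch below computes the model-implied gap between inspectors and investigators at each severity level. It assumes the effect coding stated in the analysis plan (inspector = 1, investigator = −1); the numeric severity codes (1 = minor, 2 = moderate, 3 = major) and all coefficient values are hypothetical and chosen only to illustrate a crossover pattern like the one predicted for likeability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FixedEffects:
    intercept: float    # γ000, grand mean
    role: float         # γ100, effect of role (inspector = 1, investigator = -1)
    severity: float     # γ010, effect of QRP severity
    interaction: float  # γ110, role × severity


def predicted_rating(fx: FixedEffects, role_code: int, severity: float) -> float:
    """Fixed-effects prediction for one role at one severity level."""
    return (
        fx.intercept
        + fx.role * role_code
        + fx.severity * severity
        + fx.interaction * role_code * severity
    )


def role_gap(fx: FixedEffects, severity: float) -> float:
    """Inspector minus investigator; algebraically 2 * (γ100 + γ110 * severity)."""
    return predicted_rating(fx, 1, severity) - predicted_rating(fx, -1, severity)


def crossover_severity(fx: FixedEffects) -> Optional[float]:
    """Severity at which the gap is zero, or None when there is no interaction."""
    if fx.interaction == 0:
        return None
    return -fx.role / fx.interaction


if __name__ == "__main__":
    # Hypothetical coefficients, not estimates from this study.
    fx = FixedEffects(intercept=3.0, role=-0.4, severity=-0.1, interaction=0.2)
    for severity in (1, 2, 3):  # assumed coding: minor, moderate, major
        print(f"severity {severity}: gap = {role_gap(fx, severity):+.2f}")
    print("crossover at severity:", crossover_severity(fx))
```

Because role is effect-coded, the gap between roles is twice the sum of the role coefficient and the severity-weighted interaction, so a negative γ100 combined with a positive γ110 produces a gap that narrows and eventually reverses as severity increases.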
RQ3: Trueness of the Finding
To examine research question 3, the main effect of QRP severity on perceived trueness will be assessed, with planned contrasts comparing severity levels.
Model FT1: Truenessijk = γ000 + γ010 (Severityk) + ν0k + μ0j + εijk
Table 2. Multilevel Model Analysis Plan
|
Research Question
|
Hypothesis
|
Sampling plan
|
Analysis Plan
|
Rationale for sensitivity decisions
|
Interpretation given different outcomes
|
Theory that could be shown wrong by the outcomes
|
|
R1: How is integrity influenced by role and QRP severity?
|
H1a: Scientific inspectors will be rated higher in Integrity than scientific investigators
|
Participants will be recruited through the University of Alabama participant pool and compensated with course credit. Participants will be excluded if they: 1) do not consent or 2) complete the study in less than seven minutes or 3) exhibit no variability in their responses (e.g., selecting all 5s or all 3s) across all trust and likeability items, and 4) fail to complete the trust and likeability ratings for at least one vignettes. Participants will also be recruited through prolific and BeSample if funding opportunities arise.
|
Model 1: Integrityijk = γ000 + γ100 (Researcherj) + ν0k + μ0jk + εijk
Random intercepts for participants (Level 3) and vignettes (Level 2); random slopes for within-participant predictors where possible.
|
The study was powered based on the smallest effect size of interest, defined as differences in ethical perceptions of researcher roles, drawn from prior literature. Assuming a conservative ICC of.50, the required sample size was estimated at 566 participants to achieve 80% power for detecting a small effect size (d =.24, f² ≈.014) with Bonferroni-corrected α = 0.001 for five planned comparisons.
|
Main effect of role: higher integrity ratings for inspector vs. investigator
|
If there is no effect, then it stands to question theories pertaining to the role of QRPs on laypersons perceptions.
|
|
H1b: Integrity gap widens with QRP severity
|
Model 4: Integrityijk = γ000 + γ100 (Researcherj) + γ010 (Severityk) + γ110(Researcherj × Severityk) + ν0k + μ0jk + εijk
|
Significant interaction: integrity difference between roles (inspector—investigator) increases with severity
|
|
R2: How is likeability influenced by role and QRP severity?
|
H2a: Scientific investigators will be rated as more likeable than scientific inspectors
|
Model 1: Likeabilityijk = γ000 + γ100 (Researcherj) + ν0k + μ0jk + εijk.; Random intercepts for participants (Level 3) and vignettes (Level 2); random slopes for within-participant predictors where possible.
|
Main effect of role: higher likeability for investigator vs. inspector
|
If there is no effect on likeability, this would call into question theories on social perceptions of whistleblowers as well as theories pertaining to perceptions of "tattletales."
|
|
H2b: Likeability gap narrows (or even reverses) as QRP severity increases
|
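Model 4: $\text{Likeability}_{ijk} = \gamma_{000} + \gamma_{100}(\text{Researcher}_{j}) + \gamma_{010}(\text{Severity}_{k}) + \gamma_{110}(\text{Researcher}_{j} \times \text{Severity}_{k}) + \nu_{0k} + \mu_{0jk} + \varepsilon_{ijk}$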
|
Significant interaction: likeability difference between roles (inspector minus investigator) decreases with severity.
|
|
R3: Is the perceived trueness of the finding influenced by QRP severity?
|
H3: Perceived trueness will decrease as QRP severity increases.
|
Model FT1: $\text{Trueness}_{ijk} = \gamma_{000} + \gamma_{010}(\text{Severity}_{k}) + \nu_{0k} + \mu_{0j} + \varepsilon_{ijk}$
Random intercepts for participants.
|
Main effect of severity: trueness decreases as severity increases
|
Should we find no effect on perceived trueness, this would call into question theories on the ability to update scientific beliefs when faced with misinformation.
|
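The exclusion rules listed in the sampling plan of the design table above can be sketched as a simple screening function. This is a minimal illustration, not the study's actual pipeline: the `Participant` record and its field names are hypothetical, and the fourth rule is read here as "exclude when no vignette has a complete set of trust and likeability ratings," which is one possible interpretation of the wording.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical record layout; field names are illustrative, not from the study materials.
@dataclass
class Participant:
    consented: bool
    duration_minutes: float
    # vignette label -> trust and likeability ratings (1-5); None marks a skipped item
    ratings: Dict[str, List[Optional[int]]]


MIN_DURATION_MINUTES = 7  # threshold stated in the sampling plan


def exclusion_reasons(p: Participant) -> List[str]:
    """Return the exclusion rules a participant triggers (empty list = retained)."""
    reasons = []
    if not p.consented:
        reasons.append("no consent")
    if p.duration_minutes < MIN_DURATION_MINUTES:
        reasons.append("completed too quickly")

    answered = [r for items in p.ratings.values() for r in items if r is not None]
    # Straightlining: every answered item has the same value (e.g., all 5s or all 3s).
    if answered and len(set(answered)) == 1:
        reasons.append("no response variability")

    # Assumed reading of rule 4: exclude when no vignette is fully rated.
    has_complete_vignette = any(
        items and all(r is not None for r in items) for items in p.ratings.values()
    )
    if not has_complete_vignette:
        reasons.append("no vignette fully rated")
    return reasons


careful = Participant(True, 12.5, {"minor": [4, 2, 5, 3], "major": [1, 5, 2, 4]})
speeder = Participant(True, 3.0, {"minor": [3, 3, 3, 3], "major": [3, None, 3, 3]})
print(exclusion_reasons(careful))  # []
print(exclusion_reasons(speeder))  # ['completed too quickly', 'no response variability']
```

Returning the list of triggered rules, rather than a single yes/no flag, makes it straightforward to report how many participants were excluded under each criterion.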
Exploratory Analysis
Positive Attributes. An exploratory analysis of the positive attribute scores will examine overall perceptions of researcher role. Participant descriptives will be calculated for each vignette and aggregated across vignettes. For each individual vignette, the frequency with which each researcher role was chosen will be reported for each of the nine attribute questions; this will be done for each of the three vignettes. When aggregating across vignettes, effects will be assessed systematically using a step-up modeling approach for perceived researcher attributes. Planned contrasts will compare inspectors and investigators at each severity level.
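Model A1: $\text{Attributes}_{ijk} = \gamma_{000} + \gamma_{100}(\text{Researcher}_{j}) + \gamma_{101}(\text{Smarter}) + \gamma_{102}(\text{Ethical}) + \gamma_{103}(\text{Better Researcher}) + \gamma_{104}(\text{Creative}) + \gamma_{105}(\text{Rather Be}) + \gamma_{106}(\text{Should Be}) + \gamma_{107}(\text{Celebrated}) + \gamma_{108}(\text{Typical}) + \gamma_{109}(\text{Job Security}) + \gamma_{110}(\text{Job Outcome}) + \nu_{0k} + \mu_{0jk} + \varepsilon_{ijk}$
Model A2: $\text{Attributes}_{ijk} = \gamma_{000} + \gamma_{101}(\text{Smarter}) + \gamma_{102}(\text{Ethical}) + \gamma_{103}(\text{Better Researcher}) + \gamma_{104}(\text{Creative}) + \gamma_{105}(\text{Rather Be}) + \gamma_{106}(\text{Should Be}) + \gamma_{107}(\text{Celebrated}) + \gamma_{108}(\text{Typical}) + \gamma_{109}(\text{Job Security}) + \gamma_{110}(\text{Job Outcome}) + \gamma_{010}(\text{Severity}_{k}) + \nu_{0k} + \mu_{0jk} + \varepsilon_{ijk}$
Model A3: $\text{Attributes}_{ijk} = \gamma_{000} + \gamma_{100}(\text{Researcher}_{j}) + \gamma_{101}(\text{Smarter}) + \gamma_{102}(\text{Ethical}) + \gamma_{103}(\text{Better Researcher}) + \gamma_{104}(\text{Creative}) + \gamma_{105}(\text{Rather Be}) + \gamma_{106}(\text{Should Be}) + \gamma_{107}(\text{Celebrated}) + \gamma_{108}(\text{Typical}) + \gamma_{109}(\text{Job Security}) + \gamma_{110}(\text{Job Outcome}) + \gamma_{010}(\text{Severity}_{k}) + \nu_{0k} + \mu_{0jk} + \varepsilon_{ijk}$
Model A4: $\text{Attributes}_{ijk} = \gamma_{000} + \gamma_{100}(\text{Researcher}_{j}) + \gamma_{101}(\text{Smarter}) + \gamma_{102}(\text{Ethical}) + \gamma_{103}(\text{Better Researcher}) + \gamma_{104}(\text{Creative}) + \gamma_{105}(\text{Rather Be}) + \gamma_{106}(\text{Should Be}) + \gamma_{107}(\text{Celebrated}) + \gamma_{108}(\text{Typical}) + \gamma_{109}(\text{Job Security}) + \gamma_{110}(\text{Job Outcome}) + \gamma_{010}(\text{Severity}_{k}) + \gamma_{111}(\text{Researcher}_{j} \times \text{Severity}_{k}) + \nu_{0k} + \mu_{0jk} + \varepsilon_{ijk}$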
In these models, $\gamma_{100}$ represents the within-participant effect of researcher type (inspector vs. investigator), $\gamma_{010}$ represents the vignette-level effect of severity, and $\gamma_{111}$ represents the cross-level interaction of researcher × severity. The random intercepts $\nu_{0k}$ capture variance attributable to vignettes (Level 2), and the random intercepts $\mu_{0jk}$ capture variance attributable to participants (Level 3).
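The descriptive step for the positive attribute items (frequencies of the chosen researcher role per attribute question, reported per vignette and then aggregated) could be tabulated along the following lines. This is only a sketch under assumed conventions: the long-format `responses` list, the vignette labels, and the attribute labels are hypothetical placeholders, while the two response options follow the wording in Appendix A.

```python
from collections import Counter, defaultdict

# Hypothetical long-format data: (vignette, attribute question, chosen role).
responses = [
    ("minor", "smarter", "Reporting Researcher"),
    ("minor", "smarter", "Reported Researcher"),
    ("minor", "smarter", "Reporting Researcher"),
    ("major", "smarter", "Reporting Researcher"),
    ("major", "rather be", "Reporting Researcher"),
    ("major", "rather be", "Reported Researcher"),
]


def tabulate_by_vignette(rows):
    """Count how often each role was chosen, per vignette and attribute question."""
    table = defaultdict(Counter)
    for vignette, attribute, choice in rows:
        table[(vignette, attribute)][choice] += 1
    return table


def aggregate_across_vignettes(table):
    """Sum the per-vignette counts so each attribute has one overall tally."""
    totals = defaultdict(Counter)
    for (_vignette, attribute), counts in table.items():
        totals[attribute].update(counts)
    return totals


per_vignette = tabulate_by_vignette(responses)
overall = aggregate_across_vignettes(per_vignette)

for (vignette, attribute), counts in sorted(per_vignette.items()):
    print(vignette, attribute, dict(counts))
for attribute, counts in sorted(overall.items()):
    print("all vignettes", attribute, dict(counts))
```

Keeping the per-vignette table separate from the aggregated tally mirrors the two reporting levels described in the text.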
Supporting Information:
Supporting information is available here: https://osf.io/vhmzg
References
Abraham, J., Mangapul, C. J., Amaniputri, D. N., Manurung, R. H., & Ispurwanto, W. (2023). Intention to whistleblow: Perception of reporting skill mediates the predicting role of class consciousness and perceived probability of revenge (No. 12:1566). F1000Research. https://doi.org/10.12688/f1000research.142265.1
Anderson, S. F., & Maxwell, S. E. (2017). Addressing the “Replication Crisis”: Using Original Studies to Design Replication Studies with Appropriate Statistical Power. Multivariate Behavioral Research, 52(3), 305–324. https://doi.org/10.1080/00273171.2017.1289361
Anvari, F., & Lakens, D. (2018). The replicability crisis and public trust in psychological science. Comprehensive Results in Social Psychology, 3(3), 266–286. https://doi.org/10.1080/23743603.2019.1684822
Arslan, R. C., Walther, M. P., & Tata, C. S. (2020). formr: A study framework allowing for automated feedback generation and complex longitudinal experience-sampling studies using R. Behavior Research Methods, 52(1), 376–387. https://doi.org/10.3758/s13428-019-01236-y
Artino, A. R., Driessen, E. W., & Maggio, L. A. (2019). Ethical Shades of Gray: International Frequency of Scientific Misconduct and Questionable Research Practices in Health Professions Education. Academic Medicine, 94(1), 76–84. https://doi.org/10.1097/ACM.0000000000002412
Bottesini, J. G., Rhemtulla, M., & Vazire, S. (2022). What do participants think of our research practices? An examination of behavioural psychology participants’ preferences. Royal Society Open Science, 9(4), 200048. https://doi.org/10.1098/rsos.200048
Chambers, C. (2017). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton University Press.
Cheliatsidou, A., Sariannidis, N., Garefalakis, A., Passas, I., & Spinthiropoulos, K. (2023). Exploring Attitudes towards Whistleblowing in Relation to Sustainable Municipalities. Administrative Sciences, 13(9), 199. https://doi.org/10.3390/admsci13090199
Cologna, V., Mede, N. G., Berger, S., Besley, J., Brick, C., Joubert, M., Maibach, E. W., Mihelj, S., Oreskes, N., Schäfer, M. S., van der Linden, S., Abdul Aziz, N. I., Abdulsalam, S., Shamsi, N. A., Aczel, B., Adinugroho, I., Alabrese, E., Aldoh, A., Alfano, M., … Zwaan, R. A. (2025). Trust in scientists and their role in society across 68 countries. Nature Human Behaviour, 1–18. https://doi.org/10.1038/s41562-024-02090-5
Ebersole, C. R., Axt, J. R., & Nosek, B. A. (2016). Scientists’ Reputations Are Based on Getting It Right, Not Being Right. PLOS Biology, 14(5), e1002460. https://doi.org/10.1371/journal.pbio.1002460
Fanelli, D. (2018). Is science really facing a reproducibility crisis, and do we need it to? Proceedings of the National Academy of Sciences, 115(11), 2628–2631. https://doi.org/10.1073/pnas.1708272114
Frankenhuis, W. E., & Nettle, D. (2018). Open Science Is Liberating and Can Foster Creativity. Perspectives on Psychological Science, 13(4), 439–447. https://doi.org/10.1177/1745691618767878
Ioannidis, J. P. A. (2022). Correction: Why Most Published Research Findings Are False. PLOS Medicine, 19(8), e1004085. https://doi.org/10.1371/journal.pmed.1004085
Isbell, D. R., Brown, D., Chen, M., Derrick, D. J., Ghanem, R., Arvizu, M. N. G., Schnur, E., Zhang, M., & Plonsky, L. (2022). Misconduct and Questionable Research Practices: The Ethics of Quantitative Data Handling and Reporting in Applied Linguistics. The Modern Language Journal, 106(1), 172–195. https://doi.org/10.1111/modl.12760
Kolstoe, S. (2023). Defining the Spectrum of Questionable Research Practices (QRPs). UK Research Integrity Office. https://doi.org/10.37672/UKRIO.2023.02.QRPs
Larsson, T., Plonsky, L., Sterling, S., Kytö, M., Yaw, K., & Wood, M. (2023). On the frequency, prevalence, and perceived severity of questionable research practices. Research Methods in Applied Linguistics, 2(3), 100064. https://doi.org/10.1016/j.rmal.2023.100064
LeBel, E. P., Campbell, L., & Loving, T. J. (2017). Benefits of open and high-powered research outweigh costs. Journal of Personality and Social Psychology, 113(2), 230–243. https://doi.org/10.1037/pspi0000049
Lubalin, J., Ardini, M.-A. E., & Matheson, J. (1995). Consequences of whistleblowing for the whistleblower in misconduct in science cases: Final report. Research Triangle Institute.
Lubalin, J. S., & Matheson, J. L. (1999). The fallout: What happens to whistleblowers and those accused but exonerated of scientific misconduct? Science and Engineering Ethics, 5(2), 229–250. https://doi.org/10.1007/s11948-999-0014-9
Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., Glennerster, R., Green, D. P., Humphreys, M., Imbens, G., Laitin, D., Madon, T., Nelson, L., Nosek, B. A., Petersen, M., Sedlmayr, R., Simmons, J. P., Simonsohn, U., & Van Der Laan, M. (2014). Promoting Transparency in Social Science Research. Science, 343(6166), 30–31. https://doi.org/10.1126/science.1245317
Monin, B., Sawyer, P. J., & Marquez, M. J. (2008). The rejection of moral rebels: Resenting those who do the right thing. Journal of Personality and Social Psychology, 95(1), 76–93. https://doi.org/10.1037/0022-3514.95.1.76
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie Du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021. https://doi.org/10.1038/s41562-016-0021
National Academy of Sciences, National Academy of Engineering, & Institute of Medicine. (2009). On Being a Scientist: A Guide to Responsible Conduct in Research: Third Edition. National Academies Press. https://doi.org/10.17226/12192
Nicholls, A. R., Fairs, L. R. W., Toner, J., Jones, L., Mantis, C., Barkoukis, V., Perry, J. L., Micle, A. V., Theodorou, N. C., Shakhverdieva, S., Stoicescu, M., Vesic, M. V., Dikic, N., Andjelkovic, M., Grimau, E. G., Amigo, J. A., & Schomöller, A. (2021). Snitches Get Stitches and End Up in Ditches: A Systematic Review of the Factors Associated with Whistleblowing Intentions. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.631538
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Pennycook, G., & Rand, D. G. (2021). The Psychology of Fake News. Trends in Cognitive Sciences, 25(5), 388–402. https://doi.org/10.1016/j.tics.2021.02.007
Reysen, S. (2005). Construction of a new scale: The Reysen Likability Scale. Social Behavior and Personality, 33(2), 201–208.
Schooler, J. W. (2014). Metascience could rescue the ‘replication crisis.’ Nature, 515(7525), 9–9. https://doi.org/10.1038/515009a
Sijtsma, K. (2016). Playing with Data: Or How to Discourage Questionable Research Practices and Stimulate Researchers to Do Things Right. Psychometrika, 81(1), 1–15. https://doi.org/10.1007/s11336-015-9446-0
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Sterling, S., Yaw, K., Plonsky, L., Larsson, T., & Kytö, M. (2025). Investigating researcher perceptions of Questionable Research Practices. Journal of Second Language Studies. https://doi.org/10.1075/jsls.00048.ste
Swift, J. K., Christopherson, C. D., Bird, M. O., Zöld, A., & Goode, J. (2022). Questionable research practices among faculty and students in APA-accredited clinical and counseling psychology doctoral programs. Training and Education in Professional Psychology, 16(3), 299–305. https://doi.org/10.1037/tep0000322
Uhlmann, E. L., Ebersole, C. R., Chartier, C. R., Errington, T. M., Kidwell, M. C., Lai, C. K., McCarthy, R. J., Riegelman, A., Silberzahn, R., & Nosek, B. A. (2019). Scientific Utopia III: Crowdsourcing Science. Perspectives on Psychological Science, 14(5), 711–733. https://doi.org/10.1177/1745691619850561
Wingen, T., Berkessel, J. B., & Englich, B. (2020). No Replication, No Trust? How Low Replicability Influences Trust in Psychology. Social Psychological and Personality Science, 11(4), 454–463. https://doi.org/10.1177/1948550619877412
Yong, E. (2017, April 5). How the GOP Could Use Science’s Reform Movement Against It. The Atlantic. https://www.theatlantic.com/science/archive/2017/04/reproducibility-science-open-judoflip/521952/
Appendix A Questionnaires
Response options vary by item, but all use a 5-point Likert scale (e.g., 1 = very (unethical), 2 = somewhat (unethical), 3 = neither (ethical) nor (unethical), 4 = somewhat (ethical), 5 = very (ethical)).
- Integrity (T) - Cologna et al., 2025
  - How ethical or unethical is Researcher X?
  - How ethical or unethical is Researcher Y?
  - How honest or dishonest is Researcher X?
  - How honest or dishonest is Researcher Y?
  - How sincere or insincere is Researcher X?
  - How sincere or insincere is Researcher Y?
- Likeable (L) - Reysen, 2005
  - How likeable or unlikeable is Researcher X?
  - How likeable or unlikeable is Researcher Y?
- Approachable (L) - Reysen, 2005
  - How approachable or unapproachable is Researcher X?
  - How approachable or unapproachable is Researcher Y?
- The original effects (trueness) - Ebersole et al., 2016
  - Based on what you know about the actions of Researcher X, how confident are you that their reported findings are true?
- Positive Attributes - Ebersole et al., 2016
  - Which researcher is smarter: Reported Researcher or Reporting Researcher?
  - Which researcher is a better researcher: Reported Researcher or Reporting Researcher?
  - Which researcher is more creative: Reported Researcher or Reporting Researcher?
  - Which researcher would you rather be: Reported Researcher or Reporting Researcher?
  - Which researcher should you be: Reported Researcher or Reporting Researcher?
  - Which researcher is more like the most celebrated researcher: Reported Researcher or Reporting Researcher?
  - Which researcher is more like the typical researcher: Reported Researcher or Reporting Researcher?
  - Which researcher is more likely to keep a job: Reported Researcher or Reporting Researcher?
  - Which researcher is more likely to get a job: Reported Researcher or Reporting Researcher?
Appendix B Vignettes
- Minor Severity QRP (Overreaching Abstracts)
Researcher X conducted a study and found some interesting results. They submitted the findings to a journal for publication, but the title and abstract implied the study had much broader implications than it truly did.
Researcher Y, after reading the abstract, writes a blog post highlighting how Researcher X’s misrepresentations can mislead readers about the scope and findings of the research. Researcher Y emphasizes that imprecise titles and abstracts lead to research being inaccurately represented.
- Moderate Severity QRP (Selective Reporting)
Researcher X conducted a study and found some interesting results. Researcher Y reviewed the data and realized that Researcher X only reported the findings that confirmed their hypotheses and failed to mention other findings that challenged their hypotheses.
Researcher Y writes a blog post presenting evidence on how Researcher X cherry-picked the findings they reported to make their results seem stronger. Researcher Y emphasizes that this lack of transparency misleads the scientific community and undermines the integrity of science.
- Major Severity QRP (Data Fabrication)
Researcher X conducted a study and found some interesting results. Upon reviewing the data, Researcher Y noticed that Researcher X made up large sections of the dataset.
Researcher Y writes a blog post highlighting evidence of Researcher X’s fabricated data as a prime example of how this kind of fraud not only tarnishes the credibility of the research but can also lead to severe career repercussions for collaborators. Researcher Y emphasizes that data fabrication has devastating consequences for scientific credibility and public trust in science.