Published at MetaROR
November 13, 2025
Reliability, bias and randomisation in peer review: a simulation
Alex Hulkes1, Cillian Brophy2, Ben Steyn3
Originally published on August 28, 2025 at:
Editors
Ludo Waltman
Editorial Assessment
by Ludo Waltman
This article presents a simulation study of the use of partial randomization in research funding allocation. The article has been reviewed by two reviewers. Reviewer 1 considers the article to be timely and methodologically sound. The reviewer praises the clear presentation of the methods and results. Reviewer 2 stresses that understanding the trade-offs involved in partial randomization is becoming ever more important given the growing adoption of partial randomization by research funders. The reviewers have various recommendations for improving the article. Reviewer 1 suggests there may be an opportunity to obtain deeper insights from some of the unexpected simulation results, which could strengthen the relevance of the study for funders. Both reviewers ask for an explanation of the surprising findings presented in Figure 7. The reviewers also challenge some of the assumptions underlying the simulation model used in the study, and they offer suggestions for clarifying the interpretation of some of the results of the study.
Recommendations for enhanced transparency
- Include in the body of the article a comprehensive data availability statement. Make publicly available the UKRI data (used in Figure 2) or explain why the data cannot be shared.
- All source code is publicly available on GitHub. However, GitHub URLs are not permanent. Issue a DOI for the GitHub repository and include it in the article.
- Add a competing interest statement. Authors should report all competing interests, including not only financial interests, but any role, relationship, or commitment of an author that presents an actual or perceived threat to the integrity or independence of the research presented in the article. If no competing interests exist, authors should explicitly state this.
- Add a funding source statement. Authors should report all funding in support of the research presented in the article. Grant reference numbers should be included. If no funding sources exist, explicitly state this in the article.
Peer Review 1
This paper addresses a timely issue that has recently received substantial attention: the potential use of partial randomisation in grant allocation procedures. Through a simulation study, it aims to better understand the implications of introducing randomisation for the reliability of the allocation process and the extent to which biases can affect funding decisions. In the simulated grant review process, the authors indeed find that increased randomisation has the potential to reduce bias, while at the same time leading to what they refer to as ‘less reliable’ decisions.
This paper has the potential to contribute to ongoing debates about innovations in grant allocation processes, particularly those related to the efficiency and fairness of such processes. The manuscript provides a decent introduction to this topic, mentioning several aspects that critically inform the debate and then explaining why it will focus on some of them. The simulation methods are described in sufficient detail and seem methodologically sound, even though it is clear that many alternative choices could have been made and the impact of making these choices is not always clear. The findings are also presented in a clear and structured manner.
My main concern relates to the nature of the simulation approach and the interpretability of the findings. The authors themselves repeatedly point to “the imprecise nature of the simulated data” (e.g. p13) or the simplicity of their simulation model, advising readers to be cautious with using their findings in real-world contexts or outright advising not to read too much into certain observations. While the authors’ modesty in presenting their work is highly appreciated, this does raise the question to what extent the findings can inform ongoing debates. Obviously, in such simulation studies, many simplifications and assumptions have to be made, all of which directly affect the results. Moreover, it has repeatedly been shown how social dynamics among selection panels are crucial in determining funding decisions. The simulation models currently do not attempt to capture these dynamics, potentially further reducing their applicability to real-world scenarios.
In the conclusion, the authors themselves raise the question of what practical advice funders might take away from this simulation, concluding that it depends on their risk appetite. To me, however, the most important takeaway of the study is its pointers to phenomena of interest that require further experimentation. The authors themselves point out that some findings are as expected (which to me is mostly an indication of the appropriateness of their simulation), while others seem more surprising or counterintuitive (such as the findings in Fig 7). Tracing back what causes these findings, which might either be artifacts of the simulation model or some inherent property of funding decision processes, seems to me the exciting part of the manuscript’s results. Perhaps the authors could incorporate some suggestions along these lines in their paper.
Below I present additional specific comments and suggestions that might help the authors to further develop their manuscript:
- The model used to simulate the decision process now assumes that bias will necessarily lead to a lower score for the proposal under review. I guess in a realistic scenario, bias could equally lead to higher scores. Maybe the authors could integrate this element of symmetry into their model.
- The authors choose to refer to the ‘reliability’ of a decision process, which is measured by the accuracy and F1 score. I wonder whether simply sticking to the term ‘accuracy’ instead of reliability would do more justice to the nature of the measure. What is being measured is the distance between the simulated decision and the simulated ground truth, for which accuracy seems to me like an appropriate label.
- I have some doubt about one of the core assumptions made by the authors: a loss of information leads to a decrease in bias. First of all, I would argue that in this context, the information itself is not biased, but only an interpretation of information could be. But even then, I would argue that more information does not necessarily lead to more bias. Rather, it is mostly in the context of a lack of sufficient information that people tend to fall back to stereotypes or other preconditioned frames that ultimately could lead to biased decisions (e.g. only knowing that a person belongs to group X, could trigger stereotype projections of X’s characteristics on the individual, whereas more information about this person could reduce the need to fall back to such stereotype projections and hence reduce bias). Hence, the authors might want to reconsider this assumption.
- The manuscript repeatedly refers to simulated scores ‘reasonably closely’ matching those of real UKRI proposals. This would benefit from a little more specification. Figure 2 gives some information on this, but without numeric values on the y-axis it is somewhat difficult to interpret. Nevertheless, the figure indicates that simulated scores are particularly more likely to be in the 2-3 range than those of actual proposals. A brief discussion on the potential implications of this for the study’s findings would be helpful.
- In section 4.2 the authors mention that their findings suggest “that the bias factors may be somewhat too large.” This assumes that the authors had some expectation about what is a realistic divergence of the ‘ground truth’ due to bias. Can the authors elaborate on this expectation and what it is based on?
- The authors, in their results section, fairly descriptively present their findings, sometimes indicating whether some result should be considered surprising or not, but otherwise largely refraining from interpreting the findings. I would like to encourage the authors to go a step further in an attempt to interpret their results. In particular, a reader might be interested to learn about (a) the extent to which the findings are (only) an artifact of choices made in the simulations and (b) how we should interpret some of the more surprising findings and what process might be causing them (e.g. in relation to figure 7). The authors build on this latter statement in their conclusions (“The middle ground may not be a safe place to reside when it comes to implementing partial randomization”), making it all the more relevant to provide some interpretation of the results and a description of the mechanism that causes this effect.
- In section 4.5 the authors mention that “higher meeting reliability (F1) lead to higher mean fundability of funded proposals”. I was somewhat surprised by this statement, because the causal link seems not to be warranted here. Both the F1 and mean fundability of proposals are derivatives of the input data, rather than the one being the consequence of the other. Hence, a statement about correlation rather than causation seems more appropriate here, unless I have misunderstood the authors’ simulations.
- The references (both in-text citations and the reference list) now only mention the first author of the cited works.
Peer Review 2
Hulkes and colleagues explore the important trade-off that underlies much of the academic debate on the desirability of allocating research funding by lot. On the one hand, lotteries may help curb the costs of peer review and undesirable biases in funding decisions. On the other hand, lotteries may also impair the ability of reviewers and review panels to identify and select the most promising project proposals. As more funders adopt some form of randomization in their processes, understanding this trade-off is becoming ever more important.
To date, only very few studies have tackled this issue. The paper re-examines some of the extant theoretical work with a computer simulation experiment. Crucially, the paper takes a different approach to previous work: grant peer review is modeled hinging on a different, reasonable set of assumptions, and bias and correctness of decisions are operationalized differently. This makes it possible to compare results across the two implementations. To the extent that results from two alternative implementations agree with one another, the validity and generalizability of these results is strengthened; and when results come apart, we learn that some of the diverging assumptions can be consequential. This paper does both: it corroborates conclusions from previous research, but it also finds new intriguing effects that were never observed or reported before.
I list here my comments in a bulleted list, arranging them by theme.
Modeling assumptions
- To simulate a meeting, the authors draw a random set of proposals from a large pool of pre-generated proposals. It may happen, as is pointed out at the start of Section 2.3, that the same proposal is chosen for more than one meeting. This modeling approach is convincing, I think. What I find less convincing is the argument made in Section 2.3 that re-uses are “in some ways similar to resubmission of a rejected proposal […]”. I do not find this to be a strong motivation for allowing re-uses in the simulation because it implies that resubmissions would be identical to the original submission. However, real-world resubmitted proposals are not typically identical to the original submission: they may have been updated to incorporate feedback from the previous submission; or they may be changed in order to better fit the new funding call to which they are resubmitted. To reiterate, the modeling choice is not inherently problematic per se – I agree that occasional ‘re-uses’ of the same proposal are probably inconsequential. However, I see re-uses as a mere artifact, rather than a realistic modeling choice. Therefore, if you agree with me, my advice is to remove this argument, without necessarily changing the underlying modeling assumption. The text at the very top of page 7 hints at the possibility of preventing this artifact – I suppose by drawing proposals from the pool without replacement. This would be a valid solution, too.
- While my previous point was about a modeling assumption that was justified as realistic whereas I think it is not, the situation is reversed for another modeling assumption. In the third paragraph of 2.3, a very much realistic assumption is written off as a simplification. Here the text explains that the ranking of proposals and funding decisions are based on the average of the scores assigned to the proposals by reviewers. The “average” rule is introduced as a “necessary simplification”. I have two things to say in this regard. The first is that this is strictly not a “necessary” simplification – it is not necessary because it could just as easily be done in some other way.
Secondly, I do not think this is a “simplification” at all. Some funders mandate that panels consider score averages in some of the steps involved in the construction of a ranking of proposals. To cite an example I’m familiar with: in some calls by Research Ireland, the peer review protocol instructs the panel quite explicitly to average scores in order to rank proposals. Quoting from the 2026 call document for their postdoctoral fellowships: “If the total average score is the same between two or more applications, applications with the same average scores will be ranked according to the higher average score under the highest weighted category”.
Furthermore, even for funders that do not explicitly mandate averaging, research shows that the average score of a proposal’s grades is a very good predictor of the final funding decision (see, e.g., Pina et al. 2021). This suggests that the modeling assumption is more than a necessary simplification: rather, it can be presented as a realistic assumption that reflects how rankings are constructed – following formal or informal protocols – by panels in different research funding organizations.
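For concreteness, a minimal sketch in R of the averaging rule just described (the proposal count, reviewer count, scoring scale, and funding quota are assumed for illustration; this is not the paper's actual code):

# Rank proposals by mean reviewer score and fund the top of the ranking.
# All sizes below are assumed values, chosen only for illustration.
set.seed(1)
n_proposals <- 10
n_reviewers <- 3
scores <- matrix(sample(1:6, n_proposals * n_reviewers, replace = TRUE),
                 nrow = n_proposals)

mean_score <- rowMeans(scores)                      # average across reviewers
ranking    <- order(mean_score, decreasing = TRUE)  # best proposal first
funded     <- ranking[1:3]                          # fund the top 3 (assumed quota)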
Idea attribution
- I have a comment on the attribution of the idea of partial randomization in science funding. Paragraph 2 cites Avin (2019) for having identified “the oldest reference to [partial randomization] in an academic journal”. Indirectly this is crediting Greenberg’s opinion piece in The Lancet (Greenberg, 1998) for originating the idea. For one, I think that Greenberg should be credited directly by citing their work (Greenberg 1998) rather than (or in addition to) Avin who reports on it.
Secondly, I wonder whether Greenberg really was the first to write about partial randomization in science funding – in the way we intend it today. Their 1998 proposal is indeed to distribute funding randomly among qualified researchers “whose qualifications and projects have been certified as respectable” – suggesting a partially randomized approach. However, in Greenberg’s view, this ‘respectability check’ is meant to replace peer review, and would only take “a fraction” of its cost. Arguably, many contemporary proponents of partial randomization – and, as far as I know, all funders running partial randomization – are not replacing traditional peer review with a much simpler ‘respectability check’. Partial lotteries today still involve a peer review panel that operates, for the most part, the way it always has.
Therefore, my opinion based on how I interpret Greenberg’s words is that the idea of partial randomization in science funding – the way it is implemented today – is to be credited to Brezis (2007), whereas I would credit Greenberg (1998) for being the first to propose randomization ‘tout-court’.
In conclusion, I would recommend citing Brezis (2007) if you agree with me that Brezis’ proposal is a better fitting precursor to partial randomization as we know it today; and to explicitly cite Greenberg (1998) if you credit their opinion piece as the oldest published example. Citing both – Greenberg for proposing lotteries and Brezis for proposing modern partial lotteries – would also be a fair solution.
Results
- I have two comments about Figure 5. The first comment is about the level of bias in this figure. Is this figure showing all simulations from all levels of bias lumped together (vs showing simulations from a specific value of bias)? Either way, this should be made clear in the text or caption.
- The second comment is about the unit of analysis – i.e., the points on the scatterplot – which, here, are combinations of randomization and bias levels. As the caption points out, some combinations will appear very often in the data, whereas others will only have been simulated once or a few times. The caption also points out that edge cases – those with very high or low randomization or bias – are rare, and thus have little bearing on the regression line shown in the figure. I wonder whether the same plot would be more effective if it was showing individual simulations rather than parameter combinations: it would then show that rare combinations are, well, rare. However, I can imagine that a scatterplot with all 100k simulations will be extremely busy: would then a heatmap (along the lines of Figure 4) be a solution?
- About Section 4.4 and Figure 7. This section concludes by observing some non-linearities in the relationship between bias, randomization, and reliability – i.e., the diagonal ‘beam’. I am more than a bit intrigued by this effect. The paper does not attempt an explanation for it, yet this is a non-trivial and completely unexpected effect. I wonder whether this is a substantive result or an artifact. Can you offer a speculation, if not an explanation?
Smaller points
- A curiosity about Figure 2. If I understand correctly, this is showing how actual and simulated review scores are distributed, with vertical black lines showing how a more granular value space is discretized into the six available grades. If I understood correctly, what I’m curious about is why there is a more granular value space to begin with: why is there not a single frequency value for each of the intervals on the X-axis, but there seems to be a more continuous distribution? If I had to guess, I would say that the red line is actually showing reviewer scores averaged across criteria – and because there are multiple criteria, the average score of a reviewer for a proposal is on a scale that is more granular than the 1-to-6 scale used to score each criterion. Either way, I would recommend explaining this in the caption.
- Second half of page 5: “In (Feliciani 2024) the intrinsic fundability is called a reference evaluation”. The concepts of intrinsic fundability and reference evaluation are indeed similar, and in this context they refer to the same element of the simulation. However, since they are two distinct concepts, I would rephrase this sentence in a way that avoids implying they are one and the same. For example: “The ground truth for the evaluation of proposals is modeled in a similar way by Feliciani et al. (2024). They also draw from a beta distribution …”, or “Feliciani et al. (2024) do not explicitly model intrinsic fundability, relying instead on a different ground truth. Nonetheless, the ground truth in their simulations is modeled in a similar way, i.e., by drawing from a beta distribution…”, or similar.
- The second paragraph of section 2.3 describes the two types of simulated meetings. The description of the second type is unclear – I struggled to see how it was not a rephrasing of what was said for the first type. In case it can be of help to debug where I get lost here: I take the word “groups” to be the same as “sets of reviews”. Thus, I think that the whole paragraph can be summarized as follows – notice how I split the second type into two: Each simulated meeting can be of three kinds: (1) a meeting where no review is biased; (2) a meeting where all reviews are biased; (3) a meeting where each review has a 50% likelihood of being biased. For each meeting of the third kind, we also simulate, in parallel, a meeting of the first or second kind – determined at random, with equal chance – with the same set of proposals.
If that sounds about right, then the first two paragraphs of section 2.3 can be considerably streamlined; else, there is a more fundamental problem understanding how the experiment was set up.
Furthermore, there’s a typo at the beginning of the very next paragraph: “In both types of meeting […]”.
- I have a few recommendations or comments about Figure 1 – the figure showing various parameterizations of the beta distribution. First, for consistency with the text, consider replacing “shape 1” and “shape 2” with α and β. This avoids the need to explain panel titles in the caption. Second, the beta distribution is used in other parts of the simulation as well – for example, to determine the funding rate of a given meeting (α = 6, β = 14). Furthermore, in case you haven’t considered it, I wonder whether it would help the reader to have all possible beta distributions in this figure, beyond just the distributions used for drawing review scores. I can think of arguments why not to include other uses of the beta distribution, so I suppose the figure could work either way. Third, Figure 1 is introduced in section 4.1, but mentions of the beta distribution appear much earlier in the paper. Perhaps Figure 1 would then be more useful if introduced earlier.
- The second-last paragraph of Section 5.2. (“Given the apparent ease […]”) is particularly hard to follow due to the way it is phrased. I would recommend using a simpler sentence structure.
- The style of references – both in line and in the reference list – does not seem to comply with any standard I know (e.g., APA 7th), and there seem to be a few typos in the reference list. I recommend double-checking for both style and typos.
- For reproducibility, consider publicly sharing your code on an open repository such as CoMSES (comses.net) or Zenodo (zenodo.org).
References
Avin, S. (2019). Mavericks and lotteries. Studies in History and Philosophy of Science Part A, 76, 13-23.
Brezis, E. S. (2007). Focal randomisation: An optimal mechanism for the evaluation of R&D projects. Science and Public Policy, 34, 691-698.
Feliciani, T., Luo, J., & Shankar, K. (2024). Funding lotteries for research grant allocation: An extended taxonomy and evaluation of their fairness. Research Evaluation, 33, rvae025.
Greenberg, D. S. (1998). Chance and grants. The Lancet, 351(9103), 686.
Pina, D. G., Buljan, I., Hren, D., & Marušić, A. (2021). A retrospective analysis of the peer review of more than 75,000 Marie Curie proposals between 2007 and 2018. eLife, 10, e59338.
Author Response
We would like to thank the two reviewers for their comments. We have provided a new version of the original submission which has been revised in light of these comments. Where we have not made a revision, or where the comments are of a more general nature, we will indicate the fact in this author response.
Our response is structured around the two reviews in turn.
Reviewer 1
“[the authors’ modesty in presenting their work] does raise the question to what extent the findings can inform ongoing debates. Obviously, in such simulation studies, many simplifications and assumptions have to be made, all of which directly affect the results. Moreover, it has repeatedly been shown how social dynamics among selection panels are crucial in determining funding decisions. The simulation models currently do not attempt to capture these dynamics, potentially further reducing their applicability to real-world scenarios.”
We agree with this observation. The description of the simulation is rather reticent, as a result of our deliberate intention to avoid over-claiming on the basis of a model which, as the reviewer suggests, does not even begin to capture the richness of even the simplest review process. At this stage we would prefer to leave it to readers of the work to make their own judgement about the validity of findings. We do however welcome the support of the reviewer in suggesting that interpretation could go a little further, if that is what readers wish to do.
“Tracing back what causes these findings, which might either be artifacts of the simulation model or some inherent property of funding decision processes, seems to me the exciting part of the manuscript’s results. Perhaps the authors could incorporate some suggestions along these lines in their paper.”
We have added an additional paragraph at the end of section 4.4 which indicates an extension of the work that was not included in the document for reasons of brevity, and because it was not thoroughly investigated. We do not intend to explore these results any further and would welcome it if others were able to do so, using this or another model.
“The model used to simulate the decision process now assumes that bias will necessarily lead to a lower score for the proposal under review. I guess in a realistic scenario, bias could equally lead to higher scores. Maybe the authors could integrate this element of symmetry into their model.”
This is correct. The overall assumption is that bias, as implemented, is net negative. We have taken this approach as much of the interest in application of partial randomisation in relation to bias reduction assumes this. We have added additional explanatory text in the relevant section and altered the abstract to make it clear that the model is based on net negative bias.
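To make the distinction concrete, a minimal sketch of net-negative versus symmetric bias (the penalty size of one grade point and the 50% bias probability are assumed for illustration; this is not the simulation's actual bias mechanism):

# Illustration only: a net-negative bias shifts the mean score downwards,
# whereas a symmetric bias leaves the mean roughly unchanged.
set.seed(2)
true_score <- runif(1000, min = 1, max = 6)
is_biased  <- runif(1000) < 0.5

# Net-negative bias: affected scores can only move down
neg_score <- ifelse(is_biased, pmax(true_score - 1, 1), true_score)

# Symmetric bias: affected scores move up or down with equal probability
shift     <- sample(c(-1, 1), 1000, replace = TRUE)
sym_score <- ifelse(is_biased, pmin(pmax(true_score + shift, 1), 6), true_score)

mean(neg_score) - mean(true_score)  # clearly negative
mean(sym_score) - mean(true_score)  # close to zero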
“The authors choose to refer to the ‘reliability’ of a decision process, which is measured by the accuracy and F1 score. I wonder whether simply sticking to the term ‘accuracy’ instead of reliability would do more justice to the nature of the measure. What is being measured is the distance between the simulated decision and the simulated ground truth, for which accuracy seems to me like an appropriate label.”
This is an unfortunate result of the technical use of the word ‘accuracy’ in a classification context, where ‘accuracy’ has the specific meaning set out in the text. Accuracy refers to the specific measure, so we had to find an alternate word with no technical meaning that described the same quality. Reliability seemed to us to be the best fit as it does not, as far as we are aware, have a particular meaning in classification measurement.
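For readers less familiar with the classification-specific sense of these terms, a minimal sketch using assumed ground-truth and decision vectors (not data from the simulation):

# 'Accuracy' in the classification sense, alongside the F1 score.
# Both vectors below are assumed values for illustration only.
ground_truth <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)  # 1 = fundable, 0 = not
decision     <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0)  # simulated funding decision

tp <- sum(decision == 1 & ground_truth == 1)     # true positives
fp <- sum(decision == 1 & ground_truth == 0)     # false positives
fn <- sum(decision == 0 & ground_truth == 1)     # false negatives

accuracy  <- mean(decision == ground_truth)                 # 0.8
precision <- tp / (tp + fp)                                 # 2/3
recall    <- tp / (tp + fn)                                 # 2/3
f1        <- 2 * precision * recall / (precision + recall)  # 2/3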
“I have some doubt about one of the core assumptions made by the authors: a loss of information leads to a decrease in bias. First of all, I would argue that in this context, the information itself is not biased, but only an interpretation of information could be. But even then, I would argue that more information does not necessarily lead to more bias. Rather, it is mostly in the context of a lack of sufficient information that people tend to fall back to stereotypes or other preconditioned frames that ultimately could lead to biased decisions (e.g. only knowing that a person belongs to group X, could trigger stereotype projections of X’s characteristics on the individual, whereas more information about this person could reduce the need to fall back to such stereotype projections and hence reduce bias). Hence, the authors might want to reconsider this assumption.”
This comment reflects a lack of precision in our writing, for which we apologise and in response to which we have made what we hope are appropriate modifications in the text. When we refer to ‘information’ in this context, it is strictly the information embodied in the reviewers’ scores for a simulated proposal. The reviewer is of course correct that if there is a lack of information that allows a reviewer to assess fundability, all they have to go on is their biases.
“The manuscript repeatedly refers to simulated scores ‘reasonably closely’ matching those of real UKRI proposals. This would benefit from a little more specification.
Figure 2 gives some information on this, but without numeric values on the y-axis is somewhat difficult to interpret. Nevertheless, the figure indicates that simulated scores are particularly more likely to be in the 2-3 range than those of actual proposals. A brief discussion on the potential implications of this for the study’s findings would be helpful.”
We have adjusted the text accordingly, in both the main body and the caption to Figure 2.
“In section 4.2 [sic, actually 4.3] the authors mention that their findings suggest “that the bias factors may be somewhat too large.” This assumes that the authors had some expectation about what is a realistic divergence of the ‘ground truth’ due to bias. Can the authors elaborate on this expectation and what it is based on?”
We have added explanation of why this is our belief to the text. It centres on the lack of any real-world biases that are associated with success rate differences of this size.
“The authors, in their results section, fairly descriptively present their findings, sometimes indicating whether some result should be considered surprising or not, but otherwise largely refraining from interpreting the findings. I would like to encourage the authors to go a step further in an attempt to interpret their results. In particular, a reader might be interested to learn about (a) the extent to which the findings are (only) an artifact of choices made in the simulations and (b) how we should interpret some of the more surprising findings and what process might be causing them (e.g. in relation to figure 7).
The authors build on this latter statement in their conclusions (“The middle ground may not be a safe place to reside when it comes to implementing partial randomization”), making it all the more relevant to provide some interpretation of the results and a description of the mechanism that causes this effect.”
Having considered this helpful encouragement, we are still reluctant to go further in providing explanation of observations that may be artefacts of (the simplicity of) the simulation. The extension to section 4.4 goes some way to addressing the reviewer’s suggestion and we hope that they and other readers find it useful.
“In section 4.5 the author mention that “higher meeting reliability (F1) lead to higher mean fundability of funded proposals”. I was somewhat surprised by this statement, because the causal link seems not to be warranted here. Both the F1 and mean fundability of proposals are derivatives of the input data, rather than the one being the consequence of the other. Hence, a statement about correlation rather than causation seems more appropriate here, unless I have misunderstood the authors’ simulations.”
The reviewer is entirely correct and we have changed the text accordingly.
“The references (both in-text citations and the reference list) now only mention the first author of the cited works.”
As the target outlet for this work has no specific formatting guidelines, we have opted simply for the citation system built into MS Word.
Reviewer 2
“What I find less convincing is the argument made in Section 2.3 that re-uses are “in some ways similar to resubmission of a rejected proposal […]”. I do not find this to be a strong motivation for allowing re-uses in the simulation because it implies that resubmissions would be identical to the original submission. However, real-world resubmitted proposals are not typically identical to the original submission: they may have been updated to incorporate feedback from the previous submission; or they may be changed in order to better fit the new funding call to which they are resubmitted. To iterate, the modeling choice is not inherently problematic per se – I agree that ‘re-uses’ of the same proposal are probably inconsequential. However, I see re-uses as a mere artifact, rather than a realistic modeling choice. Therefore, if you agree with me, my advice is to remove this argument, without necessarily changing the underlying modeling assumption. The text at very top of page 7 hints at the possibility of preventing this artifact – I suppose by drawing proposals from the pool without replacement. This would be a valid solution, too.”
We have overplayed this interpretation, perhaps making a virtue out of a non-necessity, and have adjusted the text accordingly. The re-use of reviewer score sets is not problematic for the simulation as one set that is [6, 3, 3] is exactly the same as a distinct set that has the same scores. With so many in the pool to draw on, re-use has no appreciable effect but our original lack of clarity clearly is not helpful.
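For concreteness, the two sampling options amount to the following minimal sketch (the pool size, meeting size, and number of meetings are assumed for illustration, not the actual simulation parameters):

# Each meeting draws its proposals independently from the same pool, so the
# same proposal can appear in more than one meeting ('re-use').
pool_ids <- 1:100000                               # pre-generated proposal pool
meetings <- replicate(5000, sample(pool_ids, 40))  # 40 proposals per meeting
sum(duplicated(as.vector(meetings)))               # proposals re-used across meetings
# Drawing proposals without replacement across all meetings would prevent such re-use.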
“I do not think [the use of average scores to determine rankings] is a “simplification” at all. Some funders mandate that panels consider score averages in some of the steps involved in the construction of a ranking of proposals. To cite an example I’m familiar with: in some calls by Research Ireland, the peer review protocol instructs the panel quite explicitly to average scores in order to rank proposals. Quoting from the 2026 call document for their postdoctoral fellowships: “If the total average score is the same between two or more applications, applications with the same average scores will be ranked according to the higher average score under the highest weighted category”.
Furthermore, even for funders that do not explicitly mandate averaging, research shows that average score of a proposal’s grades is a very good predictor of the final funding decision (see, e.g., Pina et al. 2021). This suggests that the modeling assumption is more than a necessary simplification: rather, it can be presented as a realistic assumption that reflects how rankings are constructed – following formal or informal protocols – by panels in different research funding organizations.”
We are grateful to the reviewer for this support for the approach, and have added further information to the text to indicate that the assumption may be more realistic than we had first believed based on personal experience in UKRI processes.
“In conclusion, I would recommend citing Brezis (2007) if you agree with me that Brezis’ proposal is a better fitting precursor to partial randomization as we know it today; and to explicitly cite Greenberg (1998) if you credit their opinion piece as the oldest published example. Citing both – Greenberg for proposing lotteries and Brezis for proposing modern partial lotteries – would also be a fair solution.”
We will do just that. Unfortunately we were unable to access the original publication by Greenberg (paywalled) and so had to cite Avin’s citation. But we are happy to rely on the reviewer’s interpretation in this regard. And grateful for their insight.
“I have two comments about Figure 5. The first comment is about the level of bias in this figure. Is this figure showing all simulations from all levels of bias lumped together (vs showing simulations from a specific value of bias)? Either way, this should be made clear in the text or caption.”
Each point comes with its own level of bias, as in Figure 4. We have altered the caption of Figure 5 to make this more clear.
“I wonder whether [Figure 5] would be more effective if it was showing individual simulations rather than parameter combinations: it would then show that rare combinations are, well, rare. However, I can imagine that a scatterplot with all 100k simulations will be extremely busy: would then a heatmap (along the lines of Figure 4) be a solution?”
The reviewer has correctly inferred the reason why Figure 5 is presented as it is. It would be computationally challenging for us to present the work as suggested. But we hope that the gist of the results is still accessible to readers.
“About Section 4.4 and Figure 7. This section concludes by observing some non-linearities in the relationship between bias, randomization, and reliability – i.e., the diagonal ‘beam’. I am more than a bit intrigued by this effect. The paper does not attempt an explanation for it, yet this is a non-trivial and completely unexpected effect. I wonder whether this is a substantive result or an artifact. Can you offer a speculation, if not an explanation?”
The other reviewer made a similar statement, and we have partially addressed it as above. In general we are a bit reluctant to read much more into it, but other work suggests that it may be a feature, rather than an artefact. We do not intend to pursue any of this work further.
“A curiosity about Figure 2. If I understand correctly, this is showing how actual and simulated review scores are distributed, with vertical black lines showing how a more granular value space is discretized into the six available grades. If I understood correctly, what I’m curious about is why there is a more granular value space to begin with: why is there not a single frequency value for each of the intervals on the X-axis, but there seems to be a more continuous distribution?”
The reviewer is entirely correct. We used a continuous (density) distribution rather than the real discrete distribution, purely for aesthetic reasons. We have clarified this in the caption for Figure 2.
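As a minimal sketch of why averaged scores look quasi-continuous (the beta shape parameters and the number of criteria are assumed for illustration; this is not the actual simulation code):

# Averaging discrete per-criterion grades produces a finer-grained distribution
# than the six whole grades themselves. All parameters are assumed values.
set.seed(3)
n_reviews  <- 10000
n_criteria <- 4

latent <- matrix(rbeta(n_reviews * n_criteria, shape1 = 2, shape2 = 2),
                 nrow = n_reviews)
grades <- matrix(cut(latent, breaks = seq(0, 1, length.out = 7),
                     labels = FALSE, include.lowest = TRUE),
                 nrow = n_reviews)                 # discretise into six grades

reviewer_score <- rowMeans(grades)   # averages fall between the whole grades
hist(reviewer_score, breaks = 40)    # looks continuous, unlike the raw grades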
“Second half of page 5: “In (Feliciani 2024) the intrinsic fundability is called a reference evaluation”. The concepts of intrinsic fundability and reference evaluation are indeed similar, and in this context they refer to the same element of the simulation. However, since they are two distinct concepts, I would rephrase this sentence in a way that avoids implying they are one and the same. For example: “The ground truth for the evaluation of proposals is modeled in a similar way by Feliciani et al. (2024). They also draw from a beta distribution …”, or “Feliciani et al. (2024) do not explicitly model intrinsic fundability, relying instead on a different ground truth. Nonetheless, the ground truth in their simulations is modeled in a similar way, i.e., by drawing from a beta distribution…”, or similar.”
We have changed the text appropriately, and thank the reviewer for the clarification.
“The second paragraph of section 2.3 describes the two types of simulated meetings. The description of the second type is unclear – I struggled to see how it was not a rephrasing of what was said for the first type. In case it can be of help in debugging where I get lost: I take the word “groups” to be the same as “sets of reviews”. Thus, I think that the whole paragraph can be summarized as follows – notice how I split the second type into two: Each simulated meeting can be of three kinds: (1) a meeting where no review is biased; (2) a meeting where all reviews are biased; (3) a meeting where each review has a 50% likelihood of being biased. For each meeting of the third kind, we also simulate, in parallel, a meeting of the first or second kind – determined at random, with equal chance – with the same set of proposals.
If that sounds about right, then the first two paragraphs of section 2.3 can be considerably streamlined; else, there is a more fundamental problem understanding how the experiment was set up.”
We have reviewed the text and tried to clarify it further. There are no meetings where no reviews are biased, so type 1 above does not exist. Nor does type 2, as the first type of meeting comprises a mixture of (potentially) unbiased, partially biased and completely biased proposals. Type 3 above is correctly described by the reviewer.
“I have a few recommendations or comments about Figure 1 – the figure showing various parameterizations of the beta distribution. First, for consistency with the text, consider replacing “shape 1” and “shape 2” with α and β. This avoids the need to explain panel titles in the caption. Second, the beta distribution is used in other parts of the simulation as well – for example, to determine the funding rate of a given meeting (α = 6, β = 14). Furthermore, in case you have not considered it, I wonder whether it would help the reader to have all possible beta distribution in this figure, beyond just the distributions used for drawing review scores. I can think of arguments why not to include other uses of the beta distribution, so I suppose the figure could work either way. Third, Figure 1 is introduced in section 4.1, but mentions of the beta distribution appear much earlier in the paper. Perhaps Figure 1 would then be more useful if introduced earlier.”
We agree that switching between ‘shape 1’ and alpha is not ideal, but must admit that the difficulty of representing Greek characters in R prevented us from making the labelling more consistent. We chose to keep the different beta distributions distinct for clarity, but agree that there is something to be said for presenting all of them together. As this is a matter of preference we have decided not to change the layout. The points made about the early introduction of the concept of the beta distribution are thought-provoking. While we may have assumed a level of awareness of the beta distribution that some readers might not have, we also wish to avoid taking too much of a diversion in the text to explain this background information. We hope that the reviewer understands our choice here, and that not altering the text is acceptable to them.
“The second-last paragraph of Section 5.2. (“Given the apparent ease […]”) is particularly hard to follow due to the way it is phrased. I would recommend using a simpler sentence structure.”
A fair point – it is a bit flowery. We have re-written in a simpler way.
“The style of references – both in line and in the reference list – does not seem to comply with any standard I know (e.g., APA 7th), and there seem to be a few typos in the reference list. I recommend double-checking for both style and typos.”
Also identified by the other review and addressed as above. The typos are a feature of how the bibliography in the original Word document works, so unfortunately we are unable to correct them.
“For reproducibility, consider publicly sharing your code on an open repository such as CoMSES (comses.net) or Zenodo (zenodo.org).”
Thank you for the suggestion. We have added the code to Zenodo and included a link in the new version of the text: 10.5281/zenodo.17532196