Published at MetaROR

November 13, 2025

Cite this article as:

Hulkes, A., Brophy, C., & Steyn, B. (2025, August 28). Reliability, bias and randomisation in peer review: a simulation. https://doi.org/10.31235/osf.io/4gqce_v1

Article

Reliability, bias and randomisation in peer review: a simulation

Alex Hulkes1, Cillian Brophy2, Ben Steyn3

1 Economic and Social Research Council (ESRC), a part of UK Research and Innovation (UKRI) and the UK Metascience Unit
2 UK Metascience Unit
3 UK Metascience Unit

Originally published on August 28, 2025 at: https://doi.org/10.31235/osf.io/4gqce_v1

Abstract

For a variety of reasons, including a need to save time and a desire to reduce biases in outcomes, some funders of research have started to use partial randomisation in their funding decision processes. The effect that randomisation interventions have on the reliability of those processes should, it is argued, be a consideration in their use, but this key aspect of their implementation remains under-appreciated. Using a simple specification of a research proposal peer review process, simulations are carried out to explore the ways in which decision reliability, bias, extent of decision randomisation and other factors interact. As might be expected, based on both logic and existing knowledge, randomisation has the potential to reduce bias, but it may also reduce decision reliability as inferred from the F1 score and accuracy of a simulated binary (successful, rejected) decision outcome classification process. Bias is also found, in one sense and qualitatively, to be rather insensitive to partial randomisation as it is typically applied in real-world situations. The simple yet apparently effective specification of the simulation of reviewer scores implemented here may also provide insights into the distribution of merit across research funding proposals, and of assessment of them.

1 Introduction

Partial randomisation is a term used to describe an approach to the assessment of items (often, and particularly in the context explored here, research funding proposals) which relies on a mix of expert opinion and chance to identify those which will experience a favourable outcome.

Although it is a concept that can be applied to many areas requiring prioritisation of claims on a scarce resource1, in this work the focus is on the assessment and funding, or rejection, of research proposals.

Partial randomisation of research proposal funding decisions specifically is far from being a new idea: the earliest reference to it in an academic journal identified by Avin (2019) dates from 1998. Since then it has attracted both supporters and detractors, each deploying a range of arguments for and against based on principle, theory and evidence (Buckley Woods, 2021) (Horbach, 2022) (Davies, 2025).

Avin (Avin, 2019) provides, as a list of 11 propositions, an excellent summary of the many considerations and arguments relating to partial randomisation that a research funder should perhaps consider before making use of the approach. One of these propositions forms the conceptual starting point for this work:

“P5. A main putative shortcoming of funding by lottery is its lack of reliability, but if it comes close to or matches the reliability of peer review in some domains, then other features of the lottery will make it a more favourable selection mechanism.”

Established forms of research proposal assessment tend to prioritise a desire to determine reliably which of them are, in terms of (some conceptualisation of) intrinsic research merit, the most deserving of funding. The language changes (‘quality’, ‘excellence’, ‘merit’…) and the definitions of the terms vary, but in the end most such systems aim in some way to make it more likely that ‘better’ proposals will be funded. If such an approach cannot deliver this benefit it may make sense at least to consider alternative methods.

Many arguments favouring the use of partial randomisation rest on one of two assumptions (Avin, 2019): either that it is not possible to distinguish, in terms of their merit, between a notable fraction of proposals that are of sufficient merit to be funded or that, while it might be possible to distinguish proposals in this way, the return on the effort expended to achieve this end is too small for it to be worthwhile trying. In these situations, so the argument goes, applying some level of randomisation to funding decisions will have benefits in terms of time saved and, potentially, other matters including reduced bias in funding decisions, with little cost in terms of decreased reliability of decision making and the consequences of that outcome.

If it is literally impossible to distinguish between funding proposals a full lottery becomes, in some senses, the least biased option (Feliciani 2024). In cases where there is some limited ability to distinguish based on merit but the return on the effort needed to do this is deemed insufficient, partial randomisation may save time, making some form of that approach the preferred option.

There is some evidence (Bendiscioli, 2022) that the time saved by the use of partial randomisation might be rather small. If the time saving is small, the benefits of more traditional forms of allocation may outweigh the costs, but this depends on the return on the effort required to carry out a more comprehensive assessment.

Randomisation of proposals in competition for funding will inevitably reduce bias in decisions, because randomisation removes information from the assessment process. Where this information is irrelevant to the assessment process its removal is undoubtedly a good thing. But, unless it is impossible to derive information that is, for the purposes of assessment, relevant and useful from research funding proposals, randomisation will also remove potentially beneficial information, degrading the reliability of decisions.

The trade-off is clear, if somewhat complex: if a funder spends more time assessing individual proposals they will probably make a more reliable assessment of those proposals’ merits and may as a result end up with outcomes that are, in the long run, better. But this could be an expensive process and may introduce bias. Spend less time and make decisions based on (partial) randomisation and the results of the process are less likely to display funding decision-related bias of various kinds while potentially making the process itself less resource-intensive, but those decisions are also less likely to lead to the funding of an optimal set of projects.

This balance of competing requirements is described clearly by Feliciani, Luo and Shankar (Feliciani 2024). Their prior work is notable in demonstrating, through simulation, the ways in which the reliability of research proposal assessment might be influenced by various factors that together define a decision process. The results described here are similar enough to that work that we will draw frequently on its concepts and highlight specifically where this work approaches similar issues in different ways2.

The simulations described here address the question of how information loss resulting from the use of partial randomisation might affect decision making, both positively in terms of reduced bias and negatively in terms of reduced reliability. There are of course other potential outcomes and implications of using, or not using, partial randomisation to make funding decisions but they are not explored here. In Section 2 we provide more details on the process of simulation. Section 3 contains a brief overview of the aims of the simulations while Section 4 describes their results. Section 5 summarises the conclusions we reach on the basis of the simulations, both those relevant to partial randomisation and those of a more general nature.

2   Method

Simulations were carried out in R version 4.3.1 (R Core Team, 2023). Code used for the simulations is available at https://github.com/CillianUKRI/DPR_MetascienceAIFellowships.

2.1  Overview

The simulation process follows two broad approaches, each of which has several distinct stages.

In the first broad approach, we simulate proposals and decision processes which better reflect reality, in that the distributions of (simulated) reviewer scores relating to (simulated) proposals match those seen in real-life UK Research and Innovation (UKRI – the employer of two of the authors) peer review processes and the levels of bias are in the ranges we might anticipate.

In the second broad approach, the full space of potential bias strength is systematically explored, along with the full space of potential usage of randomisation. While this approach results in the simulation of many unrealistic review processes, it is necessary to show how the reliability of a decision process might vary across all potential decision-making spaces.

The simulations are carried out in two parts. In the first a large number of ‘proposals’ is simulated to create a pool from which proposals can be drawn for later use. In the second a large number of sets of proposals assessed in competition with each other – ‘meetings’ – is used to derive a funding decision for each of the simulated proposals. The reliability of the process overall is determined on the basis of the reliability of the individual funding decisions in a ‘meeting’.

2.2  Simulation of ‘proposals’

A simulated proposal consists of the set of scores given to that proposal by its reviewers. These scores are of two types, biased and unbiased, so that a proposal in the pool is manifested as a set of unbiased scores and, based on those unbiased scores, a set of biased scores. Depending on the purpose of the simulation, the scores used to represent a proposal in a meeting may be the biased set, the unbiased set or a mix of the two.

Simulations start with the assumption that funding proposals have an intrinsic level of merit – their ‘fundability’ – that their assessors can estimate on average reliably but with error. Here the simulations differ from those in (Feliciani 2024) which do not make this assumption, but are in accord with a number of other simulations or investigations of peer review processes that do rely on it (Feliciani 2019).

At its core the process of simulating a proposal is quite straightforward. In a way which is similar to (Feliciani 2024) a beta distribution is used as the fundability score distribution for a proposal, taking values in the interval [0, 1]. The reviews for a proposal are simulated by drawing the desired number of times from the chosen distribution. To mimic the scoring scale used in UKRI, the [0, 1] interval is divided into six equal parts so that each simulated review is assigned an integer score of 1 to 6.

The selection of the beta distribution parameters which are used to generate proposals (by generating reviewer scores) is based on a qualitative match of the results of the simulation with observed UKRI review score distributions. The beta distribution has two parameters – typically named α and β – which between them determine the mean and variance of the distribution, and hence of the simulated reviewer scores for a simulated proposal. A qualitatively good match with an observed distribution of mean reviewer scores on real UKRI proposals is obtained by allowing α to vary uniformly in the range 2 to 20 and β to vary uniformly in the range 4 to 7. The complete specification of the distribution of scores of simulated proposals can thus be written as

Xproposal_scores ~ Beta( U(2, 20), U(4, 7) )

In (Feliciani 2024) the intrinsic fundability is called a reference evaluation and is distributed as beta(10, 3), quite similar to the values used here. Drawing the reviewers’ scores directly from the specified beta distribution has the advantage of automatically confining the score between 0 and 1 and eliminating the need separately to add ‘noise’ to the reviewers’ scores, as a defined variance is a property of the underlying beta distribution itself.

To simulate a proposal, the number of draws from a beta distribution with the desired specification is determined as a Poisson variable. To match common practice, in UKRI particularly but likely across a number of funders, the mean number of simulated reviews comprising a proposal is set as 3. Where the number of reviews is fewer than 3 it is set to 3 so that overall the number of reviews comprising a proposal will be distributed as:

Xreviewer_count ~ max( 3, Poisson(3) )
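To make the two specifications above concrete, the following R sketch simulates the unbiased reviewer scores for a single proposal. It is a minimal illustration written for this description, not the published code linked above; the function and variable names are ours.

```r
# Minimal sketch: simulate one proposal's unbiased reviewer scores.
# Names are illustrative; the full code is in the linked GitHub repository.
set.seed(1)

simulate_proposal_scores <- function() {
  alpha <- runif(1, 2, 20)          # shape1, drawn uniformly as in the text
  beta  <- runif(1, 4, 7)           # shape2, drawn uniformly as in the text
  n_reviews <- max(3, rpois(1, 3))  # at least three reviews, mean of three
  raw <- rbeta(n_reviews, alpha, beta)                  # continuous scores in [0, 1]
  scores <- cut(raw, breaks = seq(0, 1, length.out = 7),
                labels = 1:6, include.lowest = TRUE)    # map to the 1-6 scale
  list(alpha = alpha, beta = beta,
       fundability = alpha / (alpha + beta),            # mean of the distribution
       scores = as.integer(as.character(scores)))
}

simulate_proposal_scores()
```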

In this simulation, bias is implemented in a somewhat simpler way than that in (Feliciani 2024). Once the set of reviewer scores, in the interval [0,1], for a simulated proposal has been created, a bias factor for that proposal is drawn from another beta distribution:

Xbias ~ Beta(6, 1.5)

The mean bias is thus 0.8, meaning that a biased review’s fundability will on average be about 80% of the fundability of an unbiased review, but with potential for considerable variation around this level.

In order to maintain the same variance when sampling biased reviewers the following re-parameterisations of the initial beta distribution are used to create a new beta distribution from which the biased review scores will be drawn:

α′ = μ′( μ′(1 − μ′)/σ² − 1 )

β′ = (1 − μ′)( μ′(1 − μ′)/σ² − 1 )

where μ′ is the mean fundability of the original beta distribution for the simulated proposal multiplied by that proposal’s bias factor, and σ² is its variance (which undergoes no change). Values of α′ and β′ are then used to create a set of biased reviewer scores by again sampling the desired number of times from the newly-specified and now suitably-biased beta distribution.
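A minimal R sketch of this biased-score step, assuming the standard method-of-moments mapping from a target mean and variance to beta parameters; the helper names are ours rather than those of the published code.

```r
# Sketch: shrink the mean by the bias factor, keep the variance, and
# re-derive beta parameters before drawing biased scores. Names are ours.
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))

# Method of moments: beta parameters with mean mu and variance sigma2
beta_from_moments <- function(mu, sigma2) {
  k <- mu * (1 - mu) / sigma2 - 1   # requires sigma2 < mu * (1 - mu)
  c(alpha = mu * k, beta = (1 - mu) * k)
}

biased_scores <- function(alpha, beta, n_reviews) {
  bias_factor <- rbeta(1, 6, 1.5)                  # mean 0.8, as in the text
  mu_new <- beta_mean(alpha, beta) * bias_factor   # shrunken mean
  sigma2 <- beta_var(alpha, beta)                  # unchanged variance
  p <- beta_from_moments(mu_new, sigma2)
  rbeta(n_reviews, p["alpha"], p["beta"])          # biased continuous scores
}
```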

The sets of biased review scores and unbiased review scores, and derived summaries, for that proposal are then available to use to create simulated assessment meetings. The process starts with the simulation of a large number of proposals (typically 500,000) which is used as a reservoir from which proposals can be drawn to create meetings.

2.3  Simulation of ‘meetings’

The process of simulating meetings makes use of two versions of the proposals simulated, to produce two different kinds of meeting, although the set of proposals drawn from is the same in both cases. The same proposal may be used as part of more than one meeting. Re-use (which is in some ways similar to resubmission of a rejected proposal on the assumption that its assessment is unchanged3) could be prevented, but with no effect on the outcomes of the simulations.

In the first type of meeting the aim is to understand how the reliability of the assessment process varies with the level of randomisation applied. For these meetings, which are intended to be as realistic as the simulation framework allows, scores comprising a proposal are a random mixture of the biased and unbiased scores generated in relation to that proposal, with equal probability. In the second type of meeting the aim is to determine differences in success rates of submitting groups which are and are not subject to bias. Here either the biased or the unbiased set of reviews (randomly determined, so that there is a 50:50 mix of proposals which are subject to bias and those which are not) is used to represent each proposal, and success rates are calculated for proposals with bias and without bias.

In both types of meeting the scores are summarised as their mean and no further information other than the level of bias associated with the set of scores and the proposal’s inherent fundability is carried forward into the simulation of meetings. Simulated proposals in a simulated meeting are ranked according to their mean reviewer score, so the scores themselves are not needed for the simulation. This is a necessary simplification of how funding decisions in real meetings might be made.

Reviewer scores are only a proxy for the full range of information that might go into creating a ranking and reaching funding decisions. In reality the considerations of meetings at which proposals are assessed can be complex (Gallo, 2023) likely reflecting averages, ranges, divergences, varying norms and the social processes surrounding their handling (Raclaw J., 2017). A more complex formulation of the role of an assessment panel could be incorporated into these simulations, but extra complexity requires additional, potentially incorrect, assumptions and so we have chosen not to do so.

Meetings are composed by selecting at random a stochastically-determined number of simulated proposals from the pool of simulated proposals. In line with common implementations of partial randomisation, at each meeting each proposal will experience one of three outcomes: it might be ‘funded outright’ because it is of the very highest merit; it might be subject to randomisation, with only some of those randomised being funded; or it might be rejected outright because it is of insufficient merit to receive funding under any circumstances.

The average number of proposals at a meeting is set as 30 so that

Nmeeting ~ Poisson(30)

Each meeting has its own success rate drawn from a beta(6, 14) distribution, to give an expected success rate across all proposals of 30%. The parameters underlying the proposal count and success rate distributions are somewhat arbitrarily chosen, but reflect likely real-world situations tolerably well. The success rate assigned to a meeting is preliminary, as only whole numbers of proposals can be funded. The preliminary rate is used to determine how many proposals will be funded (Nfunded), with the actual success rate at the meeting being:

actual success rate = Nfunded / Nmeeting

Application of partial randomisation to the meeting occurs once Nfunded is established. In some simulations, namely those which seek to simulate more realistic meetings, it is defined that approximately 50% (rounded to the nearest integer number of proposals) of all those proposals funded will be funded outright (Nfunded_outright). The other ~50% will be funded as a result of the randomisation process (Nfunded_randomly).

To determine the number of proposals which will be subject to randomisation, a random fraction of all those proposals which will not be funded outright (that is, Nmeeting − Nfunded_outright) is drawn uniformly from the interval [0, 1]. This fraction is then transformed into an integer number of proposals which will be subject to randomisation, with appropriate rounding (Nrandomised). Finally, the number of proposals that will be rejected outright is determined:

Nrejected_outright = Nmeeting − Nfunded_outright − Nrandomised
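The following R sketch, with illustrative names of our own, composes the headline counts for a single simulated meeting of this kind, before the special-case rules in Table 1 are applied.

```r
# Sketch: compose one meeting's headline numbers. Edge cases (for example a
# randomisation pool of one proposal) are handled separately by the rules
# summarised in Table 1. Names are illustrative, not the published code.
compose_meeting <- function() {
  n_meeting   <- rpois(1, 30)                       # number of proposals
  prelim_rate <- rbeta(1, 6, 14)                    # preliminary success rate
  n_funded    <- round(prelim_rate * n_meeting)     # whole proposals only
  actual_rate <- n_funded / n_meeting               # realised success rate

  n_funded_outright <- round(n_funded / 2)          # ~50% funded outright
  n_funded_randomly <- n_funded - n_funded_outright

  # A random share of everything not funded outright enters the lottery
  n_randomised <- round(runif(1) * (n_meeting - n_funded_outright))
  n_rejected_outright <- n_meeting - n_funded_outright - n_randomised

  list(n_meeting = n_meeting, actual_rate = actual_rate,
       n_funded = n_funded, n_funded_outright = n_funded_outright,
       n_funded_randomly = n_funded_randomly,
       n_randomised = n_randomised,
       n_rejected_outright = n_rejected_outright)
}
```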

As the process of simulating meetings is, by its nature, subject to chance, the rules needed to specify and characterise a meeting can be quite complicated. Various contingencies need to be accounted for, particularly if the extent of randomisation is allowed to vary, as occurs in some simulations. For example, it may be that by chance only one proposal is in the randomisation group, and so specific rules must be applied to accommodate this based on assumptions about how a funder might respond to the specific situation outlined. These rules, as implemented in the simulations, are summarised in Table 1.

[Table 1 here]

Here a direct comparison with the approach in (Feliciani 2024) may be useful. The stochastic nature of the process of composing a meeting means that, if all these outcomes are allowed, many of the ‘Types’ of selection procedures described in (Feliciani 2024) may occur in the set of meetings generated. That is, meetings may feature no randomisation at all, identify no proposals to be funded outright, may or may not reject any proposals outright, or have any composition between these extremes. In cases where the aim of the simulation is to investigate more realistic funding scenarios, types which feature no randomisation or which have complete randomisation will be excluded.

The (Feliciani 2024) concept of the ‘bypass set’ is the group of proposals funded outright (of size Nfunded_outright) while the ‘lottery pool’ is those proposals subject to randomisation (of size Nrandomised). A key difference between this work and that in (Feliciani 2024) is that in that work the size of Nrejected_outright is determined by use of the evaluation scores, as would likely happen in a real meeting. Here though it is simply defined as any proposal that is not either funded outright or subject to randomisation. This can be thought of as a simulation of the (perhaps arbitrary or at least highly qualitative and subjective) decision about where a ‘Sufficient merit’ (Feliciani 2024) threshold might be placed by a meeting’s participants.

Unlike with the simulation of proposal scores there is no benchmark against which the plausibility of simulated meetings which make use of partial randomisation can be assessed. While there are some isolated examples of the use of partial randomisation in real-life decision processes, they are few and far between and detailed data describing their composition and outcomes is not publicly available.

2.4 Determining decision outcomes

At this point it is probably helpful to give a brief summary of where in the process we have reached. A meeting comprises Nmeeting simulated proposals, each represented by i) a stochastically-determined number of reviewer scores drawn from {1, 2, 3, 4, 5, 6}, which is summarised as a mean reviewer score, and ii) a true fundability in the interval [0, 1]. For each meeting we know i) how many proposals will be funded, of which we know how many will be funded outright and how many will be funded as the result of randomisation, ii) how many will be rejected through randomisation, iii) how many will be rejected outright and iv) how many will see their funding decision subject to randomisation. These values allow us to derive, for that meeting, an overall success rate, the proportion of decisions that is subject to randomisation and indeed any other feature that might usefully describe that meeting.

The next step in the simulation process is to rank the proposals in each meeting and determine funding decisions as defined by the ranking. This is done twice for each meeting, first to establish the meeting ranking/outcome based on the reviewer scores and then to establish the ‘true’ ranking/outcome based on the inherent fundabilities, taking the same conceptual approach as (Feliciani 2024).

Proposals are first ordered by the mean of their reviewer scores. Nfunded instances of a funding decision of ‘Funded’ are added to the meeting, in the order of the ranking. Nmeeting − Nfunded instances of a funding decision of ‘Rejected’ are used to fill the remaining decisions. Then those funding decisions which exist in relation to proposals subject to randomisation are reordered randomly. In meetings with no randomisation this last step, by definition, does not happen. The meeting’s proposals are then reordered according to their inherent fundability, to give the ‘true’ ordering, with the relevant numbers of (now true) ‘Funded’ and ‘Rejected’ outcomes further added. This step allows, for each proposal in a meeting, comparison of the actual and ideal outcomes, which is essential in the next step.

In contrast to (Feliciani 2024) who use the Spearman correlation between the assessed and ideal outcomes of a meeting as the indicator of ‘epistemic correctness’ for that meeting, we use here indicators of decision reliability based on binary classification problems (Berrar, 2025). These are based on counts of true and false positive (TP and FP) and true and false negative (TN and FN) funding decisions. For example, if a proposal’s ranking based on mean reviewer score leads to a decision of ‘Funded’ while the ranking based on underlying fundability indicates that the same proposal ought to have been ‘Rejected’, the decision is recorded as being a false positive.
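As a sketch of this classification step, assuming each proposal carries a mean reviewer score, a true fundability and a flag marking membership of the randomisation pool (the function and argument names are ours):

```r
# Sketch: turn one meeting into a confusion table.
# TP = Funded/Funded, FP = Funded/Rejected, FN = Rejected/Funded, TN = Rejected/Rejected.
classify_meeting <- function(mean_score, fundability, randomised, n_funded) {
  n <- length(mean_score)

  # Decisions based on the (noisy, possibly biased) reviewer scores
  decision <- rep("Rejected", n)
  decision[order(mean_score, decreasing = TRUE)[seq_len(n_funded)]] <- "Funded"

  # Reshuffle the decisions attached to the proposals in the lottery pool
  idx <- which(randomised)
  decision[idx] <- sample(decision[idx])

  # 'True' decisions based on underlying fundability
  truth <- rep("Rejected", n)
  truth[order(fundability, decreasing = TRUE)[seq_len(n_funded)]] <- "Funded"

  table(decision = decision, truth = truth)
}
```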

Although it is simple enough to calculate a large number of metrics for the reliability of the simulated meeting process, here we use only the accuracy and the F1 metric, calculated for the meeting’s decisions as a whole and defined as:

accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 = 2TP / (2TP + FP + FN)

Note that the F1 measure ignores the true negative rate and so cannot be improved by focusing on the identification of rejected proposals. As typically there will be many more rejected proposals than there are successful proposals, this suggests that the F1 score is a reasonable choice for an indicator of decision reliability in this case.
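Expressed as a small R helper (ours, for illustration), the two metrics are computed from a meeting’s counts as follows:

```r
# Accuracy and F1 from a meeting's confusion counts; a 'positive' decision
# is a funding decision. Illustrative helper, not the published code.
reliability_metrics <- function(tp, fp, tn, fn) {
  accuracy  <- (tp + tn) / (tp + tn + fp + fn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)  # equals 2TP/(2TP+FP+FN)
  c(accuracy = accuracy, F1 = f1)
}

reliability_metrics(tp = 7, fp = 3, tn = 17, fn = 3)
```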

3 Overview of simulations

The simulations have three aims:

A1. To establish, for simulations which mimic as far as is practicable likely properties of real peer review processes, how the reliability (summarised with accuracy and F1 score) of decisions varies with the extent of randomisation of funding decisions.

A2. To explore through simulation the extent to which the use of randomisation has the potential to reduce the effects of bias in funding decisions.

A3. To explore through simulation the way in which the extent of use of randomisation and the level of bias interact to determine the reliability of the peer review process as simulated.

For A1, the simulations assume that around 50% of proposals that are funded will be funded as the result of randomised decisions. The rest of the parameters defined, and rules applied, are as already described. As the parameters can range stochastically over a wide range of values, a large number of simulated proposals and meetings is needed to ensure sufficient coverage of results.

For this reason the simulations relating to A1 draw from a pool of 500,000 proposals to compose 100,000 meetings, calculating the accuracy and F1 of each meeting.

For A2, the implementation is broadly similar except that the pool of simulated proposals is a 50:50 mix of proposals which have reviewer scores which are subject to bias (of varying levels) and those which are not. Each proposal is thus associated with membership of a group (the members of which are either subject to bias or not subject to bias). This membership allows a simple calculation and comparison of group success rates, the expectation being that the success rate of the group which is subject to bias will be lower than that of the group which is not subject to bias, with this difference tending to decrease as the use of randomisation becomes more extensive. The question is not whether this happens (it is inevitable that it will) but to what extent this happens.
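A sketch of this group comparison, assuming a vector of per-proposal outcomes and a logical flag marking the proposals whose reviews were drawn from the biased set (names are ours):

```r
# Success-rate gap between the group not subject to bias and the group
# subject to bias; expected to shrink as randomisation becomes more extensive.
success_rate_gap <- function(outcome, biased) {
  rate_unbiased <- mean(outcome[!biased] == "Funded")
  rate_biased   <- mean(outcome[biased]  == "Funded")
  rate_unbiased - rate_biased
}
```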

For A3, the simulations work systematically through a pre-determined range of levels of randomisation and bias, using otherwise realistic parameters to determine decision reliability at each randomisation-bias level. The simulation becomes more computationally demanding as the granularity of the parameters increases, but finer granularity gives a clearer view of how reliability responds as the underlying parameters change. As a compromise, both randomisation and bias are allowed to range from 0% to 100% in increments of 5%, giving 21 * 21 = 441 separate sets of simulations. Each of these simulations assembles 10,000 meetings from a pool of 100,000 simulated proposals, the pool necessarily being newly generated for each of the 441 simulations.
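The parameter sweep itself can be sketched as below; run_simulation is a placeholder for the meeting-level machinery sketched earlier, not a function of the published code.

```r
# Sketch of the A3 sweep: a 21 x 21 grid of randomisation and bias levels,
# each cell summarised by the mean meeting F1.
grid <- expand.grid(randomisation = seq(0, 1, by = 0.05),
                    bias          = seq(0, 1, by = 0.05))
nrow(grid)  # 441 separate sets of simulations

# results <- apply(grid, 1, function(g) {
#   run_simulation(n_proposals = 100000, n_meetings = 10000,
#                  randomisation = g["randomisation"], bias = g["bias"])
# })
```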

4 Results

4.1 Characteristics of simulated proposals and meetings

Figure 1 shows 16 example distributions of the kind used to generate proposal scores, with each individual facet of the plot showing the relevant values of α and β. The line shows the density of values in that particular distribution. The higher the density, the more likely it is that a score drawn from the distribution will have that value. The actual underlying fundability of a simulated proposal is defined as the mean value of the distribution (α / (α + β)), while the break points used to transform continuous scores into discrete scores on a 1-6 scale are also shown. The same review score distribution specifications are used in simulations addressing each of A1, A2 and A3.

[Figure 1 here]

Reviewer score distributions have been specified so that in aggregate the mean scores from the simulated proposals match, reasonably closely, those associated with real UKRI proposals. No attempt has been made to optimise this match in a more rigorous way. Actual (left) and simulated (right) distributions of scores are shown in Figure 2. While it is not essential that the simulated scores reflect perfectly those seen in a real peer review system, the qualitatively good match provides some reassurance that the simulated proposals will support simulation of at least moderately realistic processes.

[Figure 2 here]

Figure 3 shows the distribution of meeting award rates that is sampled from when determining how many proposals will be funded in the meeting process used for simulations addressing A1, A2 and most of A3. While the average success rate of the simulated meetings will be 30%, rates between 10% and 50% will occur quite frequently in simulations.

[Figure 3 here]

Again, this distribution of meeting success rates has no theoretical basis and is used only as a plausible baseline.

4.2 Effects of bias and randomisation on accuracy and F1

The general relationship, seen in the simulations, between reliability and the level of bias is shown in Figure 4. Both accuracy and F1 are used to summarise reliability, but Figure 4 shows little visual difference in outcomes across the two indicators.

[Figure 4 here]

The most interesting feature is the relatively clear decrease in F1 (the effect is less pronounced for accuracy) at higher levels of randomisation. There is also an apparent decrease in F1 at very low levels of randomisation, but this likely reflects a lack of data on meetings with those compositions, and also the unusual (in terms of these simulations) characteristics of meetings that have such low levels of randomisation.

Figure 5 shows how F1 and accuracy vary with the extent of randomisation alone. The removal of the effect of bias simplifies the picture somewhat.

[Figure 5 here]

As expected by logic, and as would be anticipated in light of the results of (Feliciani 2024), both F1 and accuracy decline with increasing randomisation. The F1 score appears to be more strongly influenced by the degree of randomisation than is the accuracy, as shown by the steeper slope of the regression line for F1. But the difference is slight and the imprecise nature of the simulated data suggests that it would be unwise to read too much into this observation. In addition, and as already noted, there is no perfect measure of binary classification reliability, so the fact that two different measures respond in different ways to changes in randomisation is not unexpected.

4.3  Sensitivity of award rate differences to randomisation

For A2 the simulations are used to determine how an inter-group success rate difference might vary as the extent of randomisation varies, for a fixed model of bias. Figure 6 shows the density of simulated meetings having the levels of randomisation and success rate difference indicated on the x and y axes respectively. Note that here the simulations use what are intended to be plausible parameters, so much of the space is again empty of simulated meetings.

[Figure 6 here]

The underlying fundability of proposals does not differ between the groups which are and are not subject to bias, meaning that the difference in award rates would be expected to be around (but not exactly) zero percentage points were there no bias. The simulated award rate difference is quite large, typically about 30 percentage points, although this is doubtless amplified by the specifics of the simulation, suggesting that the bias factors may be somewhat too large. The feature of interest is less the rate difference itself than the way in which it tends towards zero as the degree of randomisation increases.

If randomisation had no effect on the difference in award rates, the feature that looks, in Figure 6, like a comet’s tail would be vertical. If there were a strong effect, it would intersect the red dashed line, drawn at a zero percentage point difference in success rates between the two groups, rather quickly. What we see can reasonably be characterised as a slow drift which only just manages to reach the zero line, and then only at high levels of randomisation. This suggests that randomisation might have only a moderate effect on this measure of the reliability (or perhaps, in the terminology of (Feliciani 2024), ‘unbiased fairness’) of a decision process.

4.4 Mapping the randomisation space

In pursuit of A3, and taking advantage of the simulation’s ability to force less realistic input values, decision reliabilities across the full space of possibilities for bias and randomisation are mapped in Figure 7. The reliability indicator used is the mean of meeting F1 measures for all meetings in the relevant simulation. Results for accuracy are quite similar and so are not shown here.

[Figure 7 here]

As expected, decision reliability is at its lowest when all or nearly all proposals are randomised, as indicated by the existence of the black column at the far right of the plot. In general, the most reliable decision processes are those with the least bias and the lowest levels of randomisation. These are the points towards the top left of Figure 7. Supporting findings shown earlier in the analysis, when bias is relatively low increased use of randomisation results in a smoothly decreasing reliability. That is, when moving from left to right at the top of Figure 7, the F1 measure declines in a consistent way with no sharp drop offs.

More interesting features start to appear when reviewer bias is stronger. For example, the decline in reliability with increasing randomisation stops being smooth. Instead, there is a clear drop off at greater than ~25% randomisation, followed by a slight improvement which starts at a level of randomisation that depends on the level of bias. This manifests in Figure 7 as a diagonal ‘beam’ of improved reliability extending from top left to bottom right, and a vertical column of increased reliability towards the left.

4.5  Decision reliability and aggregate outcomes

Less reliable processes will lead to decisions which do not optimise for the intended characteristic (in this case taken to be the inherent fundability of research proposals.) These simulations may also illuminate the extent to which the reliability of a decision process matters in terms of its effect on the desired outcome of that process. Figure 8 shows the relationship between the F1 score of a meeting and the mean underlying fundability of the proposals funded at that meeting.

[Figure 8 here]

Because the underlying fundabilities of proposals, and hence of proposals funded at meetings, are determined by the parameters of the simulation the actual values in Figure 8 are of little interest. Instead it is the indicated relationship between the two variables that matters. As expected, higher meeting reliability (F1) leads to higher mean fundability of funded proposals (note that as the proposals at a meeting are sampled randomly from the same pool of proposals, on average these simulated meetings will have the same mean fundability of all their proposals.)

The increase in mean fundability as F1 increases from 0 to 1 is about 1.5 standard deviations of the data. This is, relative to the data, a substantial increase in mean fundability. It is smaller in absolute terms because the underlying distribution of proposal fundabilities (Figure 2) is rather narrow, limiting the extent of change that is possible.

5 Conclusions

5.1 Partial randomisation

Simple logic leaves no doubt that randomisation, even if applied partially, has the potential to reduce bias in funding decision processes. Randomisation implies discarding information. If that information is biased, the loss of information comes with a loss of bias. It is also, we would argue, not seriously in doubt that the use of randomisation will result in the loss of potentially useful information which could help make funding decisions more reliable4. The question for a funder of research using decision processes which rely on the submission and assessment of proposals in competition with each other is to what extent, if any, randomisation ought to be used to achieve the greatest overall benefit. The answer to that question depends on various matters, including the relative value that the funder places on reliability and broadly defined ‘fairness’ of decisions.

An out-and-out lottery (in the terms of (Feliciani 2024) ‘Type 4 processes’) in which all proposals received are placed into a randomisation pool would likely be problematic in its implementation. It could well be unpopular with many (Liu M., 2020), and such a process may well encourage submission of large numbers of proposals (because each would have the same chance of being funded.) This is why randomisation will typically be applied in a partial way so that only once the most deserving proposals are identified are other proposals funded at random.

The simulations described in this analysis support these suppositions and are broadly in agreement with the findings of (Feliciani 2024). Full randomisation may result in unreliable, if also unbiased, decision making. Partial randomisation will reduce bias but at the cost of reliability. The complete absence of randomisation exposes applicants to the full effects of whatever biases are present but maximises the use of, and return on investment in gathering, relevant information.

Being a simulation this work has a fundamental limitation. While we have made efforts to mimic a real-world peer review process, the simulation is only as good as the assumptions that underlie it. Echoing the statements made in (Feliciani 2024) and (Feliciani 2022), it should be understood that this simulation does not amount to a claim that this is what happens in reality or that the processes built into the simulation quantify what might happen in a funder’s systems. The simulation generates many numbers that could in principle be used to characterise a peer review process, but it would be overreach to state that it demonstrates that a typical peer review process has an F1 score of 0.81 (or whatever value seems to fit the simulation’s parameters best).

Another, less fundamental, limitation arises from the fact that the implementation of bias in the simulation (single source, simple) is a simplification of how bias will exist in the real world (multi-source, interacting). More complex representations of bias in the simulation suggest themselves quite readily but in the absence of their implementation it is not advisable to use these results to quantify actual bias in any way.

Those limitations aside, is there any practical advice that a research funder might be able to take away from this simulation? This of course depends on the risk appetites of the funder, but if they were willing to use it as a guide, inspection of Figure 7 suggests a rule of thumb. Under realistic conditions, it might be a good idea when using partial randomisation in a decision process that matches the assumptions of the simulation reasonably well either to limit the extent of randomisation to no more than 25% of the proposals received, or to randomise much more extensively. The middle ground may not be a safe place to reside when it comes to implementing partial randomisation.

5.2  Is there a property of funding applications that peer review is ‘measuring’?

In 2011 a paper was published which demonstrated that it was possible to simulate, quite persuasively, the hunting behaviour of grey wolves using just two simple rules (Muro 2011). The authors took pains to indicate that:

“It is not our intention to argue that wolves lack significant communicative and cognitive skills, but rather to suggest a model that can explain their behaviors without assuming that they have special abilities or hierarchical social skills.”

The work seeped into the public consciousness, in the English-speaking world at least, through popular science publications (Marshall, 2011) with the point of most interest being whether this was ‘just a simulation’ or whether the model had revealed what wolves actually do when they hunt as a pack.

In this work we have shown that simple rules defining the simulation of research proposals and their associated reviewer scores enable the reproduction of the aggregate behaviour of reviewers of research proposal applications to a major research funding agency (UKRI). Plausibly, similar rules with different parameters might be able to do the same for similar data of any funder.

Given the apparent ease with which observed applicant and reviewer behaviour can be simulated with just these assumptions, is it reasonable to believe that we might understand that behaviour in light of the terms with which the simulation is defined? That is, that three propositions hold in relation to what those who write research proposals and the peer reviewers of those proposals are actually doing: i) proposals have an inherent fundability or merit that can in principle be determined, ii) reviewers are in general able to assess this property of a proposal and would, if there were enough of them, converge on the right answer, but iii) individually they do so with varying reliability that reflects some other property or properties of the proposal itself.

Occam’s razor (Charlesworth, 1956) suggests this might be the case, but such an interpretation contrasts with the assumptions that others have made when developing similar simulations or in related work. For example, the idea that there might even be such a thing as intrinsic merit is challenged (Feliciani 2024) (Feliciani 2022) and the reliability of reviewers’ judgements (that is, their ability as a group to discern meaningfully a proposal’s inherent merit) is often questioned (Jerrim J., 2023) (Feliciani 2022). A literal interpretation of the meaning of the simulations in this work does not imply that consistency in peer review does not matter, but it does imply that it is at least not an impossible goal: there is a characteristic of research proposals that it is worth attempting to judge. However, these interpretations and questions, though intriguing, are a by-product of the main thrust of the simulations.

 

CRediT author statement: Alex Hulkes: conceptualization, methodology, software, writing – original draft, writing – review and editing, visualization, validation; Cillian Brophy: conceptualization, methodology, writing – review and editing; Ben Steyn: conceptualization, methodology, writing – review and editing.

 

Table 1. Rules used to compose simulated meetings.
Situation | Rule or outcome
Nrandomised = 1 | Define that two proposals are in the randomisation pool
Nrejected_outright < 0 | Set Nrejected_outright = 0
Nrandomised = Nmeeting | All proposals are randomised (randomisation rate = 100%)
Nrandomised = 0 | Proposals are funded based on numerical ranking only
Nfunded = 1 | Meeting results taken as indicated; no randomisation occurs
Nfunded = 2 or 3 & Nrejected_outright = 0 | One proposal funded outright, all the rest subject to randomisation to determine the remaining 1 or 2 to be funded
Nfunded = 2 or 3 & Nrejected_outright ≠ 0 | One proposal funded outright, the indicated number rejected outright and all the rest subject to randomisation to determine the remaining 1 or 2 to be funded
Nfunded at least 4, an even number & Nrejected_outright = 0 | Nfunded_outright = Nfunded/2, all remaining proposals randomised
Nfunded at least 4, an even number & Nrejected_outright ≠ 0 | Nfunded_outright = Nfunded/2, mixture of randomised and rejected outright proposals seen
Nfunded at least 4, an odd number | Nfunded_outright = Nfunded/2 rounded in unbiased fashion

 

Figure 1

Example score distributions underlying simulation of proposals and their reviews. The shape1/alpha and shape2/beta parameters are constrained to be in the ranges 2 to 20 and 4 to 7 respectively. The line indicating the pdf is drawn using ggplot2’s built-in geom_line function to join 100 evenly-spaced points along the x axis.

Figure 2

Actual and simulated distributions of mean reviewer scores for a proposal, shown as a density. Actual scores are those for UKRI Research Council applications received in calendar year 2023. Simulated scores are the unbiased scores only. The density is derived with ggplot2’s geom_density function and its default values for the underlying density() call (that is 512 points, Gaussian kernel.) In reality the distribution is discrete as proposals, real and simulated, will have only a limited range of mean scores.

Figure 3

Distribution of award rates used when simulating meetings. This is a beta(6, 14) distribution. Its expected value is 0.3 (or an award rate of 30%). The line indicating the pdf is drawn using ggplot2’s built-in geom_line function to join 100 evenly-spaced points along the x axis.
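For illustration, a plot of this kind can be reproduced with a few lines of R; this is a sketch rather than the published plotting code.

```r
# Sketch: draw the beta(6, 14) award-rate density as a line through
# 100 evenly-spaced points, as described in the caption.
library(ggplot2)

df <- data.frame(x = seq(0, 1, length.out = 100))
df$density <- dbeta(df$x, 6, 14)

ggplot(df, aes(x = x, y = density)) +
  geom_line() +
  labs(x = "Award rate", y = "Density")
```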

Figure 4

Accuracy (left) and F1 score (right) of meetings varying by degree of randomisation (y-axis) and level of bias (x-axis). Shading indicates the value of the reliability metric. Meetings with characteristics towards the edges of the data will be rarer, and so the metrics will be noisier. Tiles in the grid are based on slices of the data in 0.4% (bias) and 0.8% (randomisation) increments.

Figure 5

Relationship between extent of randomisation and accuracy (left) and F1 score (right) for simulated meetings. The regression is carried out with ggplot2’s built-in geom_smooth function and its default values for ‘lm’, but weighted by the number of meetings represented by each point, so that points with little data have less influence. While edge features are prominent visually, they carry little weight in data terms. Each data point represents one of the 0.4% by 0.8% tiles shown in Figure 4.

Figure 6

Inter-group (subject to bias, not subject to bias) award rate difference for simulated meetings. Lighter shades indicate increased density (ggplot2’s built-in geom_density_2d function) of meetings with the characteristic specified. The subject to bias group receives reviewer ratings which are on average about 80% of the ratings received by the group not subject to bias.

Figure 7

Variation of the meeting F1 measure with the extent of randomisation (x-axis) and the degree of bias (y-axis). Bias is implemented as the % of the unbiased score that a biased score will on average receive. Each combination in the 21 x 21 grid is based on the results of 10,000 simulated meetings built up from samples of 100,000 simulated proposals. Lighter shades indicate more reliable decisions/binary classifications.

Figure 8

Relationship between the mean fundability score of proposals funded at a meeting (y-axis) and the reliability of decisions made at that meeting as summarised in an F1 score (x-axis.) As the underlying simulation has many hundreds of thousands of meetings and hence data points, only 1% of data points (randomly selected) are shown in the plot. The general relationship between F1 and mean fundability of funded proposals is shown with a red line derived with ggplot2’s built-in geom_smooth function.

Notes

1 The general idea of awarding a resource in a way that mixes chance and more purposive approaches is quite ancient. Athenian democracy involved both eligibility filters and random processes to select magistrates who were considered to be representative of the population.

2 The work described here was first shared publicly in June 2025 as part of a report by the UK Metascience Unit. UK Metascience Unit (2025) A Year in Metascience. UK Department for Science, Innovation and Technology and UK Research and Innovation. DOI: 10.6084/m9.figshare.29210066.

3 It would be possible to simulate a resubmitted proposal that has been reassessed simply by drawing a new set of scores from the initial beta distribution. This refinement would not make any difference to the results seen.

4 While some might claim that peer review processes are truly random and amount to little more than decision-making theatre, this view is discounted here as being too nihilistic. Clearly it will be possible to discern the relative merits of different items, and it is only the extent to which this is feasible and worthwhile that is seriously in dispute.

References

Avin, S. (2019). Mavericks and lotteries. Studies in History and Philosophy of Science Part A, 76, 13-23. https://doi.org/10.1016/j.shpsa.2018.11.006

Bendiscioli, S. F.-B. (2022). The experimental research funder’s handbook. Research on Research Institute. https://doi.org/10.6084/m9.figshare.19459328.v5

Berrar, D. (2025). Performance Measures for Binary Classification. In C. M. Ranganathan S. (Ed.), Encyclopedia of Bioinformatics and Computational Biology (Second Edition) (Vol. 1, pp. 645-662). Elsevier. https://doi.org/10.1016/B978-0-323-95502-7.00033-6

Buckley Woods, H. W. (2021). Why draw lots? Funder motivations for using partial randomisation to allocate research grants. Report, Research on Research Institute. https://doi.org/10.6084/m9.figshare.17102495.v2

Charlesworth, M. (1956). Aristotle’s Razor. Philosophical Studies, 6(0), 105-112. https://doi.org/10.5840/philstudies1956606

Davies, C. I. (2025). Sceptics and champions: participant insights on the use of partial randomization to allocate research culture funding. Research Evaluation, 34. https://doi.org/10.1093/reseval/rvaf006

Feliciani T., L. K. (2022). Peer reviewer topic choice and its impact on interrater reliability: A mixed-method study. Quantitative Science Studies, 3(3), 832-856. https://doi.org/10.1162/qss_a_00207

Feliciani T., M. M. (2022). Designing grant-review panels for better funding decisions: Lessons from an empirically calibrated simulation model. Research Policy, 51(4). https://doi.org/10.1016/j.respol.2021.104467

Feliciani, T. L. (2019). A scoping review of simulation models of peer review. Scientometrics, 121, 555-594. https://doi.org/10.1007/s11192-019-03205-w

Feliciani, T. L. (2024). Funding lotteries for research grant allocation: An extended taxonomy and evaluation of their fairness. Research Evaluation, 33. https://doi.org/10.1093/reseval/rvae025

Gallo, S. P. (2023). A new approach to grant review assessments: score, then rank. Res Integr Peer Rev, 8. https://doi.org/10.1186/s41073-023-00131-7

Horbach, S. P. (2022). Partial lottery can make grant allocation more fair, more efficient, and more diverse. Science and Public Policy, 49(4), 580-582. https://doi.org/10.1093/scipol/scac009

Jerrim J., V. R. (2023). Are peer reviews of grant proposals reliable? An analysis of Economic and Social Research Council (ESRC) funding applications. The Social Science Journal, 60(1), 91-109. https://doi.org/10.1080/03623319.2020.1728506

Liu M., C. V. (2020). The acceptability of using a lottery to allocate research funding: a survey of applicants. Res Integr Peer Rev, 5(3). https://doi.org/10.1186/s41073-019-0089-z

Marshall, M. (2011, October 19). Retrieved July 24, 2025, from New Scientist: https://www.newscientist.com/article/mg21228354-700-wolf-packs-dont-need-to-cooperate-to-make-a-kill/

Muro C., E. R. (2011). Wolf-pack (Canis lupus) hunting strategies emerge from simple rules in computational simulations. Behavioural Processes, 28(3), 192-197. https://doi.org/10.1016/j.beproc.2011.09.006

Raclaw J., F. C. (2017). Laughter and the Management of Divergent Positions in Peer Review Interactions. J Pragmat., 113, 1-15. https://doi.org/10.1016/j.pragma.2017.03.005

R Core Team (2023). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org

Editors

Ludo Waltman
Editor-in-Chief

Ludo Waltman
Handling Editor

Editorial Assessment

by Ludo Waltman

DOI: 10.70744/MetaROR.191.1.ea

This article presents a simulation study of the use of partial randomization in research funding allocation. The article has been reviewed by two reviewers. Reviewer 1 considers the article to be timely and methodologically sound. The reviewer praises the clear presentation of the methods and results. Reviewer 2 stresses that understanding the trade-offs involved in partial randomization is becoming ever more important given the growing adoption of partial randomization by research funders. The reviewers have various recommendations for improving the article. Reviewer 1 suggests there may be an opportunity to obtain deeper insights from some of the unexpected simulation results, which could strengthen the relevance of the study for funders. Both reviewers ask for an explanation of the surprising findings presented in Figure 7. The reviewers also challenge some of the assumptions underlying the simulation model used in the study, and they offer suggestions for clarifying the interpretation of some of the results of the study.

Recommendations for enhanced transparency

  • Include in the body of the article a comprehensive data availability statement. Make publicly available the UKRI data (used in Figure 2) or explain why the data cannot be shared.
  • All source code is publicly available on GitHub. GitHub URLs are not permanent. Issue a DOI for the GitHub repository and include it in the article.
  • Add a competing interest statement. Authors should report all competing interests, including not only financial interests, but any role, relationship, or commitment of an author that presents an actual or perceived threat to the integrity or independence of the research presented in the article. If no competing interests exist, authors should explicitly state this.
  • Add a funding source statement. Authors should report all funding in support of the research presented in the article. Grant reference numbers should be included. If no funding sources exist, explicitly state this in the article.

Competing interests: None.

Peer Review 1

Serge P. J. M. Horbach

DOI: 10.70744/MetaROR.191.1.rv1

This paper addresses a timely issue that has recently received substantial attention: the potential use of partial randomisation in grant allocation procedures. Through a simulation study, it aims to better understand the implications of introducing randomisation for the reliability of the allocation process and the extent to which biases can affect funding decisions. In the simulated grant review process, the authors indeed find that increased randomisation has the potential to reduce bias, while at the same time leading to what they refer to as ‘less reliable’ decisions.

This paper has the potential to contribute to ongoing debates about innovations in grant allocation processes, particularly those related to the efficiency and fairness of such processes. The manuscript provides a decent introduction to this topic, mentioning several aspects that critically inform the debate and then explaining why it will focus on some of them. The simulation methods are described in sufficient detail and seem methodologically sound, even though it is clear that many alternative choices could have been made and the impact of making these choices is not always clear. The findings are also presented in a clear and structured manner.

My main concern in relation to the manuscript relates to the nature of the simulation approach and the interpretability of the findings. The authors themselves repeatedly point at “the imprecise nature of the simulated data” (e.g. p13) or the simplicity of their simulation model, advising readers to be cautious with using their findings in real-world contexts or outright advising not to read too much into certain observations. While the authors’ modesty in presenting their work is highly appreciated, this does raise the question of the extent to which the findings can inform ongoing debates. Obviously, in such simulation studies, many simplifications and assumptions have to be made, all of which directly affect the results. Moreover, it has repeatedly been shown how social dynamics among selection panels are crucial in determining funding decisions. The simulation models currently do not attempt to capture these dynamics, potentially further reducing their applicability to real-world scenarios.

In the conclusion, the authors themselves raise the question of what practical advice funders might take away from this simulation, concluding that it depends on their risk appetite. To me, however, the most important takeaway of the study is its pointers to phenomena of interest that require further experimentation. The authors themselves point out that some findings are as expected (which to me is mostly an indication of the appropriateness of their simulation), while others seem more surprising or counterintuitive (such as the findings in Fig 7). Tracing back what causes these findings, which might either be artifacts of the simulation model or some inherent property of funding decision processes, seems to me the exciting part of the manuscript’s results. Perhaps the authors could incorporate some suggestions along these lines in their paper.

Below I present additional specific comments and suggestions that might help the authors to further develop their manuscript:

  1. The model used to simulate the decision process now assumes that bias will necessarily lead to a lower score for the proposal under review. I guess in a realistic scenario, bias could equally lead to higher scores. Maybe the authors could integrate this element of symmetry into their model.
  2. The authors choose to refer to the ‘reliability’ of a decision process, which is measured by the accuracy and F1 score. I wonder whether simply sticking to the term ‘accuracy’ instead of reliability would do more justice to the nature of the measure. What is being measured is the distance between the simulated decision and the simulated ground truth, for which accuracy seems to me like an appropriate label.
  3. I have some doubt about one of the core assumptions made by the authors: a loss of information leads to a decrease in bias. First of all, I would argue that in this context, the information itself is not biased, but only an interpretation of information could be. But even then, I would argue that more information does not necessarily lead to more bias. Rather, it is mostly in the context of a lack of sufficient information that people tend to fall back to stereotypes or other preconditioned frames that ultimately could lead to biased decisions (e.g. only knowing that a person belongs to group X, could trigger stereotype projections of X’s characteristics on the individual, whereas more information about this person could reduce the need to fall back to such stereotype projections and hence reduce bias). Hence, the authors might want to reconsider this assumption.
  4. The manuscript repeatedly refers to simulated scores ‘reasonably closely’ matching those of real UKRI proposals. This would benefit from a little more specification. Figure 2 gives some information on this, but without numeric values on the y-axis it is somewhat difficult to interpret. Nevertheless, the figure indicates that simulated scores are particularly more likely to be in the 2-3 range than those of actual proposals. A brief discussion on the potential implications of this for the study’s findings would be helpful.
  5. In section 4.2 the authors mention that their findings suggest “that the bias factors may be somewhat too large.” This assumes that the authors had some expectation about what is a realistic divergence of the ‘ground truth’ due to bias. Can the authors elaborate on this expectation and what it is based on?
  6. The authors, in their results section, fairly descriptively present their findings, sometimes indicating whether some result should be considered surprising or not, but otherwise largely refraining from interpreting the findings. I would like to encourage the authors to go a step further in an attempt to interpret their results. In particular, a reader might be interested to learn about (a) the extent to which the findings are (only) an artifact of choices made in the simulations and (b) how we should interpret some of the more surprising findings and what process might be causing them (e.g. in relation to figure 7). The authors build on this latter statement in their conclusions (“The middle ground may not be a safe place to reside when it comes to implementing partial randomization”), making it all the more relevant to provide some interpretation of the results and a description of the mechanism that causes this effect.
  7. In section 4.5 the authors mention that “higher meeting reliability (F1) lead to higher mean fundability of funded proposals”. I was somewhat surprised by this statement, because the causal link seems not to be warranted here. Both the F1 and mean fundability of proposals are derivatives of the input data, rather than the one being the consequence of the other. Hence, a statement about correlation rather than causation seems more appropriate here, unless I have misunderstood the authors’ simulations.
  8. The references (both in-text citations and the reference list) now only mention the first author of the cited works.

Competing interests: None.

Peer Review 2

Thomas Feliciani

DOI: 10.70744/MetaROR.191.1.rv2

Hulkes and colleagues explore the important trade-off that underlies much of the academic debate on the desirability of allocating research funding by lot. On the one hand, lotteries may help curb the costs of peer review and undesirable biases in funding decisions. On the other hand, lotteries may also impair the ability of reviewers and review panels to identify and select the most promising project proposals. As more funders adopt some form of randomization in their processes, understanding this trade-off is becoming ever more important.

To date, only very few studies have tackled this issue. The paper re-examines some of the extant theoretical work with a computer simulation experiment. Crucially, the paper takes a different approach to previous work: grant peer review is modeled hinging on a different, reasonable set of assumptions, and bias and correctness of decisions are operationalized differently. This allows results to be compared across the two implementations. To the extent that results from two alternative implementations agree with one another, the validity and generalizability of these results is strengthened; and when results come apart, we learn that some of the diverging assumptions can be consequential. This paper does both: it corroborates conclusions from previous research, but it also finds new intriguing effects that were never observed or reported before.

I list my comments below, arranged by theme.

Modeling assumptions

  • To simulate a meeting, the authors draw a random set of proposals from a large pool of pre-generated proposals. It may happen, as is pointed out at the start of Section 2.3, that the same proposal is chosen for more than one meeting. This modeling approach is convincing, I think. What I find less convincing is the argument made in Section 2.3 that re-uses are “in some ways similar to resubmission of a rejected proposal […]”. I do not find this to be a strong motivation for allowing re-uses in the simulation because it implies that resubmissions would be identical to the original submission. However, real-world resubmitted proposals are not typically identical to the original submission: they may have been updated to incorporate feedback from the previous submission; or they may be changed in order to better fit the new funding call to which they are resubmitted. To reiterate, the modeling choice is not inherently problematic per se – I agree that occasional ‘re-uses’ of the same proposal are probably inconsequential. However, I see re-uses as a mere artifact, rather than a realistic modeling choice. Therefore, if you agree with me, my advice is to remove this argument, without necessarily changing the underlying modeling assumption. The text at the very top of page 7 hints at the possibility of preventing this artifact – I suppose by drawing proposals from the pool without replacement. This would be a valid solution, too.
  • While my previous point was about a modeling assumption that was justified as realistic whereas I think it is not, the situation is reversed for another modeling assumption. In the third paragraph of 2.3, a very realistic assumption is written off as a simplification. Here the text explains that the ranking of proposals and funding decisions are based on the average of the scores assigned to the proposals by reviewers. The “average” rule is introduced as a “necessary simplification”. I have two things to say in this regard. The first is that this is strictly not a “necessary” simplification – it is not necessary because it could just as easily be done in some other way.
    Secondly, I do not think this is a “simplification” at all. Some funders mandate that panels consider score averages in some of the steps involved in the construction of a ranking of proposals. To cite an example I’m familiar with: in some calls by Research Ireland, the peer review protocol instructs the panel quite explicitly to average scores in order to rank proposals. Quoting from the 2026 call document for their postdoctoral fellowships: “If the total average score is the same between two or more applications, applications with the same average scores will be ranked according to the higher average score under the highest weighted category”.
    Furthermore, even for funders that do not explicitly mandate averaging, research shows that average score of a proposal’s grades is a very good predictor of the final funding decision (see, e.g., Pina et al. 2021). This suggests that the modeling assumption is more than a necessary simplification: rather, it can be presented as a realistic assumption that reflects how rankings are constructed – following formal or informal protocols – by panels in different research funding organizations.

Idea attribution

  • I have a comment on the attribution of the idea of partial randomization in science funding. Paragraph 2 cites Avin (2019) for having identified “the oldest reference to [partial randomization] in an academic journal”. Indirectly this is crediting Greenberg’s opinion piece in The Lancet (Greenberg, 1998) for originating the idea. For one, I think that Greenberg should be credited directly by citing their work (Greenberg 1998) rather than (or in addition to) Avin who reports on it.
    Secondly, I wonder whether Greenberg really was the first to write about partial randomization in science funding – in the way we intend it today. Their 1998 proposal is indeed to distribute funding randomly among qualified researchers “whose qualifications and projects have been certified as respectable” – suggesting a partially randomized approach. However, in Greenberg’s view, this ‘respectability check’ is meant to replace peer review, and would only take “a fraction” of its cost. Arguably, many contemporary proponents of partial randomization – and, as far as I know, all funders running partial randomization – are not replacing traditional peer review with a much simpler ‘respectability check’. Partial lotteries today still involve a peer review panel that operates, for the most part, the way it always has.
    Therefore, my opinion based on how I interpret Greenberg’s words is that the idea of partial randomization in science funding – the way it is implemented today – is to be credited to Brezis (2007), whereas I would credit Greenberg (1998) for being the first to propose randomization ‘tout-court’.
    In conclusion, I would recommend citing Brezis (2007) if you agree with me that Brezis’ proposal is a better fitting precursor to partial randomization as we know it today; and to explicitly cite Greenberg (1998) if you credit their opinion piece as the oldest published example. Citing both – Greenberg for proposing lotteries and Brezis for proposing modern partial lotteries – would also be a fair solution.

Results

  • I have two comments about Figure 5. The first comment is about the level of bias in this figure. Is this figure showing all simulations from all levels of bias lumped together (vs showing simulations from a specific value of bias)? Either way, this should be made clear in the text or caption.
  • The second comment is about the unit of analysis – i.e., the points on the scatterplot – which, here, are combinations of randomization and bias levels. As the caption points out, some combinations will appear very often in the data, whereas others will only have been simulated once or a few times. The caption also points out that edge cases – those with very high or low randomization or bias – are rare, and thus have little bearing on the regression line shown in the figure. I wonder whether the same plot would be more effective if it was showing individual simulations rather than parameter combinations: it would then show that rare combinations are, well, rare. However, I can imagine that a scatterplot with all 100k simulations will be extremely busy: would then a heatmap (along the lines of Figure 4) be a solution?
  • About Section 4.4 and Figure 7. This section concludes by observing some non-linearities in the relationship between bias, randomization, and reliability – i.e., the diagonal ‘beam’. I am more than a bit intrigued by this effect. The paper does not attempt an explanation for it, yet this is a non-trivial and completely unexpected effect. I wonder whether this is a substantive result or an artifact. Can you offer a speculation, if not an explanation?

Smaller points

  • A curiosity about Figure 2. If I understand correctly, this is showing how actual and simulated review scores are distributed, with vertical black lines showing how a more granular value space is discretized into the six available grades. What I’m curious about is why there is a more granular value space to begin with: why is there not a single frequency value for each of the intervals on the X-axis, but rather what seems to be a more continuous distribution? If I had to guess, I would say that the red line is actually showing reviewer scores averaged across criteria – and because there are multiple criteria, the average score of a reviewer for a proposal is on a scale that is more granular than the 1-to-6 scale used to score each criterion. Either way, I would recommend explaining this in the caption.
  • Second half of page 5: “In (Feliciani 2024) the intrinsic fundability is called a reference evaluation”. The concepts of intrinsic fundability and reference evaluation are indeed similar, and in this context they refer to the same element of the simulation. However, since they are two distinct concepts, I would rephrase this sentence in a way that avoids implying they are one and the same. For example: “The ground truth for the evaluation of proposals is modeled in a similar way by Feliciani et al. (2024). They also draw from a beta distribution …”,  or “Feliciani et al. (2024) do not explicitly model intrinsic fundability, relying instead on a different ground truth. Nonetheless, the ground truth in their simulations is modeled in a similar way, i.e., by drawing from a beta distribution…”, or similar.
  • The second paragraph of section 2.3 describes the two types of simulated meetings. The description of the second type is unclear – I struggled to see how it was not a rephrasing of what was said for the first type. In case it can be of help to debug where I get lost here: I take the word “groups” to be the same as “sets of reviews”. Thus, I think that the whole paragraph can be summarized as follows – notice how I split the second type into two: Each simulated meeting can be of three kinds: (1) a meeting where no review is biased; (2) a meeting where all reviews are biased; (3) a meeting where each review has a 50% likelihood of being biased. For each meeting of the third kind, we also simulate, in parallel, a meeting of the first or second kind – determined at random, with equal chance – with the same set of proposals.
    If that sounds about right, then the first two paragraphs of section 2.3 can be considerably streamlined; else, there is a more fundamental problem understanding how the experiment was set up.
    Furthermore, there’s a typo at the beginning of the very next paragraph: “In both types of meeting […]”.
  • I have a few recommendations or comments about Figure 1 – the figure showing various parameterizations of the beta distribution. First, for consistency with the text, consider replacing “shape 1” and “shape 2” with α and β. This avoids the need to explain panel titles in the caption. Second, the beta distribution is used in other parts of the simulation as well – for example, to determine the funding rate of a given meeting (α = 6, β = 14). Furthermore, in case you haven’t considered it, I wonder whether it would help the reader to have all possible beta distributions in this figure, beyond just the distributions used for drawing review scores. I can think of arguments why not to include other uses of the beta distribution, so I suppose the figure could work either way. Third, Figure 1 is introduced in section 4.1, but mentions of the beta distribution appear much earlier in the paper. Perhaps Figure 1 would then be more useful if introduced earlier.
  • The second-last paragraph of Section 5.2. (“Given the apparent ease […]”) is particularly hard to follow due to the way it is phrased. I would recommend using a simpler sentence structure.
  • The style of references – both in line and in the reference list – does not seem to comply with any standard I know (e.g., APA 7th), and there seem to be a few typos in the reference list. I recommend double-checking for both style and typos.
  • For reproducibility, consider publicly sharing your code on an open repository such as CoMSES (comses.net) or Zenodo (zenodo.org).

References

Avin, S. (2019). Mavericks and lotteries. Studies in History and Philosophy of Science Part A, 76, 13–23.

Brezis, E. S. (2007). Focal randomisation: An optimal mechanism for the evaluation of R&D projects. Science and Public Policy, 34, 691–698.

Feliciani, T., Luo, J., & Shankar, K. (2024). Funding lotteries for research grant allocation: An extended taxonomy and evaluation of their fairness. Research Evaluation, 33, rvae025.

Greenberg, D. S. (1998). Chance and grants. The Lancet, 351(9103), 686.

Pina, D. G., Buljan, I., Hren, D., & Marušić, A. (2021). A retrospective analysis of the peer review of more than 75,000 Marie Curie proposals between 2007 and 2018. eLife, 10, e59338.

Competing interests: None.

Author Response

DOI: 10.70744/MetaROR.191.1.ar

We would like to thank the two reviewers for their comments. We have provided a new version of the original submission, revised in light of these comments. Where we have not made a revision, or where a comment is of a more general nature, we say so in this author response.

Our response is structured around the two reviews in turn.

Reviewer 1

“[the authors’ modesty in presenting their work] does raise the question of the extent to which the findings can inform ongoing debates. Obviously, in such simulation studies, many simplifications and assumptions have to be made, all of which directly affect the results. Moreover, it has repeatedly been shown how social dynamics among selection panels are crucial in determining funding decisions. The simulation models currently do not attempt to capture these dynamics, potentially further reducing their applicability to real-world scenarios.”

We agree with this observation. The description of the simulation is rather reticent, as a result of our deliberate intention to avoid over-claiming on the basis of a model which, as the reviewer suggests, does not even begin to capture the richness of even the simplest review process. At this stage we would prefer to leave it to readers of the work to make their own judgement about the validity of findings. We do however welcome the support of the reviewer in suggesting that interpretation could go a little further, if that is what readers wish to do.

“Tracing back what causes these findings, which might either be artifacts of the simulation model or some inherent property of funding decision processes, seems to me the exciting part of the manuscript’s results. Perhaps the authors could incorporate some suggestions along these lines in their paper.”

We have added an additional paragraph at the end of section 4.4 which indicates an extension of the work that was not included in the document for reasons of brevity, and because it was not thoroughly investigated. We do not intend to explore these results any further and would welcome it if others were able to do so, using this or another model.

“The model used to simulate the decision process now assumes that bias will necessarily lead to a lower score for the proposal under review. I guess in a realistic scenario, bias could equally lead to higher scores. Maybe the authors could integrate this element of symmetry into their model.”

This is correct. The overall assumption is that bias, as implemented, is net negative. We have taken this approach because much of the interest in applying partial randomisation to reduce bias assumes that bias is net negative. We have added additional explanatory text in the relevant section and altered the abstract to make it clear that the model is based on net negative bias.

“The authors choose to refer to the ‘reliability’ of a decision process, which is measured by the accuracy and F1 score. I wonder whether simply sticking to the term ‘accuracy’ instead of reliability would do more justice to the nature of the measure. What is being measured is the distance between the simulated decision and the simulated ground truth, for which accuracy seems to me like an appropriate label.”

This is an unfortunate consequence of the technical use of the word ‘accuracy’ in a classification context, where it has the specific meaning set out in the text. Because ‘accuracy’ refers to that specific measure, we had to find an alternative word with no technical meaning that described the same overall quality. ‘Reliability’ seemed to us the best fit, as it does not, as far as we are aware, have a particular meaning in classification measurement.
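For readers less familiar with these measures, the following minimal sketch in R shows how accuracy and F1 are computed for a binary funded/rejected outcome against a simulated ground truth. It is illustrative only and is not the released simulation code; the probabilities used are invented.

  # illustrative only: noisy binary decisions against a simulated ground truth
  set.seed(1)
  ground_truth <- rbinom(1000, 1, 0.3)                                     # 1 = 'should be funded'
  decision     <- ifelse(runif(1000) < 0.8, ground_truth, 1 - ground_truth)

  tp <- sum(decision == 1 & ground_truth == 1)
  fp <- sum(decision == 1 & ground_truth == 0)
  fn <- sum(decision == 0 & ground_truth == 1)
  tn <- sum(decision == 0 & ground_truth == 0)

  accuracy  <- (tp + tn) / length(decision)                    # the classification 'accuracy' measure
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)   # the F1 score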

“I have some doubt about one of the core assumptions made by the authors: a loss of information leads to a decrease in bias. First of all, I would argue that in this context, the information itself is not biased, but only an interpretation of information could be. But even then, I would argue that more information does not necessarily lead to more bias. Rather, it is mostly in the context of a lack of sufficient information that people tend to fall back to stereotypes or other preconditioned frames that ultimately could lead to biased decisions (e.g. only knowing that a person belongs to group X, could trigger stereotype projections of X’s characteristics on the individual, whereas more information about this person could reduce the need to fall back to such stereotype projections and hence reduce bias). Hence, the authors might want to reconsider this assumption.”

This comment reflects a lack of precision in our writing, for which we apologise; in response we have made what we hope are appropriate modifications to the text. When we refer to ‘information’ in this context, it is strictly the information embodied in the reviewers’ scores for a simulated proposal. The reviewer is of course correct that if there is a lack of information that allows a reviewer to assess fundability, all they have to go on is their biases.

“The manuscript repeatedly refers to simulated scores ‘reasonably closely’ matching those of real UKRI proposals. This would benefit from a little more specification.
Figure 2 gives some information on this, but without numeric values on the y-axis it is somewhat difficult to interpret. Nevertheless, the figure indicates that simulated scores are particularly more likely to be in the 2-3 range than those of actual proposals. A brief discussion on the potential implications of this for the study’s findings would be helpful.”

We have adjusted the text accordingly, in both the main body and the caption to Figure 2.

“In section 4.2 [sic, actually 4.3] the authors mention that their findings suggest “that the bias factors may be somewhat too large.” This assumes that the authors had some expectation about what is a realistic divergence of the ‘ground truth’ due to bias. Can the authors elaborate on this expectation and what it is based on?”

We have added an explanation of this belief to the text. It centres on the absence of any real-world biases associated with success rate differences of this size.

“The authors, in their results section, fairly descriptively present their findings, sometimes indicating whether some result should be considered surprising or not, but otherwise largely refraining from interpreting the findings. I would like to encourage the authors to go a step further in an attempt to interpret their results. In particular, a reader might be interested to learn about (a) the extent to which the findings are (only) an artifact of choices made in the simulations and (b) how we should interpret some of the more surprising findings and what process might be causing them (e.g. in relation to figure 7).
The authors build on this latter statement in their conclusions (“The middle ground may not be a safe place to reside when it comes to implementing partial randomization”), making it all the more relevant to provide some interpretation of the results and a description of the mechanism that causes this effect.”

Having considered this helpful encouragement, we are still reluctant to go further in providing explanation of observations that may be artefacts of (the simplicity of) the simulation. The extension to section 4.4 goes some way to addressing the reviewer’s suggestion and we hope that they and other readers find it useful.

“In section 4.5 the authors mention that “higher meeting reliability (F1) lead to higher mean fundability of funded proposals”. I was somewhat surprised by this statement, because the causal link seems not to be warranted here. Both the F1 and mean fundability of proposals are derivatives of the input data, rather than the one being the consequence of the other. Hence, a statement about correlation rather than causation seems more appropriate here, unless I have misunderstood the authors’ simulations.”

The reviewer is entirely correct and we have changed the text accordingly.

“The references (both in-text citations and the reference list) now only mention the first author of the cited works.”

As the target outlet for this work has no specific formatting guidelines, we have opted simply for the citation system built into MS Word.

Reviewer 2

“What I find less convincing is the argument made in Section 2.3 that re-uses are “in some ways similar to resubmission of a rejected proposal […]”. I do not find this to be a strong motivation for allowing re-uses in the simulation because it implies that resubmissions would be identical to the original submission. However, real-world resubmitted proposals are not typically identical to the original submission: they may have been updated to incorporate feedback from the previous submission; or they may be changed in order to better fit the new funding call to which they are resubmitted. To reiterate, the modeling choice is not inherently problematic per se – I agree that ‘re-uses’ of the same proposal are probably inconsequential. However, I see re-uses as a mere artifact, rather than a realistic modeling choice. Therefore, if you agree with me, my advice is to remove this argument, without necessarily changing the underlying modeling assumption. The text at the very top of page 7 hints at the possibility of preventing this artifact – I suppose by drawing proposals from the pool without replacement. This would be a valid solution, too.”

We have overplayed this interpretation, perhaps making a virtue out of a non-necessity, and have adjusted the text accordingly. The re-use of reviewer score sets is not problematic for the simulation, as a score set of [6, 3, 3] is exactly the same as a distinct set with the same scores. With so many sets in the pool to draw on, re-use has no appreciable effect, but our original lack of clarity was not helpful.
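To illustrate the point, a minimal sketch is below; the object names, pool size, sample size and beta shape parameters are invented for illustration, and this is not the code released on Zenodo. Sampling with replacement allows occasional re-use of a score set; replace = FALSE would prevent it.

  pool_size <- 100000
  # each pooled proposal carries an intrinsic fundability drawn from a beta
  # distribution; the shape parameters here are illustrative only
  pool <- data.frame(id = seq_len(pool_size),
                     fundability = rbeta(pool_size, 2, 5))
  meeting <- pool[sample(pool_size, size = 40, replace = TRUE), ]   # re-use possible
  # meeting <- pool[sample(pool_size, size = 40, replace = FALSE), ]  # re-use prevented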

“I do not think [the use of average scores to determine rankings] is a “simplification” at all. Some funders mandate that panels consider score averages in some of the steps involved in the construction of a ranking of proposals. To cite an example I’m familiar with: in some calls by Research Ireland, the peer review protocol instructs the panel quite explicitly to average scores in order to rank proposals. Quoting from the 2026 call document for their postdoctoral fellowships: “If the total average score is the same between two or more applications, applications with the same average scores will be ranked according to the higher average score under the highest weighted category”.

Furthermore, even for funders that do not explicitly mandate averaging, research shows that average score of a proposal’s grades is a very good predictor of the final funding decision (see, e.g., Pina et al. 2021). This suggests that the modeling assumption is more than a necessary simplification: rather, it can be presented as a realistic assumption that reflects how rankings are constructed – following formal or informal protocols – by panels in different research funding organizations.”

We are grateful to the reviewer for this support for the approach, and have added further information to the text to indicate that the assumption may be more realistic than we had first believed, our initial view having been based on personal experience of UKRI processes.
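As a minimal sketch of this averaging rule (the number of proposals, the number of reviews per proposal and the funding rate are illustrative, not the paper’s parameters), proposals can be ranked on the mean of their reviewers’ scores and the top-ranked share funded:

  reviews <- data.frame(proposal = rep(1:40, each = 3),              # 3 reviews per proposal
                        score    = sample(1:6, 120, replace = TRUE)) # grades on a 1-6 scale
  means   <- aggregate(score ~ proposal, data = reviews, FUN = mean) # mean score per proposal
  ranked  <- means[order(-means$score), ]                            # rank by mean score
  funded  <- head(ranked, ceiling(0.3 * nrow(ranked)))               # fund the top 30% (illustrative rate)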

“In conclusion, I would recommend citing Brezis (2007) if you agree with me that Brezis’ proposal is a better fitting precursor to partial randomization as we know it today; and to explicitly cite Greenberg (1998) if you credit their opinion piece as the oldest published example. Citing both – Greenberg for proposing lotteries and Brezis for proposing modern partial lotteries – would also be a fair solution.”

We will do just that. Unfortunately we were unable to access the original publication by Greenberg (it is paywalled) and so had to cite Avin’s citation of it. We are happy to rely on the reviewer’s interpretation in this regard, and are grateful for their insight.

“I have two comments about Figure 5. The first comment is about the level of bias in this figure. Is this figure showing all simulations from all levels of bias lumped together (vs showing simulations from a specific value of bias)? Either way, this should be made clear in the text or caption.”

Each point comes with its own level of bias, as in Figure 4. We have altered the caption of Figure 5 to make this clearer.

“I wonder whether [Figure 5] would be more effective if it was showing individual simulations rather than parameter combinations: it would then show that rare combinations are, well, rare. However, I can imagine that a scatterplot with all 100k simulations will be extremely busy: would then a heatmap (along the lines of Figure 4) be a solution?”

The reviewer has correctly inferred the reason why Figure 5 is presented as it is. It would be computationally challenging for us to present the work as suggested. But we hope that the gist of the results is still accessible to readers.

“About Section 4.4 and Figure 7. This section concludes by observing some non-linearities in the relationship between bias, randomization, and reliability – i.e., the diagonal ‘beam’. I am more than a bit intrigued by this effect. The paper does not attempt an explanation for it, yet this is a non-trivial and completely unexpected effect. I wonder whether this is a substantive result or an artifact. Can you offer a speculation, if not an explanation?”

The other reviewer made a similar statement, and we have partially addressed it as above. In general we are a bit reluctant to read much more into it, but other work suggests that it may be a feature, rather than an artefact. We do not intend to pursue any of this work further.

“A curiosity about Figure 2. If I understand correctly, this is showing how actual and simulated review scores are distributed, with vertical black lines showing how a more granular value space is discretized into the six available grades. What I’m curious about is why there is a more granular value space to begin with: why is there not a single frequency value for each of the intervals on the X-axis, but rather what seems to be a more continuous distribution?”

The reviewer is entirely correct. We used a continuous (density) distribution rather than the real discrete distribution, purely for aesthetic reasons. We have clarified this in the caption for Figure 2.
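For illustration only (this is not the plotting code used for Figure 2; the three-reviews-per-proposal assumption and the grade markers are invented for the sketch): averaging a small number of discrete 1-6 grades per proposal yields a finer-grained distribution, which can be shown either as a histogram or as a smoothed density of the kind used in the figure.

  # averaged discrete grades produce a quasi-continuous distribution
  mean_scores <- replicate(5000, mean(sample(1:6, 3, replace = TRUE)))
  hist(mean_scores, breaks = 30, freq = FALSE, xlab = "mean reviewer score", main = "")
  lines(density(mean_scores))   # the smoothed alternative, as used in Figure 2
  abline(v = 1:6)               # illustrative grade markers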

“Second half of page 5: “In (Feliciani 2024) the intrinsic fundability is called a reference evaluation”. The concepts of intrinsic fundability and reference evaluation are indeed similar, and in this context they refer to the same element of the simulation. However, since they are two distinct concepts, I would rephrase this sentence in a way that avoids implying they are one and the same. For example: “The ground truth for the evaluation of proposals is modeled in a similar way by Feliciani et al. (2024). They also draw from a beta distribution …”,  or “Feliciani et al. (2024) do not explicitly model intrinsic fundability, relying instead on a different ground truth. Nonetheless, the ground truth in their simulations is modeled in a similar way, i.e., by drawing from a beta distribution…”, or similar.”

We have changed the text appropriately, and thank the reviewer for the clarification.

“The second paragraph of section 2.3 describes the two types of simulated meetings. The description of the second type is unclear – I struggled to see how it was not a rephrasing of what was said for the first type. In case it can be of help in debugging where I get lost: I take the word “groups” to be the same as “sets of reviews”. Thus, I think that the whole paragraph can be summarized as follows – notice how I split the second type into two: Each simulated meeting can be of three kinds: (1) a meeting where no review is biased; (2) a meeting where all reviews are biased; (3) a meeting where each review has a 50% likelihood of being biased. For each meeting of the third kind, we also simulate, in parallel, a meeting of the first or second kind – determined at random, with equal chance – with the same set of proposals.
If that sounds about right, then the first two paragraphs of section 2.3 can be considerably streamlined; else, there is a more fundamental problem understanding how the experiment was set up.”

We have reviewed the text and tried to clarify it further. There are no meetings where no reviews are biased, so type 1 above does not exist. Nor does type 2, as the first type of meeting comprises a mixture of (potentially) unbiased, partially biased and completely biased proposals. Type 3 above is correctly described by the reviewer.
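For the kind of meeting that the reviewer describes correctly, in which each review is independently biased with 50% probability, a minimal sketch is below. The bias mechanism shown (a one-grade downward shift, floored at the lowest grade) is illustrative only and is not the implementation used in the paper.

  n_reviews <- 120
  scores    <- sample(1:6, n_reviews, replace = TRUE)   # unbiased scores on the 1-6 scale
  is_biased <- runif(n_reviews) < 0.5                   # 50% chance of bias per review
  biased    <- ifelse(is_biased, pmax(scores - 1, 1), scores)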

“I have a few recommendations or comments about Figure 1 – the figure showing various parameterizations of the beta distribution. First, for consistency with the text, consider replacing “shape 1” and “shape 2” with α and β. This avoids the need to explain panel titles in the caption. Second, the beta distribution is used in other parts of the simulation as well – for example, to determine the funding rate of a given meeting (α = 6, β = 14). Furthermore, in case you have not considered it, I wonder whether it would help the reader to have all possible beta distributions in this figure, beyond just the distributions used for drawing review scores. I can think of arguments why not to include other uses of the beta distribution, so I suppose the figure could work either way. Third, Figure 1 is introduced in section 4.1, but mentions of the beta distribution appear much earlier in the paper. Perhaps Figure 1 would then be more useful if introduced earlier.”

We agree that switching between ‘shape 1’ and alpha is not ideal, but must admit that the difficulty of representing Greek characters in R prevented us from making the labelling more consistent. We chose to keep the different beta distributions distinct for clarity, but agree that there is something to be said for presenting all of them together. As this is a matter of preference, we have decided not to change the layout. The points made about the early introduction of the concept of the beta distribution are thought-provoking. While we may have assumed a level of awareness of the beta distribution that some readers might not have, we also wish to avoid taking too much of a diversion in the text to explain this background information. We hope that the reviewer understands our choice here, and that not altering the text is acceptable to them.
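For readers who want to see the shapes under discussion, a minimal base-R sketch is below. The Beta(6, 14) curve is the funding-rate distribution the reviewer mentions; the second curve uses illustrative shape parameters rather than the paper’s review-score values.

  x <- seq(0, 1, length.out = 200)
  plot(x, dbeta(x, 6, 14), type = "l",
       xlab = "value", ylab = "density")   # funding-rate distribution, Beta(6, 14)
  lines(x, dbeta(x, 2, 2), lty = 2)        # illustrative review-score shape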

“The second-last paragraph of Section 5.2. (“Given the apparent ease […]”) is particularly hard to follow due to the way it is phrased. I would recommend using a simpler sentence structure.”

A fair point – it is a bit flowery. We have rewritten it in a simpler way.

“The style of references – both in line and in the reference list – does not seem to comply with any standard I know (e.g., APA 7th), and there seem to be a few typos in the reference list. I recommend double-checking for both style and typos.”

Also identified by the other reviewer and addressed as above. The typos are a feature of how the bibliography in the original Word document works, so unfortunately we are unable to correct them.

“For reproducibility, consider publicly sharing your code on an open repository such as CoMSES (comses.net) or Zenodo (zenodo.org).”

Thank you for the suggestion. We have added the code to Zenodo and included a link in the new version of the text: 10.5281/zenodo.17532196
