Modeling the global citation network using the scalable agent-based simulator for citation analysis with recency-emphasized sampling (SASCA-ReS)

Minhyuk Park¹, Haotian Yi¹, Tandy Warnow¹, George Chacko¹

¹* Siebel School of Computing and Data Science, Grainger College of Engineering, University of Illinois Urbana-Champaign, 201 N. Goodwin Ave, Urbana, 61801, IL, USA.

Originally published on November 30, 2025 at:

https://doi.org/10.5281/zenodo.17772113

Abstract

Preprint of a manuscript developed by Park, M; Yi, H; Warnow, T; and Chacko, G concerning very large scale simulations of the growth of citation networks using an agent based modeling approach. A document containing supplemental material accompanies the manuscript. The manuscript contains a link to a public Github repository where the code generated for this project is available. Data generated during this project is being made available via the Illinois Data Bank under repository IDB-9265079.

Full text

1 Introduction

We conservatively estimate the body of scientific literature to be on the order of 100-300 million articles [1–3] today: a remarkable increase of around six orders of magnitude over the 360 years since 1665, when the first scientific journal published fewer than a hundred articles in its inaugural year [4]. Representing this collection as a citation network with nodes (articles) connected by directed edges (citations) [5–8] enables scientometric studies from historical, epistemological, sociological, and graph-theoretic perspectives. Within this area of investigation, we are interested in understanding how the citations made by authors influence the structure of large citation networks. Given the incentives for authors to want to be well-cited, we are also interested in strategies that influence the accumulation of citations at the level of an individual article.

Preceding theories of citation offer some insight into the motivations of researchers that translate into citation behavior [9–13]. These motivations can be coarsely grouped into normative and rhetorical types, with the additional motivation of malpractice being worthy of mention today [14, 15]. However, motivations are not easily measured and are unlikely to be mutually exclusive. It seems more likely that each citation made by an author comprises multiple motivations that vary with time and context. Thus, quantitatively analyzing motivations is challenging. Nevertheless, citation behavior has been modeled on stylized normative and rhetorical behavior, although at a small scale and for the purpose of studying effects on “community health” [16].

Rather than attempt to quantify overlapping motivations, it is possible to encode the patterns of citation that have been previously observed and observe their effects on a growing network. In this respect, mathematical models of network growth [17–19] have been proposed that incorporate randomness, preferential attachment [20, 21], recency (immediacy) [5, 22], as well as fitness (epistemic quality) [23, 24].

In an approach complementary to mathematical modeling of citation dynamics, we have previously developed two idealized agent-based models (ABMs): PyABM [25], written in Python, followed by SASCA-s [26] (Scalable Agent-based Simulator for Citation Analysis with sampling), written in C++. In these models, publications are represented as nodes in citation networks and citations by directed edges.

During a simulation, batches of autonomous agents are created in time-steps of a year and make citations to existing nodes after which they serve as targets for other agents. The basis for selecting a target node is a composite formulation of the target’s features, its distance from the generator node, and the agent’s “citation phenotype”, which is a weighted combination of its bias towards preferential attachment, recency, and fitness. Thus, given an input of a “seed” network, the output is a larger citation network resulting from letting the random process operate over a period of years.

In the PyABM model, each new agent evaluates every other node in the network before selecting nodes to cite. This approach becomes expensive as the network grows and limits PyABM simulations to roughly 1-2 million nodes. SASCA-s addresses this limitation by having each new agent score only a random sample of the nodes within the 2-hop neighborhood of its generator node (i.e., the nodes at distance 1 or 2 from the generator); this approach, along with the benefits of a compiled language, allowed SASCA-s to scale to more than 100,000,000 nodes. A third generation of the model, SASCA-ReS (Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling), improves recency modeling by constraining the distribution of citations made by each agent to fit an empirically derived “recency table” that we computed from a citation network. This modification provides the same scalability advantage as SASCA-s but with a better match to several empirical properties of citation networks. The results presented in this manuscript are from the use of SASCA-ReS.

Of the various ways in which ABMs can be used [27], we use our model to support reasoning rather than to replicate real-world observations. We stress that SASCA-ReS is an idealized model in which realism is incorporated without attempting to copy it in detail. This offers advantages in terms of model parsimony and computational cost. Apart from the small number of variables, we have made additional simplifying assumptions: First, that every agent represents an article written by a single author and that no author has contributed more than one article to the network. Second, that we do not consider disciplinary habits or trends in citation behavior over time. The scenarios illuminated by the simulations reported below are exploratory and may or may not map to corresponding mechanisms of network growth in the real-world.

In the sections that follow, we report results from simulations under varied conditions resulting in networks ranging from just over a million nodes to more than 100 million nodes. We focus both on network structure and on the accumulation of citations by individual nodes. Throughout this manuscript, in degree refers to citations received by a node, while out degree is synonymous with references, i.e., citations made to other nodes.

2 Materials and Methods

2.1 Data

Simulations shown in this manuscript used one of three different seed networks as input: (i) the Stahl-Johnson (sj) network consisting of 491,532 nodes and 899,051 edges, which is derived from a real-world network from the biomedical literature and has been previously described in [25]; (ii) the sj er network, an Erdős–Rényi network modelled after the sj network and generated using the sample_gnm function in the R igraph package with n = 491,532 and m = 899,051, after which nodes without edges were dropped; and (iii) the g10x10 network, a lattice network with 100 nodes arranged in a square grid with edges directed from one corner to the opposite corner in an acyclic pattern. Data from the use of sj er are shown in Supplemental Materials.

	# nodes	# edges	Reference
sj	491,532	899,051	[26]
sj er	479,027	899,051	[26]
g10x10	100	180

Table 1 Seed networks
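
The node and edge counts of the g10x10 seed can be checked with a short sketch. It assumes that "directed from one corner to the opposite corner" means every grid node points to its right-hand and lower neighbours; the function name `lattice_edges` is hypothetical and not part of the SASCA-ReS code.

```python
def lattice_edges(side=10):
    """Directed edges of a side x side grid, oriented away from corner (0, 0)."""
    edges = []
    for row in range(side):
        for col in range(side):
            node = row * side + col
            if col + 1 < side:              # edge to the right-hand neighbour
                edges.append((node, node + 1))
            if row + 1 < side:              # edge to the neighbour below
                edges.append((node, node + side))
    return edges

edges = lattice_edges(10)
nodes = {v for edge in edges for v in edge}
print(len(nodes), len(edges))  # 100 180
```

Because every edge runs from a lower to a higher node index, the orientation cannot contain a cycle, and the edge count agrees with the 180 edges listed in Table 1.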

The real-world networks used for comparison have been previously described: the Curated Exosome Network (cen) and the Open Citations (oc) network, with 13,989,436 and 75,025,194 nodes respectively [28], and the network extracted from OpenAlex data (oa), with 111,453,719 nodes [29].

2.2 SASCA-ReS

Here we describe SASCA-ReS in more detail. Additional details are provided in Supplementary Materials and the code for SASCA-ReS is freely available from a Github site [30]. The design of SASCA-ReS incorporates a mid-level modeling approach where parameter settings for recency and out degree quota are drawn from real-world data in order to focus on more likely rather than formally possible outcomes [31].

Each agent is created by vertex copying [32, 33] from a randomly selected node in the network, termed the “generator”. When created, each agent is assigned a quota of citations to make that is randomly drawn from a real-world distribution of reference counts (see out degree distribution below) and randomly assigned a fitness value ranging from 1-1000 that is drawn from a power law distribution. Thus, every agent has a timestamp for when it was created (year), its fitness, and a quota of citations to make as attributes. Each agent within a time-step makes citations independently of other agents and the network is updated at the end of the time step. The duration of simulation and number of agents added each year are controlled by environment parameters set by the user for count of years and growth rate as a percentage of the number of nodes in the network.
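
The creation of one year's batch of agents can be sketched as follows. This is an illustration under stated assumptions rather than the SASCA-ReS implementation: the power law exponent of 2.0, the dictionary layout, and the names `create_agents` and `reference_counts` are all hypothetical, since the text above specifies only the fitness range, the power law form, and the empirical source of the quota.

```python
import random

# Assumed values for illustration only; the source does not state the exponent.
POWER_LAW_EXPONENT = 2.0
FITNESS_VALUES = list(range(1, 1001))
FITNESS_WEIGHTS = [f ** -POWER_LAW_EXPONENT for f in FITNESS_VALUES]

def create_agents(rng, n_nodes, growth_rate, year, reference_counts):
    """Create one year's batch of agents for a network with n_nodes nodes."""
    n_new = round(n_nodes * growth_rate)            # growth rate as a fraction
    agents = []
    for _ in range(n_new):
        agents.append({
            "year": year,
            "generator": rng.randrange(n_nodes),    # uniformly chosen existing node
            "fitness": rng.choices(FITNESS_VALUES, weights=FITNESS_WEIGHTS)[0],
            "quota": rng.choice(reference_counts),  # draw from empirical counts
        })
    return agents

batch = create_agents(random.Random(42), n_nodes=1000, growth_rate=0.03,
                      year=1990, reference_counts=[12, 25, 40, 103])
print(len(batch), batch[0])
```

Passing an explicit `random.Random` instance keeps the sketch reproducible and makes it straightforward to give each parallel worker its own stream.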

Each agent has a parameter α that defines the fraction of its citations within the 2-hop neighborhood that go to the 1-hop neighborhood of its generator; thus, α = 1 means that all its citations (other than the citation to the generator and the idiosyncratic citations that go outside the 2-hop neighborhood) are to nodes in the 1-hop neighborhood.

We computed a recency table (Supplementary Materials) based on references in publications in a real-world citation network [8]. This recency table defines the expected fraction of citations that an agent makes to papers published n years ago, for each n ≥ 0. Similarly, we constructed an out degree distribution array from recent PubMed data, which is described in [26]. Given its out degree quota and the recency table, we derive for each agent the expected number of citations to papers from each year. Since achieving this distribution exactly is not always possible, we redistribute citations to publications in adjacent years when necessary. This technique provides a good fit to the distribution defined by the recency table.
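
One way to turn a quota and a recency table into per-year citation counts is sketched below. The largest-remainder rounding and the nearest-year spill rule are assumptions chosen for illustration; the actual redistribution procedure is described in Supplementary Materials, and the names `allocate_by_recency`, `recency`, and `available` are hypothetical.

```python
def allocate_by_recency(quota, recency, available):
    """Split `quota` citations across ages 0..len(recency)-1.

    recency[a]   : target fraction of citations to papers published a years ago
    available[a] : number of candidate papers of that age
    """
    # Largest-remainder rounding of the expected (fractional) counts.
    expected = [quota * p for p in recency]
    counts = [int(e) for e in expected]
    by_remainder = sorted(range(len(recency)),
                          key=lambda a: expected[a] - counts[a], reverse=True)
    for a in by_remainder[: quota - sum(counts)]:
        counts[a] += 1

    # Move any excess over the available candidates to the nearest age with room.
    for a in range(len(counts)):
        excess = counts[a] - available[a]
        if excess <= 0:
            continue
        counts[a] = available[a]
        for b in sorted(range(len(counts)), key=lambda b: abs(b - a)):
            room = available[b] - counts[b]
            if b != a and room > 0 and excess > 0:
                moved = min(room, excess)
                counts[b] += moved
                excess -= moved
    return counts

# Only two candidates exist at age 1, so two citations spill over to age 2.
print(allocate_by_recency(10, [0.0, 0.4, 0.3, 0.2, 0.1], [0, 2, 50, 50, 50]))
# [0, 2, 5, 2, 1]
```

If the whole network holds fewer candidates than the quota, this sketch simply returns fewer citations than requested.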

To account for citations of an idiosyncratic [34, p. 203] nature, a small fraction of citations made by agents is made to nodes selected by uniform random sampling from the entire network.

Since recency is now managed externally (as described above), the citation phenotype for an agent is a weighted composite of only two attributes—preferential attachment and fitness. In contrast, the citation phenotypes for SASCA-s and PyABM also included a weight for recency, which is not relevant for SASCA-ReS. Thus, in SASCA-ReS, the phenotype for an agent node is given by the weights it attaches to preferential attachment and fitness, where the weights add up to 1.0. Agent phenotypes may be randomly assigned (ra) from a uniform distribution of possible phenotypes or fixed such that every agent has exactly the same phenotype.
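
A two-weight phenotype of this kind can be illustrated with a small scoring function. The exact scoring formula is not given in this section, so the max-normalisation of in degree and fitness below is an assumption, as are the names `score_candidates` and `pa_weight`.

```python
def score_candidates(candidates, pa_weight):
    """Score candidate targets for one agent.

    candidates: list of (node_id, in_degree, fitness) tuples
    pa_weight : weight on preferential attachment; fitness gets 1 - pa_weight
    """
    fit_weight = 1.0 - pa_weight                 # the two weights sum to 1.0
    max_deg = max(d for _, d, _ in candidates) or 1
    max_fit = max(f for _, _, f in candidates) or 1
    # Assumed normalisation: divide each attribute by its maximum in the sample.
    return {
        node: pa_weight * (deg / max_deg) + fit_weight * (fit / max_fit)
        for node, deg, fit in candidates
    }

sample = [("well_cited", 100, 1), ("high_fitness", 10, 1000)]
print(score_candidates(sample, pa_weight=0.9))
print(score_candidates(sample, pa_weight=0.1))
```

With a weight of 0.9 the well-cited node scores higher, and with a weight of 0.1 the high-fitness node does, which is the trade-off the phenotype is meant to express.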

Finally, like SASCA-s, SASCA-ReS samples and scores at most 20,000 nodes in the 2-hop neighborhood of an agent’s generator node (that is, at most 10,000 nodes at distance 1 and 10,000 nodes at distance 2 from the generator). The small fraction of nodes that are cited from outside the 2-hop neighborhood are selected at random and not scored. In this way, SASCA-ReS achieves the same scalability as SASCA-s while obtaining a better fit to the observed recency distribution of citation networks.
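
The capped neighborhood sampling and the role of α can be sketched as below. The sketch assumes that the neighborhood follows outgoing citation edges only and that the 1-hop share is rounded to the nearest integer; neither detail is stated here, and the names `sample_two_hop` and `split_quota` are hypothetical.

```python
import random

def sample_two_hop(rng, out_neighbors, generator, cap=10_000):
    """Return capped random samples of the nodes at distance 1 and 2 from `generator`."""
    hop1 = set(out_neighbors.get(generator, ())) - {generator}
    hop2 = set()
    for v in hop1:
        hop2.update(out_neighbors.get(v, ()))
    hop2 -= hop1 | {generator}                 # keep only nodes at distance exactly 2

    def pick(nodes):
        return rng.sample(sorted(nodes), min(cap, len(nodes)))

    return pick(hop1), pick(hop2)

def split_quota(neighborhood_quota, alpha):
    """Citations sent to the 1-hop and 2-hop neighborhoods for a given alpha."""
    to_hop1 = round(alpha * neighborhood_quota)
    return to_hop1, neighborhood_quota - to_hop1

graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5]}
hop1, hop2 = sample_two_hop(random.Random(1), graph, generator=0)
print(sorted(hop1), sorted(hop2))   # [1, 2] [3, 4]
print(split_quota(20, 0.95))        # (19, 1)
```

Because the cap bounds the number of scored nodes per agent, the cost of one agent no longer depends on the size of the network.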

2.3 Simulations

SASCA-ReS simulations were conducted by passing four input files and a configuration file created by a user to the SASCA-ReS code. Additional details on how to perform simulations are available from the Github repository where the SASCA-ReS code is archived as open source [30]. The configuration file requires 23 entries, of which the first 11 control the simulation at the environment level, the next 7 specify the characteristics of individual agents, and the remaining 5 manage parallelism, logging, and the location of output files (Supplementary Materials).

All simulations were performed on a compute cluster with up to 128 cores of parallelism and 512 GB of memory. Simulations were seeded with citation networks, Erdős–Rényi graphs, or lattice graphs, with the fitness of seed nodes set to 1 in each case. A “standard” simulation used the Stahl-Johnston seed network (see Data) of around 500,000 nodes, run for 30 years at a growth rate of 3% using 16 cores of parallelism, which results in an output network of around 1.2 million nodes in 3-4 minutes. The default environment randomized agent phenotypes (ra) with respect to preferential attachment weight (pa) and α, but simulations were also conducted in partially static agent environments where either α or the preferential attachment weight was fixed for all agents. For a subset of the experiments reported here, simulations were scaled to (i) 10-20 million nodes, (ii) 70-80 million nodes, and (iii) 100-250 million nodes.

The largest network described in this manuscript has roughly 161 million nodes and is termed the abm161 network (Table 2). Motivated by the two-dimensional epistemic grid used in [35, 36], we used a lattice graph of dimension 10 × 10 to approximate the number of publications in the inaugural year (1665) of [4]. The year of each node in the lattice was set to 1665, the fitness of each node was set to 1 (out of a possible range of 1-1000), and a constant growth rate of 4.0125% was applied for a simulation that lasted 360 years. For the first 150 years of this simulation, a normally distributed out degree distribution ranging from 1-10 was used, after which it was exchanged for the PubMed-derived out degree distribution described in [26], which reflects more modern citation rates. For the two smaller networks, abm14 and abm76, the sj dataset was used as a starting point and the simulation was allowed to proceed for 85 and 128 years respectively at a growth rate of 4.0125%.
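
The sizes of the networks grown from the sj seed follow approximately from compound growth. The sketch below is a back-of-the-envelope check that ignores the integer rounding of agent counts the simulator must perform each year; `projected_size` is a hypothetical helper.

```python
def projected_size(seed_nodes, annual_growth, years):
    """Network size after compounding a fixed annual growth rate."""
    return seed_nodes * (1.0 + annual_growth) ** years

SJ_NODES = 491_532  # size of the sj seed network (Table 1)
for label, rate, years in [("standard", 0.03, 30),
                           ("abm14", 0.040125, 85),
                           ("abm76", 0.040125, 128)]:
    print(f"{label}: {projected_size(SJ_NODES, rate, years):,.0f} nodes")
```

Under this idealisation the three cases come to roughly 1.19 million, 13.9 million, and 75.6 million nodes, in line with the "around 1.2 million" figure for the standard simulation and with the abm14 and abm76 sizes in Table 2.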

2.4 Random Forest Analysis

A random forest regression analysis [37] was conducted to study model variables relative to time. This experiment was performed on the abm14 simulation, where the final network size was 13,926,219 nodes and the simulation spanned 85 virtual years. A snapshot of node attributes for all agent nodes was recorded every 5 years, producing a total of 18 snapshots (including one from the final year of the simulation). For each snapshot, a random forest regressor was trained with the scikit-learn Python library [38]. Each model consisted of 100 decision trees, with max_features = n_features and no max_depth constraint; other hyperparameters were left at default values. The input features were publication year, preferential attachment weight, fitness, and out degree. The target variable was in degree (number of citations received). We computed two types of feature importance: impurity-based feature importance (reported in Supplementary Materials) and permutation feature importance using the R² score. Feature importances were evaluated by examining empirical trends observed in the agent-based simulations to analyze SASCA-ReS model behavior. All models were trained on the Illinois Campus Cluster with up to 128 cores and 512 GB of memory. The training scripts and analysis procedures are available in our Github repository [39].
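
Permutation feature importance with the R² score can be illustrated without any library. The sketch below is not the training script from the repository: it uses a toy predictor in place of a fitted random forest and synthetic data, purely to show the definition. In scikit-learn, the corresponding utility is `sklearn.inspection.permutation_importance` with `scoring="r2"`.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Mean drop in R² when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)                 # break the link between feature j and y
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - r2_score(y, [predict(r) for r in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: the target depends on feature 0 and not at all on feature 1.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
y = [3.0 * a for a, _ in rows]

def predict(row):
    """Stand-in for a fitted model's predict method."""
    return 3.0 * row[0]

importances = permutation_importance(predict, rows, y)
print(importances)
```

Shuffling the informative feature produces a large drop in R², while shuffling the irrelevant one leaves the score unchanged, so its importance is zero.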

2.5 Clustering Analysis

The abm14 network and variants of it were clustered using the Leiden algorithm optimizing the Constant Potts Model (CPM) [40], with the number of iterations set to 2 and resolution values of either 0.1 or 0.01. We also attempted to cluster abm14, abm76, and abm161 using a parallel Louvain implementation [41] on a machine with 64 cores and 950 GB of RAM and a 4-hour time limit. The runs on abm14 and abm76 completed, but the run on abm161 timed out.

3 Results

In this section, we present results from experiments using SASCA-ReS that are designed to explore recency modeling, scalability, the relative importance of model variables, community structure, and unconventional strategies to optimize citations to a document in a citation network.

3.1 Recency Modeling

A salient feature of citation behavior is recency or immediacy [5], the tendency of authors to cite more contemporary articles than older ones.

In earlier versions of our ABM (SASCA-s and PyABM), high in degree nodes in a seed graph had an advantage in accumulating citations [42, p. 5 and Fig. 1] on account of preferential attachment overriding recency. In SASCA-ReS, we improved recency modeling through a binning approach (Supplementary Materials), which can be tuned to the recency patterns of different fields. Our choice of bin size reflects patterns in the biomedical literature and is derived from the cen network (Materials and Methods).

The effectiveness of this approach is shown in Fig. 1, where the proportion of citations made to preceding year nodes is more closely aligned with real-world data than in the case of the SASCA-s and PyABM models. The settings in Fig. 1 were used for all simulations reported in this manuscript.

Fig. 1 SASCA-ReS improves recency modeling. Here we show the true recency distribution (real-world data) for comparison with observed recency distributions using SASCA-ReS, SASCA-s, and PyABM. For all models shown, the simulation was performed for 30 years at a growth rate of 3% using the sj seed set in a randomized agent environment. For SASCA-ReS, recency bins (Materials and Methods) were set to single-year bins, with bin i containing all publications from i years ago, for the first ten years, followed by 5-year bins until year 51. Articles published at least 51 years ago were assigned to the last bin.
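
The binning scheme in the caption of Fig. 1 can be written as a small lookup. The exact boundaries are an assumption made for this sketch: ages 0 through 10 are given single-year bins, ages 11 through 50 are grouped into 5-year bins, and ages of 51 or more share the final bin. The function name `recency_bin` is hypothetical.

```python
def recency_bin(age):
    """Map a citation age in years to a recency-bin index (assumed boundaries)."""
    if age < 0:
        raise ValueError("an agent cannot cite a paper from the future")
    if age <= 10:
        return age                      # single-year bins: ages 0..10
    if age <= 50:
        return 11 + (age - 11) // 5     # 5-year bins: 11-15, 16-20, ..., 46-50
    return 19                           # catch-all bin: 51 years or older

print([recency_bin(a) for a in (0, 10, 11, 15, 16, 50, 51, 200)])
# [0, 10, 11, 11, 12, 18, 19, 19]
```

Changing the boundaries in this one function is the kind of adjustment that would tune the scheme to a field with different recency patterns.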

3.2 Scalability

To establish the scalability of SASCA-ReS, we simulated the growth of citation networks at three different scales: (i) 10-20 million nodes, (ii) 50-100 million nodes, and (iii) greater than 100 million nodes. These simulations were conducted in the default environment, ra, where the citation phenotype of agents was randomized to represent a distribution of citation behaviors. These scales were chosen because of the availability of comparable real-world citation networks, although the oa real-world network is only about 70% as large as the abm161 network while the abm14 and abm76 networks are much more closely matched in size to their real-world counterparts.

In the first and second cases, the simulated networks comprised 14,020,073 nodes (abm14) and 76,107,698 nodes (abm76) respectively (Table 2 and Figure 2). In the third case, also a metaphor for the growth of the scientific literature from 1665-2025 (see Introduction), the resultant network (abm161) consisted of 160,714,032 nodes with a runtime of just under 71 hrs. These simulated networks are less sparse than the real-world examples they were compared against, which we attribute to incompleteness in the data for our real-world networks and to our generative model.

Fig. 2 Scalable simulations of citation networks. Synthetic citation networks were generated with SASCA-ReS at three different scales (the count of nodes for the synthetic networks is approximately 161M, 76M and 14M). Real-world networks are shown for comparison. Left: Frequency of each in degree. Right: Cumulative distribution function of in degree (both logarithmic).

Unsurprisingly, the software tools that we typically use to analyze networks do not scale well to the two larger networks (abm76 and abm161) (Supplementary Materials). While it is possible to compute node and edge counts and degree distributions, it is expensive to compute metrics such as diameter and clustering coefficients, and even more challenging to cluster these networks. While this finding underscores the opportunity to develop more scalable tools, it also restricted much of our analysis to networks at the 10-20 million node scale exemplified by the abm14 network. Consequently, we have not yet tried to cluster the largest network generated under SASCA-ReS, which amounted to 217,839,850 nodes and 9,390,741,826 edges (Supplemental Materials) and for which we do not have a comparable real-world network.

3.3 Temporal Analysis

For the abm14 network, we examined how the in degree accumulated by agents at the end of a simulation varies according to when the agent was created (Figure 3). Agents created earlier in the simulation tend to have higher in degree than those created later, likely because later agents have fewer years in which to accumulate citations, even with recency being enforced.

**Table 2 Summary statistics of simulated and real-world networks.** Network sizes are shown for simulated networks (rows 1-3) and real-world networks (rows 4-6); we also show the time used to generate the simulated networks. The networks oa and oc are derived from the OpenAlex and Open Citations collections respectively [43], and *cen* [8] refers to the Curated Exosome Network, which is derived from the biomedical literature.
	# nodes	# edges	network	generation runtime
1	160,714,032	6,940,630,767	abm161	< 71 hrs
2	75,598,213	3,245,034,596	abm76	< 4 hrs
3	13,926,219	581,472,876	abm14	< 30 min
4	111,453,719	2,148,788,148	oa	n.a.
5	75,025,194	1,363,303,678	oc	n.a.
6	13,989,436	92,051,051	cen	n.a.

Examining the profiles of agents in the top 0.1% for in degree (4,188–260,404 citations) in this simulation, the median values for out degree and fitness are 103 and 18 respectively. Further, 92.04% of the agents in the top 0.1% have out degree or fitness at least equal to these medians. Treating the rule “out degree > 103 or fitness > 18” as a classifier for membership in the top 0.1% by in degree, we noted sensitivity and specificity of 0.926 and 0.948 respectively. These data strongly suggest that the out degree and fitness of an agent are influential, which is also consistent with our previous observations using the PyABM model [42]. In contrast, the median value of pa weight differed only marginally from the expected value of 0.5. However, the roughly 7.4% of top agents in the false negative group suggest additional determinants in these stochastic simulations.
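
Sensitivity and specificity for a rule of this form can be computed as in the sketch below. The thresholds are taken from the text, but the tuple layout, the function name `rule_performance`, and the toy agents are hypothetical and do not reproduce the reported values.

```python
def rule_performance(agents, cutoff_in_degree, out_cut=103, fit_cut=18):
    """Sensitivity and specificity of the rule 'out degree > out_cut or fitness > fit_cut'.

    agents: iterable of (in_degree, out_degree, fitness) tuples
    """
    tp = fp = tn = fn = 0
    for in_deg, out_deg, fitness in agents:
        actual = in_deg >= cutoff_in_degree                  # in the top group?
        predicted = out_deg > out_cut or fitness > fit_cut   # rule fires?
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

toy_agents = [(5000, 150, 5), (6000, 50, 40), (4500, 20, 3),   # top group
              (10, 20, 3), (12, 200, 1), (3, 5, 2), (7, 90, 10)]
sensitivity, specificity = rule_performance(toy_agents, cutoff_in_degree=4188)
print(round(sensitivity, 3), round(specificity, 3))  # 0.667 0.75
```

The false negative rate is one minus the sensitivity, which is how a sensitivity of 0.926 corresponds to roughly 7.4% of top agents being missed by the rule.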

Fig. 3 Positional measures of in degree for the abm14 network. The plot shows the minimum, first quartile, median, third quartile, and maximum in degree for agents grouped by the year in which they were created, for the 85-year simulation that generated the abm14 network.

We also used a Random Forest regressor to infer the relative importance of fitness, out degree, pa weight, and time of creation (Figure 4). Across simulation years, the permutation feature importances (PFI) stabilize during the course of the simulation, suggesting that once the dataset becomes sufficiently large, random forest models capture the underlying relationships with some consistency. Fitness consistently emerges as the dominant predictor while out degree and publication year are also influential in making predictions at comparable magnitudes. In contrast, preferential attachment weight shows lower importance for predictive accuracy.

**Table 3 Clustering with the Louvain algorithm**. The abm14 and abm76 networks were clustered using the implementation of the Louvain algorithm in the Kuzu graph database; no values are shown for abm161 because its run did not finish within the time limit. Summary statistics for the size of non-singleton clusters are shown.
	Min	Q1	Median	Q3	Max
abm161	–	–	–	–	–
abm76	678	8,051	4,181,752	13,781,060	44,347,046
abm14	1,016,079	1,948,065	3,162,148	4,719,101	6,679,698

Fig. 4 Permutation feature importances for abm14. Here we show the permutation feature importances, measured with the R² score, of random forest regressors trained on abm14 datasets at 5-year intervals during the simulation. Each random forest regressor consisted of 100 decision trees and predicts the final citation count received by an agent using four features: fitness, out degree, preferential attachment weight, and publication year.

3.4 Community Structure

An expected feature of growing networks is the presence of clusters, suggesting specialization and research community formation, which has previously been reported in simulations [35]. To ask to what extent networks generated under SASCA-ReS exhibit community structure, we attempted to cluster the synthetic networks we generated.

While we were previously able to cluster the Open Citations network of 75,025,194 nodes (Table 2) with the Leiden algorithm optimizing the Constant Potts Model, the time and memory needed to analyze abm76 and variants of it generated by varying parameter settings exceeded what was available to us. Further, clustering the abm76 and abm14 networks with the parallelized implementation of the Louvain algorithm [44] available as an extension to the Kuzu graph database [41] resulted in a decidedly unsatisfying cluster size distribution (Table 3), with the largest cluster comprising 58% and 48% of the network for the abm76 and abm14 networks respectively. We do not consider these Louvain clusterings very useful for evaluating community structure in these cases.

Table 4 In degree statistics and average local clustering coefficient (alcc) for agents in different simulated networks with 14 million nodes. The networks are variants of the abm14 network produced by varying whether α and preferential attachment weight are randomized or static. These networks all have 13,926,219 nodes. For each network we display statistics for its in degree distribution and alcc. Note that high α settings increase in degree and alcc of agent nodes.

In an alternate strategy, we used the Leiden algorithm optimizing the Constant Potts Model [40] to generate a more intuitively sensible [45] cluster size distribution from SASCA-ReS networks generated at the 10-20 million node scale under different conditions (Table 4). A high α value of 0.95 caused an appreciable right shift of the cumulative distribution function (CDF) for in degree, while a low α value caused a corresponding left shift relative to the default randomized conditions. Altering the pa weight of agents did not result in appreciable differences relative to the default simulations (Figure 5).

The clustering results are shown in Table 5 and indicate that a high α setting substantially increases the first quartile, median, and third quartile of cluster sizes relative to all other conditions tested, while at resolution 0.01 a high preferential attachment weight results in the highest maximum cluster size. We also examined, under varying conditions, the relationship between the size of clusters that consisted only of agents and their age (Figure 6), defining cluster age as the number of years since the earliest node in the cluster was created, and observed a pronounced decrease in the size of agent-only clusters in the high α simulation as cluster age increases. To further examine the effect of α, we also examined node coverage (the fraction of the network in clusters of size at least 2 or at least 10) across two different resolution values as α is varied.
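
Node coverage as defined above is straightforward to compute from a clustering. The sketch assumes the clustering is available as a mapping from node to cluster label; the name `node_coverage` and the toy membership are hypothetical.

```python
from collections import Counter

def node_coverage(membership, min_size=2):
    """Fraction of nodes whose cluster has at least `min_size` members."""
    sizes = Counter(membership.values())
    covered = sum(1 for cluster in membership.values() if sizes[cluster] >= min_size)
    return covered / len(membership)

# Ten nodes: one cluster of six, one of three, and one singleton.
membership = {n: "A" for n in range(6)}
membership.update({n: "B" for n in range(6, 9)})
membership[9] = "C"
print(node_coverage(membership, min_size=2))   # 0.9
print(node_coverage(membership, min_size=4))   # 0.6
```

Raising the size threshold from 2 to 10 removes small clusters from the numerator, so comparing the two values indicates how much of the coverage comes from very small clusters.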

Fig. 5 In-degree CDF on 14m node networks. Here we show the CDF plots of in-degrees from six different simulations where different agent parameters were held static.

**Table 5 Effect of** α **on cluster sizes.** SASCA-ReS simulations were performed in random and static agent environments at a growth rate of 4.0125% for 85 years to generate networks of size 13,926,219 nodes. Each of these networks was then clustered with the Leiden algorithm optimizing the Constant Potts Model (CPM), with the top section at resolution value r = 0.01 and the bottom section at r = 0.1.
agent phenotype	Min	Q1	Median	Q3	Max	num clusters	node cov.	sing.
r = 0.01
alpha0.05	2	15	16	21	4,714	569,362	1.00	20,647
alpha0.95	2	66	102	147	2,936	118,428	1.00	6,181
pa0.05	2	30	43	61	3,532	260,943	1.00	927
pa0.95	2	31	44	62	5,576	264,911	1.00	685
ra	2	30	43	61	4,117	263,438	1.00	746
ra noalpha	2	13	13	16	5,419	659,344	1.00	25,866
r = 0.1
alpha0.05	2	5	5	7	528	2,117,329	1.00	5,156
alpha0.95	2	8	16	29	379	675,401	0.99	106,358
pa0.05	2	6	9	15	350	1,212,845	1.00	34,110
pa0.95	2	6	9	15	392	1,214,029	1.00	51,955
ra	2	6	9	15	351	1,213,577	1.00	38,788
ra noalpha	2	4	5	5	544	2,323,650	1.00	7,631

3.5 Mavericks

Given the use of citation counts to evaluate productivity and impact, an incentive exists for authors to aspire to being well-cited. Results from PyABM suggest that, in addition to fitness, high-referencing and high values for α boost citation counts [25]. To explore unconventional strategies that might result in high in degree for individual agents, we created customized agents that we refer to as mavericks, polysemically related to maverick in [36]. Mavericks follow a different rule set than the other agents in SASCA-ReS in not being constrained by neighborhood, and may cite any other node in the network as long as it was published in a previous year. Each maverick is assigned a fitness value x and an out degree quota y, and we allow these to vary.

Fig. 6 Permutation feature importances for abm14. Here we show the permutation feature importances, measured with the R² score, of random forest regressors trained on abm14 datasets at different years of the simulation. Each random forest regressor consisted of 100 decision trees and predicts the final citation count received by an agent using four features: fitness, out degree, preferential attachment weight, and publication year.

Three different types of mavericks were designed: a maximizer that scans the entire network and cites y random nodes from the top 0.1% of the network by total degree; a randomnik that scans the entire network and cites y random nodes; and a minimizer that scans the entire network and cites y random nodes from the bottom 0.1% of the network by total degree. Mavericks were created in the first year of the simulation and were assigned the same fitness level and out degree quota. For the maverick parameters, we set x to 1, 10, 100, or 1000 and y to 10 or 249. Given randomness in the SASCA-ReS model, we planted three mavericks of each type per simulation. Additionally, we designated three non-maverick agents per simulation as controls.
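
The three selection rules can be sketched in a few lines. This is an illustration rather than the SASCA-ReS implementation: the function name `maverick_targets` is hypothetical, and widening the pool to at least y nodes on small networks is an assumption added so that the sketch runs on toy inputs.

```python
import random

def maverick_targets(rng, total_degree, kind, y, tail=0.001):
    """Pick y citation targets for a maverick of the given kind.

    total_degree: {node_id: in_degree + out_degree} for nodes from earlier years
    """
    ranked = sorted(total_degree, key=total_degree.get)   # ascending by total degree
    k = max(y, int(len(ranked) * tail))                   # assumed: pool holds >= y nodes
    if kind == "maximizer":
        pool = ranked[-k:]          # top 0.1% by total degree
    elif kind == "minimizer":
        pool = ranked[:k]           # bottom 0.1% by total degree
    elif kind == "randomnik":
        pool = ranked               # the entire network
    else:
        raise ValueError(f"unknown maverick kind: {kind}")
    return rng.sample(pool, y)

degrees = {node: node for node in range(10_000)}   # toy network: degree equals node id
rng = random.Random(7)
top = maverick_targets(rng, degrees, "maximizer", y=10)
bottom = maverick_targets(rng, degrees, "minimizer", y=10)
anywhere = maverick_targets(rng, degrees, "randomnik", y=10)
print(sorted(top), sorted(bottom), sorted(anywhere), sep="\n")
```

On the sj seed network, 0.1% of 491,532 nodes is about 491 nodes, so the pool already exceeds the largest quota of 249 and the widening rule would not be triggered.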

Simulations were conducted for 30 years at a growth rate of 3% using the sj seed set, whose nodes have in degree ranging from 0 to 88,435 and thus provide a wide range of in degree for mavericks to select from. The results (Figure 7) show that (i) increasing the fitness of mavericks increases their resultant in degree, (ii) increasing the out degree of mavericks increases their in degree, and (iii) the maximizer maverick outcompetes the randomnik, which in turn outcompetes the minimizer.

We also computed disruption [46] for mavericks and control agents and observed that when y = 249, the maximizer maverick, although very well-cited, had disruption values considerably lower than the randomnik, minimizer, or control. We attribute this to the larger denominator in the case of the maximizer on account of its references also being highly cited (Supplemental Materials). Thus, the original disruption formula of Wu and colleagues [46] may not be suitable for measurements in this context [47, 48], though a variant that is appropriately normalized to the context of this model may be more useful.
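
The denominator effect can be seen in a small sketch of the disruption index as it is commonly stated, D = (n_i − n_j) / (n_i + n_j + n_k), where n_i counts papers citing the focal paper but none of its references, n_j counts papers citing both, and n_k counts papers citing the references but not the focal paper. The data layout and the name `disruption` are hypothetical, and the computation used for the results above is described in Supplemental Materials.

```python
def disruption(focal, cites):
    """Disruption index of `focal`; `cites` maps each paper to the set of papers it cites."""
    refs = cites.get(focal, set())
    n_i = n_j = n_k = 0
    for paper, cited in cites.items():
        if paper == focal:
            continue
        cites_focal = focal in cited
        cites_refs = bool(cited & refs)
        if cites_focal and not cites_refs:
            n_i += 1        # cites the focal paper only
        elif cites_focal:
            n_j += 1        # cites the focal paper and at least one of its references
        elif cites_refs:
            n_k += 1        # cites the references but not the focal paper
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else 0.0

cites = {
    "F": {"R1", "R2"},
    "A": {"F"}, "B": {"F"}, "C": {"F", "R1"},
    "D": {"R1"}, "E": {"R2"}, "G": {"R1", "R2"},
}
print(round(disruption("F", cites), 4))  # 0.1667

# Adding papers that cite only the references enlarges n_k and shrinks the index.
for extra in ("H", "I", "J", "K", "L", "M"):
    cites[extra] = {"R1"}
print(round(disruption("F", cites), 4))  # 0.0833
```

A maximizer cites highly cited nodes by construction, so n_k is large for it even when its own citation count is high, which pulls the index toward zero.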

Fig. 7 Maverick citation strategies. Three each of maximizer (max), randomnik (rnd), and minimizer (min) mavericks were planted in the first year of a 30-year, 3% growth simulation on the sj seed network with randomized agent phenotypes. A control group of three randomized agents was also planted. Simulations in the upper row were performed with all planted agents (mavericks and controls) assigned out degree quotas of 249 publications versus an out degree quota of 10 in the bottom row. In independent experiments, mavericks and controls were assigned fitness values of 1, 10, 100, and 1000 (columns).

4 Discussion

Using the SASCA-ReS model, which features improved recency modeling, we modeled the growth of citation networks at scale. Simulations were conducted in randomized agent environments corresponding to our assumption of a broad variety of citation behaviors across individuals. Networks generated under the SASCA-ReS model are denser than the real-world examples we used for comparison. While we cannot exclude the possibility that this is an artifact of the model, we are inclined to believe that missing nodes and edges from real-world data also account for this difference. In the simulation protocol, all citations made by agents are to nodes in the growing network, whereas publications in the comparable real-world network may cite or be cited by articles that are not included in the network. Thus, the simulated networks will have more edges and hence be denser than their corresponding real-world networks.

Models do not generate conclusive evidence but are useful for generating new hypotheses and revisiting existing ones. The SASCA-ReS simulations reported here suggest that fitness and out degree are influential. While the model does not consider factors such as journal or author reputation, the effects of collaboration between authors, the limitations of peer review, and epistemic misconduct, the result suggests that publication quality is important even in a complex environment of individual behaviors. The out degree effect is consistent with Small, Sweeney, and Greenlee’s interpretation in 1985 [49] of ‘high-referencing’ articles and the case for fractional citation counting. An incidental advantage of the idealization is that it is not possible to compute the h-index.
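For readers unfamiliar with fractional citation counting, the following minimal sketch contrasts it with whole counting. Under fractional counting, each citation from a paper with r references contributes 1/r instead of 1, so high-referencing papers confer less credit per citation. The function name and toy data are illustrative assumptions and are not part of SASCA-ReS.

```python
from collections import defaultdict


def citation_counts(references):
    """Return (whole, fractional) citation counts.

    references maps each citing paper to the set of papers it cites.
    """
    whole = defaultdict(int)
    fractional = defaultdict(float)
    for citing, cited_set in references.items():
        if not cited_set:
            continue  # a paper with no references gives no credit
        weight = 1.0 / len(cited_set)
        for cited in cited_set:
            whole[cited] += 1
            fractional[cited] += weight
    return dict(whole), dict(fractional)


# Toy data: paper A has one reference, paper B has four.
references = {
    "A": {"X"},
    "B": {"X", "Y", "Z", "W"},
}
whole, fractional = citation_counts(references)
```

Here X has a whole count of 2 and a fractional count of 1.25, whereas Y, cited only by the high-referencing paper B, has a whole count of 1 but a fractional count of 0.25.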

Simulations enable asking counterfactual questions. We asked what the structure of the global citation network might be today if researchers had historically exhibited extreme insularity, modeled by high α values, in their citation behavior. The results suggest that community structure would be enhanced, perhaps at the cost of interdisciplinary work.

Using maverick agents, we examined the hypothetical question of whether discipline-independent citations would favor the accumulation of citations to individual documents. We observed that referencing only the most highly cited nodes in a network results in very high citation counts. In the real world, this effect is likely counterbalanced by the insular behavior of researchers.

These initial studies were limited by the challenges of studying synthetic networks of over 75 million nodes in detail because of software and infrastructure constraints. That fitness is a driver, even in our idealized environment, makes a welcome case for quality in the world of citations. The result from the maximizer maverick suggests that more cynical strategies may also work. Using an abstract model for reasoning, however, generates ideas that should not be equated with evidence in the real world. Our emphasis on model simplification in this manuscript also limited the extent to which we could explore the nonlinearity of citation dynamics proposed by Golosovsky [19]. Finally, we have not assessed statistical identifiability, but we expect that the SASCA-ReS model is not identifiable.

Supplementary information. This manuscript is accompanied by Supplementary Information in pdf format.

Acknowledgements. The authors thank the Illinois Computes program for allocations of computing time.

Declarations

This study was partially supported by the Illinois:Insper Partnership and NSF Award 2402559.
The authors declare that they have no conflicts of interest.
Data availability. Data for the large networks generated in this study are available from the Illinois Data Bank under IDB-9265079. All other data are publicly available or can be regenerated through the protocols in the manuscript and the SASCA-ReS Code.
Code availability. The code for SASCA-ReS is available from Github.
Author contributions. Conceptualization (GC, MP, TW), Data Curation (MP), Formal Analysis (MP, HY), Investigation (MP, HY, TW, GC), Methodology (MP, GC, TW), Supervision (GC, TW), Validation (MP, HY, GC), Visualization (HY, MP, GC), Writing - Original Draft (GC, TW, MP, HY), Writing - Review and Editing (GC, TW, MP, HY), Funding Acquisition (GC, TW).

References

[1] Digital Science. (2018-) Dimensions Software. https://app.dimensions.ai. Accessed: 2025-09-07 under licence agreement
[2] Open Alex. https://openalex.org. Accessed: 2025-09-07
[3] Peroni, S., Shotton, D.: OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies 1(1), 428–444 (2020) https://doi.org/10.1162/qss_a_00023
[4] Philosophical Transactions of the Royal Society of London. https://royalsocietypublishing.org/journal/rstl. Accessed: 2025-09-07
[5] Price, D.J.d.S.: Networks of Scientific Papers: The pattern of bibliographic references indicates the nature of the scientific research front. Science 149(3683), 510–515 (1965) https://doi.org/10.1126/science.149.3683.510
[6] Garfield, E., Cawkell, A.E.: Location of milestone papers through citation networks. Journal of Library History 5, 184–188 (1970)
[7] Boyack, K.W., Klavans, R.: Creation of a highly detailed, dynamic, global model and map of science. Journal of the Association for Information Science and Technology 65(4), 670–685 (2013) https://doi.org/10.1002/asi.22990
[8] Wedell, E., Park, M., Korobskiy, D., Warnow, T., Chacko, G.: Center–periphery structure in research communities. Quantitative Science Studies 3(1), 289–314 (2022) https://doi.org/10.1162/qss_a_00184
[9] Gilbert, G.N.: Referencing as Persuasion. Social Studies of Science 7(1), 113–122 (1977) https://doi.org/10.1177/030631277700700112
[10] Cronin, B.: The Need for a Theory of Citing. J. Documentation 37, 16–24 (1981)
[11] Cozzens, S.E.: Taking the measure of science: A review of citation theories. Newsletter of the International Society for the Sociology of Knowledge 7(1–2), 16–21 (1981)
[12] Cozzens, S.E.: What do citations count? The rhetoric-first model. Scientometrics 15(5–6), 437–447 (1989) https://doi.org/10.1007/bf02017064
[13] Leydesdorff, L.: Theories of citation? Scientometrics 43(1), 5–25 (1998) https://doi.org/10.1007/bf02458391
[14] Biagioli, M., Lippman, A.: Gaming the Metrics: Misconduct and Manipulation in Academic Research. The MIT Press, Cambridge MA (2020). https://doi.org/10.7551/mitpress/11087.001.0001
[15] Secchi, D.: A Simple Model of Citation Cartels: When Self-interest Strikes Science, pp. 23–32. Springer, ??? (2023). https://doi.org/10.1007/978-3-031-34920-1_3
[16] Bao, H., Teplitskiy, M.: A simulation-based analysis of the impact of rhetorical citations in science. Nature Communications 15(1) (2024) https://doi.org/10.1038/s41467-023-44249-0
[17] Simkin, M.V., Roychowdhury, V.P.: A mathematical theory of citing. Journal of the American Society for Information Science and Technology 58(11), 1661–1673 (2007) https://doi.org/10.1002/asi.20653
[18] Goldberg, S.R., Anthony, H., Evans, T.S.: Modelling citation networks. Scientometrics 105(3), 1577–1604 (2015) https://doi.org/10.1007/s11192-015-1737-9
[19] Golosovsky, M., Solomon, S.: Growing complex network of citations of scientific papers: Modeling and measurements. Physical Review E 95(1) (2017) https://doi.org/10.1103/physreve.95.012324
[20] Price, D.D.S.: A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27(5), 292–306 (1976) https://doi.org/10.1002/asi.4630270505
[21] Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
[22] Silva, F.N., Tandon, A., Amancio, D.R., Flammini, A., Menczer, F., Milojević, S., Fortunato, S.: Recency predicts bursts in the evolution of author citations. Quantitative Science Studies 1(3), 1298–1308 (2020) https://doi.org/10.1162/qss_a_00070
[23] Zhou, B., Holme, P., Gong, Z., Zhan, C., Huang, Y., Lu, X., Meng, X.: The nature and nurture of network evolution. Nature Communications 14(1) (2023) https://doi.org/10.1038/s41467-023-42856-5
[24] Touwen, L., Bucur, D., Hofstad, R., Garavaglia, A., Litvak, N.: Learning the mechanisms of network growth. Scientific Reports 14(1) (2024) https://doi.org/10.1038/s41598-024-61940-4
[25] Chacko, G., Park, M., Ramavarapu, V., Grama, A., Robles-Granda, P., Warnow, T.: An agent-based model of citation behavior. (Accepted) Applied Network Science (2025). A preprint version is available on arXiv at https://doi.org/10.48550/arXiv.2503.06579
[26] Park, M., Lamy, J.A., Rodrigues, E.C., Ferreira, F.M., Vu-Le, T.-A., Warnow, T., Chacko, G.: Very Large Scale Simulations of Network Growth with the Scalable Agent-based Simulator for Citation Analysis with sampling (SASCA-s). In press, 14th International Conference on Complex Networks and their Applications (2025)
[27] Macal, C.M.: Everything you need to know about agent-based modelling and simulation. Journal of Simulation 10(2), 144–156 (2016) https://doi.org/10.1057/jos.2016.7
[28] Park, M., Tabatabaee, Y., Ramavarapu, V., Liu, B., Pailodi, V.K., Ramachandran, R., Korobskiy, D., Ayres, F., Chacko, G., Warnow, T.: Well-connectedness and community detection. PLOS Complex Systems 1(3), 0000009 (2024) https://doi.org/10.1371/journal.pcsy.0000009
[29] Caetano Machado Lopes, L., Chacko, G.: A Citation Graph from OpenAlex (Works). University of Illinois Urbana-Champaign (2024). https://doi.org/10.13012/B2IDB-7362697_V1
[30] Park, M., Vu-Le, T.-A., Warnow, T., Chacko, G.: SASCA-ReS. https://github.com/illinois-or-research-analytics/SASCA-ReS. Github repository (2025)
[31] Harnagel, A.: A mid-level approach to modeling scientific communities. Studies in History and Philosophy of Science Part A 76, 49–59 (2019) https://doi.org/10.1016/j.shpsa.2018.12.010
[32] Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.S.: The Web as a Graph: Measurements, Models, and Methods, pp. 1–17. Springer, New York, NY (1999). https://doi.org/10.1007/3-540-48686-0_1
[33] Krapivsky, P.L., Redner, S.: Network growth by copying. Physical Review E 71(3) (2005) https://doi.org/10.1103/physreve.71.036118
[34] Crane, D., Small, H.: American sociology since the seventies. In: Halliday, T.C., Janowitz, M. (eds.) Sociology and Its Publics. Heritage of Sociology Series. University of Chicago Press, Chicago, IL (1992)
[35] Gilbert, N.: A Simulation of the Structure of Academic Science. Sociological Research Online 2(2), 91–105 (1997) https://doi.org/10.5153/sro.85
[36] Weisberg, M., Muldoon, R.: Epistemic Landscapes and the Division of Cognitive Labor. Philosophy of Science 76(2), 225–252 (2009) https://doi.org/10.1086/644786
[37] Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001) https://doi.org/10.1023/A:1010933404324
[38] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(85), 2825–2830 (2011)
[39] Yi, H.: SASCA-RF. https://github.com/AnoldGH/RandomForestTools. Github repository (2025)
[40] Traag, V.A., Waltman, L., Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9(1) (2019) https://doi.org/10.1038/s41598-019-41695-z
[41] Feng, X., Jin, G., Chen, Z., Liu, C., Salihoğlu, S.: Kùzu graph database management system. In: CIDR (2023)
[42] Chacko, G., Park, M., Ramavarapu, V., Grama, A., Robles-Granda, P., Warnow, T.: An Agent-based Model of Citation Behavior. Preprint: arXiv:2503.06579 (2025)
[43] Park, M., Tabatabaee, Y., Warnow, T., Chacko, G.: Data for Well-Connectedness and Community Detection. University of Illinois Urbana-Champaign (2024). https://doi.org/10.13012/B2IDB-6271968_V1
[44] Lu, H., Halappanavar, M., Kalyanaraman, A.: Parallel heuristics for scalable community detection. Parallel Computing 47, 19–37 (2015) https://doi.org/10.1016/j.parco.2015.03.003. Graph analysis for scientific discovery
[45] Šubelj, L., Eck, N.J., Waltman, L.: Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLOS ONE 11(4), 0154404 (2016) https://doi.org/10.1371/journal.pone.0154404
[46] Wu, L., Wang, D., Evans, J.A.: Large teams develop and small teams disrupt science and technology. Nature 566(7744), 378–382 (2019) https://doi.org/10.1038/s41586-019-0941-9
[47] Bornmann, L., Ettl, C., Leibel, C.: In search of innovative potential: The challenge of measuring disruptiveness of research using bibliometric data. EMBO Reports 25(7), 2837–2841 (2024) https://doi.org/10.1038/s44319-024-00177-8
[48] Bornmann, L., Devarakonda, S., Tekles, A., Chacko, G.: Are disruption index indicators convergently valid? The comparison of several indicator variants with assessments by peers. Quantitative Science Studies 1(3), 1242–1259 (2020) https://doi.org/10.1162/qss_a_00068
[49] Small, H., Sweeney, E., Greenlee, E.: Clustering the science citation index using co-citations. II. Mapping science. Scientometrics 8(5-6), 321–340 (1985) https://doi.org/10.1007/bf02018057

Editors

Ludo Waltman
Editor-in-Chief

Ludo Waltman
Handling Editor

Editorial assessment

by Ludo Waltman

DOI: 10.70744/MetaROR.296.1.ea

In this article the authors present an agent-based model of citation behavior and compare citation networks generated by this model with real-world citation networks. The article has been reviewed by two reviewers, who both consider the work to be “technically impressive”. The reviewers point out that the comparison between synthetic and real-world citation networks could be extended and deepened. Reviewer 1 also asks which features of real-world citation networks the model aims to capture. The reviewer invites the authors to be more explicit about real-world use cases for their model. This relates to a question raised by reviewer 2 concerning how the model’s abstract insights map onto reality. Finally, reviewer 1 asks whether the model can account for disciplinary differences in citation practices. Reviewer 2 suggests that a sensitivity analysis may be helpful to provide a deeper understanding of the effects of the model’s key parameters.

Recommendations for enhanced transparency

Add author ORCID iDs.

For more information on these recommendations, please refer to our author guidelines.

Competing interests: None.

Peer review 1

Erjia Yan

DOI: 10.70744/MetaROR.296.1.rv1

This is a very interesting and technically impressive paper. The ability to simulate citation networks at the scale of 100+ million nodes is a major computational achievement. The model is clearly carefully designed, and the experiments are thoughtfully executed.

It’s not fully clear what the essential “features” of the model are beyond fitness, preferential attachment, recency, out-degree, and the α parameter. At this scale, small modeling assumptions can have very large structural consequences. Are these features sufficient to reproduce the deeper structure of real citation networks? Or are important drivers (topic similarity, journal stratification, institutional prestige, collaboration structure, etc.) missing? A clearer statement of what the model does and does not attempt to capture would help.

Citation behavior varies substantially across fields. It’s not just about recency. Different disciplines have different:

Citation half-lives
Reference list lengths
Collaboration norms
Interdisciplinary patterns

The model uses a recency table derived from biomedical data, but that may not generalize. Can the framework incorporate multiple disciplinary regimes? Without that, the simulation may reflect a single stylized citation culture rather than the heterogeneous reality of science.

Real-world citation networks are organic. They are not just the result of weighted attachment and recency. Even with random “idiosyncratic” citations built in, real citation behavior is not uniform randomness. So while comparing simulated and real networks (e.g., degree distributions) is interesting, it may not be especially informative on its own. A simulation cannot fully capture the organic, path-dependent nature of real citation systems. The comparison would be stronger if tied to specific theoretical questions rather than structural similarity alone.

It’s not fully clear what the practical applications of the model are. Is it meant for:

Policy simulation?
Testing citation-based evaluation metrics?
Studying long-term structural evolution?
Exploring counterfactuals about scientific development?

The model is powerful, but the real-world use cases could be articulated more concretely.

Competing interests: None.

Peer review 2

Anonymous User

DOI: 10.70744/MetaROR.296.1.rv2

The paper presents a significant and technically impressive contribution to scientometrics. The core methodological advancement, replacing an endogenous recency weight with an externally enforced, empirically derived recency distribution via binning, is both clever and effective, as demonstrated by the improved fit to real-world data. I offer the below suggestions to help improve the manuscript.

The paper briefly mentions (in the Discussion) that the model is likely not statistically identifiable, but this critical point is not explored. Given the model’s complexity (20+ configuration parameters) and the inherent stochasticity, different parameter combinations could potentially generate networks with similar macro-level properties (e.g., degree distribution, recency profile, etc.). A more thorough discussion or a simple sensitivity analysis on key parameters (like the idiosyncratic fraction IF, the sampling limit B, or the fitness distribution shape) would strengthen the claims about the model’s behavior. Without this, it is difficult to assess how uniquely the chosen parameterization leads to the observed results.
While it is understandable that clustering billion-edge networks is difficult, the presentation of results seems unsatisfying. The Louvain algorithm results are correctly dismissed as uninformative (?), but the subsequent Leiden analysis on smaller networks is presented without clear interpretation. I’m confused about: What do these clusters represent in the context of the simulation’s mechanics? How do they relate to the “epistemic grid” or generator neighborhoods? The observation that high alpha increases cluster size is noted, but the epistemological or sociological implication of this finding (beyond “enhanced community structure”) remains vague. The community analysis, as presented, adds little substantive insight.
The authors attribute the higher density of their synthetic networks (vs. OpenAlex or Open Citations…) to likely missing data in the real-world corpora. An alternative explanation could be a systematic bias in the generative model, perhaps in the outdegree distribution or the lack of mechanisms for “missing” citations (to books, non-indexed journals, etc.). More importantly, the validation is primarily structural (degree distribution, recency). The paper would be significantly stronger if it could compare dynamic or temporal metrics between the synthetic and real networks, e.g., the evolution of degree correlation over time, citation cascade patterns, or the aging profile of papers, to demonstrate that SASCA-ReS captures not just a snapshot statistic but the process of growth.
The authors clearly state their commitment to an idealized model for the sake of parsimony. However, this very idealization limits the translational power of some central conclusions. The finding that “fitness” is the dominant predictor is compelling within the model, but the paper does not sufficiently discuss what this abstract “fitness” maps onto in reality. Is it research quality, novelty, field-specific relevance, or a combination (this was also a potential limitation of many previous works)? The “Maverick” experiment, while interesting, suffers similarly. It shows a strategy that works in a world without disciplines, journals, author reputations, or reading practices. The discussion does not adequately address the vast gulf between this abstract result and its implausibility in real scholarly communication, where citing only the top-0.1% most-cited papers across all science would be seen as bizarre and lacking context.

Competing interests: None.

Author response

DOI: 10.70744/MetaROR.296.1.ar

We thank the reviewers for their thoughtful comments and we acknowledge the validity of their observations. We note the interest in greater emphasis on the implications of the work over reporting structural observations and consider it very helpful as we envision a revised version of this preprint.

A recurring theme in the reviews is the balance between parsimony and complexity. For SASCA-ReS, in the interest of interpretability, we chose to lean towards parsimony. We note, though, that SASCA-ReS also represents a step towards complexity in the progressive development of a citation model, since it has additional features compared to the first model we developed (PyABM; Chacko et al. 2026, doi: 10.1007/s41109-025-00766-z), which in turn adds complexity to Gilbert’s agent-based model (Gilbert 1997, doi: 10.5153/sro.85).

Not discussed in the manuscript is that we have followed an approach to model development that is similar to one used in mathematical modeling in biology: initially a very simple model is proposed and studied, then extended to reflect increasingly more of the biological complexity. An example is how models of molecular sequence evolution were developed, beginning with the single parameter Jukes-Cantor 1969 model and progressing over several decades to include increasingly parameter-rich models (see https://en.wikipedia.org/wiki/Models_of_DNA_evolution), and then continuing to expand to address genome-level complexity. Our own approach to developing SASCA-ReS is consistent with this philosophy of gradual incorporation of features that come closer to modeling the complex citation practices in modern science.

Reviewer 1

…It’s not fully clear what the essential “features” of the model are beyond fitness, preferential attachment, recency, out-degree, and the α parameter. At this scale, small modeling assumptions can have very large structural consequences. Are these features sufficient to reproduce the deeper structure of real citation networks? Or are important drivers (topic similarity, journal stratification, institutional prestige, collaboration structure, etc.) missing? A clearer statement of what the model does and does not attempt to capture would help.

Our view of agents is that they should be autonomous, obey simple rules, and exhibit randomness in their behavior. SASCA-ReS (preceded by PyABM) was never intended to account for all the reasons that drive researchers to cite some publications and not others. Thus, we initially chose preferential attachment, fitness, recency, and network locality (α). We do not presently encode topic similarity, affiliations, institutional prestige, collaboration, or venue.

Citation behavior varies substantially across fields. It’s not just about recency. Different disciplines have different:

Citation half-lives

Reference list lengths

Collaboration norms

Interdisciplinary patterns

The model uses a recency table derived from biomedical data, but that may not generalize. Can the framework incorporate multiple disciplinary regimes? Without that, the simulation may reflect a single stylized citation culture rather than the heterogeneous reality of science.

Our choice of biomedical data reflects our interest and the availability of a curated database of the biomedical literature (PubMed). The approach itself is generalizable and the software allows a user to use any recency table they prefer. Only one recency table can be used in a simulation.

That new communities form in such a generative model speaks to article clusters that, we believe, may correspond better with research communities than disciplinary labels alone. This idea is not novel. For example, Price (1965, doi: 10.1126/science.149.3683.510) alluded to “segments and strips in the research front” and Waltman and van Eck (2012, doi: 10.1002/asi.22748) comment that “…However, we found only a partial correspondence between our research areas and these traditional disciplines. In itself, we believe this to be an interesting result. It may be seen as an indication that traditional disciplines such as those just mentioned only partly reflect the actual organization of today’s scientific research.” A future version of the model could incorporate disciplinary labels.

Real-world citation networks are organic. They are not just the result of weighted attachment and recency. Even with random “idiosyncratic” citations built in, real citation behavior is not uniform randomness. So while comparing simulated and real networks (e.g., degree distributions) is interesting, it may not be especially informative on its own. A simulation cannot fully capture the organic, path-dependent nature of real citation systems. The comparison would be stronger if tied to specific theoretical questions rather than structural similarity alone.

We agree that a simulation cannot “…fully capture the organic nature of…”. Tying the simulation to specific theoretical questions is also subject to this caveat. However, the point is appreciated and aligns well with future work we have planned.

It’s not fully clear what the practical applications of the model are. Is it meant for:

Policy simulation?

Testing citation-based evaluation metrics?

Studying long-term structural evolution?

Exploring counterfactuals about scientific development?

The model is powerful, but the real-world use cases could be articulated more concretely.

All four are feasible but we are most interested in the third and fourth.

Reviewer 2 (anonymous)

The paper briefly mentions (in the Discussion) that the model is likely not statistically identifiable, but this critical point is not explored. Given the model’s complexity (20+ configuration parameters) and the inherent stochasticity, different parameter combinations could potentially generate networks with similar macro-level properties (e.g., degree distribution, recency profile, etc.). A more thorough discussion or a simple sensitivity analysis on key parameters (like the idiosyncratic fraction IF, the sampling limit B, or the fitness distribution shape) would strengthen the claims about the model’s behavior. Without this, it is difficult to assess how uniquely the chosen parameterization leads to the observed results.

We do not yet have either a proof or a strong analytical result. However, we did perform an initial sensitivity analysis using a one-factor-at-a-time (OFAT) approach during the development cycles of PyABM (Chacko et al. 2026; Supplementary Materials) and the SASCA model series (Park et al., in press 2025, Proceedings of the XIV International Conference on Complex Networks and their Applications; a figure has been posted to the Github site for SASCA-ReS), in which we varied weights and alpha (α) and estimated their effects on accumulated citations. All other conditions were held constant, with the idiosyncratic fraction set to 5%, fitness values ranging from 1 to 1000 following a power-law distribution, and the sampling approach calibrated to generate results comparable to using the full set of eligible target nodes. The results of these OFAT analyses are in the supplementary materials of these publications.
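As a generic illustration of the OFAT design, the sketch below varies one parameter at a time while holding all others at a baseline. The parameter names, the baseline values, and the toy response function are hypothetical stand-ins; they are not the SASCA-ReS configuration or its results.

```python
def ofat_runs(model, baseline, grid):
    """Run `model` once per (parameter, value), holding all others at baseline."""
    runs = []
    for name, values in grid.items():
        for value in values:
            params = dict(baseline)  # copy so the baseline is never mutated
            params[name] = value
            runs.append((name, value, model(**params)))
    return runs


def toy_model(pa_weight, fitness_weight, alpha):
    """Hypothetical stand-in for a simulation run; NOT the SASCA-ReS model."""
    return round(100 * fitness_weight + 40 * pa_weight - 10 * alpha, 6)


# Hypothetical baseline and one-dimensional sweeps.
baseline = {"pa_weight": 0.3, "fitness_weight": 0.3, "alpha": 0.5}
grid = {"fitness_weight": [0.1, 0.5, 0.9], "alpha": [0.0, 1.0]}
runs = ofat_runs(toy_model, baseline, grid)
```

Each entry in `runs` differs from the baseline in exactly one parameter, which keeps attribution simple but, by construction, cannot reveal interactions between parameters.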

At the time of this response, an advanced version of the model is under active development. A preprint and supplementary material describing this version will include results for a more comprehensive OFAT analysis and an ablation study.

While it is understandable that clustering billion-edge networks is difficult, the presentation of results seems unsatisfying. The Louvain algorithm results are correctly dismissed as uninformative (?), but the subsequent Leiden analysis on smaller networks is presented without clear interpretation. I’m confused about: What do these clusters represent in the context of the simulation’s mechanics? How do they relate to the “epistemic grid” or generator neighborhoods? The observation that high alpha increases cluster size is noted, but the epistemological or sociological implication of this finding (beyond “enhanced community structure”) remains vague. The community analysis, as presented, adds little substantive insight.

We accept this criticism. While we use clustering based on network topology as a measure of community structure, we interpret its results cautiously, given the lack of a ground truth and the limitations of any clustering method. From an abstract perspective, community formation is favored by a bias towards local citation. Extrapolating to the epistemic or sociological implications is trickier, but one could speculate that this mirrors behavior in social groups. Without more evidence, we are not willing to make bold assertions.

The authors attribute the higher density of their synthetic networks (vs. OpenAlex or Open Citations…) to likely missing data in the real-world corpora. An alternative explanation could be a systematic bias in the generative model, perhaps in the outdegree distribution or the lack of mechanisms for “missing” citations (to books, non-indexed journals, etc.). More importantly, the validation is primarily structural (degree distribution, recency). The paper would be significantly stronger if it could compare dynamic or temporal metrics between the synthetic and real networks, e.g., the evolution of degree correlation over time, citation cascade patterns, or the aging profile of papers, to demonstrate that SASCA-ReS captures not just a snapshot statistic but the process of growth.

We outlined one possible explanation without excluding others. In designing the model, we made the assumption that the network consisted of journal articles. The outdegree distribution that we use was sampled from PubMed journal articles. We did not engineer a missing citation feature into the model. The output of SASCA-ReS does permit kinetic analysis of the synthetic networks generated, which seems feasible for a revision.

The authors clearly state their commitment to an idealized model for the sake of parsimony. However, this very idealization limits the translational power of some central conclusions. The finding that “fitness” is the dominant predictor is compelling within the model, but the paper does not sufficiently discuss what this abstract “fitness” maps onto in reality. Is it research quality, novelty, field-specific relevance, or a combination (this was also a potential limitation of many previous works)?

We briefly discuss the tradeoff between parsimony and complexity at the beginning of this Response. The former offers interpretability at the cost of realism and complexity. The latter is challenging to implement and evaluate, and runs the risk of overfitting. Eventually, a choice is made by the modeler. While the lack of realism in SASCA-ReS may leave a reader disappointed, we clearly stated that the purpose of these models was not to replicate reality but to allow us to reason within an artificial world.

Fitness, in our model, refers to the epistemic quality of an article that attracts citations and is distinct from preferential attachment. Thus, an article of high fitness with few or zero citations will be cited (Touwen et al. 2024, doi: 10.1038/s41598-024-61940-4 and Zhou et al. 2023, doi: 10.1038/s41467-023-42856-5), especially by agents with high fitness weight. In the real world, epistemic quality is more difficult to define precisely, which is what the reviewer may be referring to.

The “Maverick” experiment, while interesting, suffers similarly. It shows a strategy that works in a world without disciplines, journals, author reputations, or reading practices. The discussion does not adequately address the vast gulf between this abstract result and its implausibility in real scholarly communication, where citing only the top-0.1% most-cited papers across all science would be seen as bizarre and lacking context.

The point of the Maverick experiment was to exaggerate a citation behavior in order to observe its effects in a model scenario. The experiment addresses whether a citation strategy that only concerns itself with preferential attachment can succeed. The answer in our experiment was affirmative. This illustrative experiment suggests that less-exaggerated expressions, which would not merit the label of “bizarre”, may offer benefits in the quest for citations. This is a case where experimentation in a model environment offers advantages over mining real world data.
