Publication output doubles roughly every 17 years1, reaching 3.3 million articles in 20222, while researchers spend ever less time on each paper3. AI tools aid summarization and question-answering4 but cannot solve the deeper challenge of knowledge integration when the underlying literature is incoherent and inaccessible. In the behavioral and social sciences, findings are fragmented by incomparable measures5, bespoke materials and stimuli6,7, and poorly documented participant demographics8. Stimuli, code, and data9 are often unavailable, and even shared data are typically limited to summary statistics rather than standardized trial- or event-level observations10. Knowledge cannot accumulate from such incommensurable fragments.
To address this dual crisis of volume and fragmentation, and in response to a recent U.S. National Academies call for infrastructure to unify scientific knowledge11, we argue that the scientific paper must be rebuilt for two audiences: human readers and AI systems. For humans, this requires front-loading key findings in accessible prose for time-pressed researchers and turning papers into layered knowledge bases where readers can interrogate data, rerun analyses, and access technical details without wading through dense text. For AI systems, the main text, data, code, and protocols must be structured in machine-readable formats that enable automated analysis, comparison, and synthesis. In this Perspective, “AI system” denotes an LLM‑based pipeline that combines generative models with tools, external memory, and deterministic analysis modules, rather than a standalone language model.
Our Perspective builds on FAIR, open‑science, and reproducible‑workflow initiatives but makes four contributions. First, we propose a dual‑audience paper architecture that couples a front‑loaded narrative for human readers with a machine‑actionable structured appendix, rather than treating “supplementary materials” as an afterthought. Second, we specify a minimum viable structured appendix: not just a mandate to “share data” but a bundle of executable containers, standardized stimulus and measure identifiers, ontological mappings of constructs, and tidy trial‑level data with inclusive demographic coding. Third, we introduce a publication‑layer jingle–jangle audit, in which semantic and statistical checks on construct labels become routine infrastructure rather than occasional manual critiques. Fourth, we show how these elements together support living evidence networks, where individual papers become version‑controlled nodes feeding continuously updated, quality‑weighted syntheses.
Some journals already require authors to share datasets and scripts, typically in loosely documented repositories. These resources are idiosyncratic across studies, lack persistent identifiers for stimuli and measures, are not mapped onto shared ontologies, and rarely include trial‑level data with full provenance. They are human‑downloadable but not readily interoperable, queryable, or auditable at scale. By contrast, the dual‑audience paper and structured appendix proposed here treat interoperability, automation, and verification as first‑class design goals: standardized metadata, identifiers, and containerized workflows enable both human analysts and AI systems to rerun analyses, audit construct usage, and feed living evidence networks, rather than merely attaching files to a PDF.
Together, these components define a framework for turning individual papers into queryable research environments and inputs for AI pipelines in which synthesis and verification modules operate on verified, structured knowledge rather than unvetted prose. Our goal is not to introduce yet another platform or standard, but to specify how existing tools—containers, ontologies, data standards, and registries—can be assembled into a publication format and incentive structure that serves both human comprehension and machine analysis in psychology and other data‑rich sciences.
Crises in scientific practice and communication
The volume–fragmentation spiral has produced a cascade of institutional failures, beginning with quality control. Exponential growth in publications and preprints has overwhelmed traditional peer review12, which no longer provides a reliable signal of quality and relevance. Researchers skim abstracts, abandon papers mid-reading, and retreat to secondary summaries. Meanwhile, the scientific paper—designed for a print era of information scarcity—has become a bottleneck in an AI-rich era: critical data are buried in unstructured prose, trapped behind paywalls, and encoded in formats that resist computation. This friction may incidentally limit the large-scale automated propagation of errors and misuse of sensitive findings, but it is a crude safeguard. Structured, machine-readable archives could instead pair easier access with explicit governance and quality checks, enabling more reliable synthesis.
These institutional failures both reflect and deepen epistemic fragmentation. Experimental psychologists, for example, routinely deploy proprietary or poorly documented stimulus sets—images, videos, vignettes, auditory clips—that differ in valence, arousal, cultural reference, and perceptual salience13,14. Even well-validated sets are seldom cross-validated against one another, including the many affective picture databases: the International Affective Picture System (IAPS)15, Open Affective Standardized Image Set (OASIS)16, Geneva Affective Picture Database (GAPED)17, Nencki Affective Pictures System (NAPS)18, Complex Affective Scene Set (COMPASS)19, International Affective Virtual Reality System (IAVRS)20, and dozens of culturally specific successors21. Effects may then be driven by stimulus-specific confounds rather than the intended construct. Without standardized, cross-validated stimulus sets, studies using different databases or materials become incomparable, forcing meta-analysts to hand-code or exclude findings—a laborious, error-prone process that scales exponentially with literature growth22.
The measurement landscape is equally fragmented. Psychology and related behavioral and social sciences are rife with “jingle–jangle” fallacies: identical names for fundamentally different phenomena (the jingle fallacy23) and different names for the same constructs (the jangle fallacy24). “Flourishing,” for instance, bundles conflicting theoretical approaches while preserving nominal unity25. “Executive function” spans working memory, cognitive flexibility, and inhibitory control26, with research oscillating among one-, two-, three-, and nested-factor models without converging on a stable structure27. Similar taxonomic confusions arise in economics (“poverty”), medicine (“autism”), and computer science (“artificial intelligence”), but they are especially pernicious in the behavioral sciences. Aggregating findings across such disparate conceptualizations yields statistically significant but scientifically questionable results.
Jangle problems extend to measurement proliferation. A large-scale analysis of APA databases shows that thousands of new measures are published annually, yet over 70% are never used more than once, so the literature grows more fragmented over time28. This proliferation creates redundant research silos: “grit,” for example, shares most of its reliable variance with conscientiousness29; “psychological capital” often repackages existing well-being measures25. Semantic embeddings further suggest that the 277 distinct construct labels in the International Personality Item Pool could be collapsed into a more parsimonious taxonomy of just 68 clusters30.
A third failure compounds these problems: demographic blindness conceals effect heterogeneity31. Many studies report age and sex while omitting race/ethnicity and socioeconomic status, making it impossible to determine for whom effects actually hold32. Findings robust in U.S. college samples may shrink or reverse in other cultural contexts, yet current reporting practices render such moderation invisible until replication failures accumulate. This is not merely a representational concern but a threat to validity: unreported demographic moderators masquerade as statistical noise, obscuring the very patterns researchers seek to understand.
A fourth failure—data inaccessibility and impoverishment—renders many findings functionally unverifiable and unsynthesizable. Most research data remain unavailable upon request9,33-36, and even when shared, they typically appear as summary statistics rather than the individual participant data (IPD) required for rigorous verification and reuse37. This practice blocks robust forms of evidence synthesis, which depend on IPD to standardize outcomes, conduct proper intention-to-treat analyses, and examine effect heterogeneity across participant characteristics. The very existence of IPD meta-analysis—a methodological gold standard that requires manual collection of raw data from original authors37—indicts the standard scientific paper’s failure as a knowledge-delivery mechanism.
These four failures interact to create formidable barriers to knowledge accumulation. A study of “emotion regulation” (construct problem) using different picture sets (stimulus problem) across varied populations (demographic problem) with only summary statistics available (data problem) poses a computational impossibility: disentangling these sources of heterogeneity requires precisely the structured information that current publication practices withhold.
This fragmentation renders traditional evidence-synthesis methods inadequate for modern science. Systematic reviews are too slow to keep pace with literature growth, prone to error, and often yield inconclusive findings from incommensurable studies38. With more than 70,000 unique measures already documented in psychology28, manual curation has become computationally intractable.
Generative AI promises to automate knowledge synthesis at scale, yet this potential remains largely unrealized. LLMs now affect all stages of the writing process—with at least 13.5% of 2024 PubMed-indexed biomedical abstracts bearing AI-linked style markers39 and 22.5% of sentences in arXiv computer science abstracts estimated to be LLM-modified by September 202440. Yet they cannot synthesize knowledge from fragmented, incomparable studies. Worse, LLMs’ documented tendency to hallucinate citations and perpetuate training biases41 means that any AI-assisted synthesis must be grounded in verified, structured knowledge rather than unvetted prose.
Writing for two audiences
Addressing these crises requires rethinking the scientific paper’s structure to defragment the behavioral sciences and restart cumulative progress. We propose a dual-audience framework that serves time-pressed human readers and emerging AI systems by pairing a responsibly front-loaded narrative and interactive knowledge layers with machine-readable appendices (Fig. 1).

Figure 1. Publication crisis and layered solution for human and AI readers. (A) Structural problems in behavioral-science publishing: stimulus and measurement fragmentation, demographic blindness, and inaccessible or impoverished data. (B) Human-optimized layers: a front-loaded narrative and interactive knowledge layer that give readers rapid access to key findings, methods, and materials. (C) Machine-optimized layer: a machine-readable appendix package with executable code, semantic annotations, and tidy trial-level data that powers interactive reanalysis for human readers and automated reproducibility checks, construct audits, and living meta-analysis for AI agents.
Front-loading for human readers. A front‑loaded paper inverts traditional structure: instead of meandering through a literature review before revealing findings, it begins with explicitly situated answers. The opening paragraphs should address, in order: What did you discover? Why does it matter? How does it change our understanding? They should simultaneously signal the strength of evidence, scope conditions, and how readers can verify or challenge the claims (e.g., via links to the specific scripts, containers, and data tables in the structured appendix that reproduce the headline numbers). This adapts journalism’s inverted pyramid with scientific guardrails so that readers immediately see the contribution, its limits, and the paths for inspection.
Traditional abstracts nominally front‑load some information, but severe space limits and conventions that prioritize technical precision over clarity often produce text that serves gatekeepers rather than readers.
Structured appendices for human interaction and machine processing. Front-loading alone, however, is insufficient. Papers must also become layered knowledge bases that let humans and machines directly interrogate underlying data. This requires structured appendices: a standardized knowledge package that transforms supplementary materials from passive dumping grounds into active, queryable computational infrastructure.
Each appendix is built around computational reproducibility. Instead of scattered code files and static data dumps, researchers provide executable analysis environments using containerization technologies such as Docker42 or Singularity43. These containers package code, dependencies, and configurations so that anyone can reproduce the analytical pipeline from raw data to final figures.
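For instance, a reader or an automated agent could rerun a paper's entire pipeline directly from the container image named in its appendix. The minimal sketch below shells out to the Docker command-line client; the image name, mount paths, and entry-point script are hypothetical placeholders, not a prescribed layout.

```python
import subprocess
from pathlib import Path

# Hypothetical image and paths; real values would come from the paper's
# structured appendix (e.g., its machine-readable metadata file).
IMAGE = "ghcr.io/example-lab/emotion-reg-2025:v1.0.0"
DATA_DIR = Path("data").resolve()         # tidy trial-level data shipped with the paper
OUT_DIR = Path("reproduction").resolve()  # where regenerated figures and tables land
OUT_DIR.mkdir(exist_ok=True)

# Run the paper's full pipeline (raw data -> figures) inside the container.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{DATA_DIR}:/work/data:ro",   # mount the shared data read-only
        "-v", f"{OUT_DIR}:/work/output",
        IMAGE,
        "python", "analysis/run_all.py",     # hypothetical entry-point script
    ],
    check=True,
)
print("Pipeline finished; compare files in", OUT_DIR, "against the published figures.")
```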
This approach is already standard in much of scientific computing: workflow systems such as Nextflow in bioinformatics rely on containers44, and containerized pipelines and tools such as fMRIPrep (neuroimaging), BioContainers (genomics), QGIS (ecology), Astropy (astronomy), and CERN analysis workflows (physics) demonstrate broad disciplinary adoption45.
On this foundation sit three layers of semantic structure. First, every stimulus—whether drawn from existing databases or created anew—receives a persistent identifier with standardized metadata for modality, normative ratings (valence, arousal, dominance), cultural validation samples, and licensing. This mirrors the Resource Identification Initiative, which assigns persistent identifiers to biological reagents and software to improve tracking and identifiability46.
Researchers creating novel stimuli document them in the same framework, contributing to an expanding queryable ecosystem. A study might reuse validated stimuli—for example, “all positive faces with arousal > 7 validated in East Asian samples”—or introduce new ones such as “custom workplace scenarios rated for stress and cultural relevance”; in both cases, structured metadata enables future discovery and comparison. In language‑cognition experiments, the HED LANG framework already provides a standardized, machine‑readable vocabulary for annotating stimuli47.
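A minimal sketch of what such stimulus records and queries could look like in code follows; the identifier scheme, field names, and normative values are illustrative assumptions, not a proposed standard.

```python
from dataclasses import dataclass

@dataclass
class StimulusRecord:
    """Illustrative persistent-identifier record for a single stimulus."""
    stimulus_id: str            # persistent identifier (hypothetical scheme)
    modality: str               # e.g., "image", "audio", "vignette"
    category: str               # e.g., "face", "scene"
    valence: float              # normative rating, 1-9
    arousal: float              # normative rating, 1-9
    validation_samples: tuple   # populations in which norms were collected
    license: str

catalog = [
    StimulusRecord("stim:0001", "image", "face", 8.1, 7.4, ("East Asian",), "CC-BY-4.0"),
    StimulusRecord("stim:0002", "image", "face", 7.9, 6.2, ("Western European",), "CC-BY-4.0"),
    StimulusRecord("stim:0003", "image", "scene", 2.3, 7.8, ("East Asian",), "CC-BY-4.0"),
]

# "All positive faces with arousal > 7 validated in East Asian samples"
hits = [s for s in catalog
        if s.category == "face" and s.valence >= 7 and s.arousal > 7
        and "East Asian" in s.validation_samples]
print([s.stimulus_id for s in hits])
```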
Second, measurement instruments are mapped to shared conceptual spaces using ontological systems, from controlled vocabularies to formal logic-based, machine-readable ontologies48. When a study uses the Beck Depression Inventory-II, for example, items are linked to standardized depression subdimensions using Uniform Resource Identifiers (URIs) from repositories such as the Cognitive Atlas49. This semantic mapping enables AI systems to detect when ostensibly different measures assess the same construct, or when identical labels mask different phenomena, addressing the jingle–jangle problem algorithmically rather than through laborious manual coding (Box 1).
This framework makes explicit four measurement questions: What construct does this instrument measure? Why was it chosen? How are responses quantified? What study‑specific modifications were made?5 Yet infrastructure alone is insufficient; meaningful interoperability requires communities to forge consensus on shared definitions and standards50. The Human Behaviour Ontology illustrates this approach, systematically defining and relating thousands of behavioral concepts to impose coherence on fragmented research domains51.
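A minimal sketch of such item-to-construct mapping appears below; the instruments are real, but the item labels and URIs are placeholders rather than actual Cognitive Atlas identifiers.

```python
# Illustrative mapping of instrument items to construct URIs; the URIs below
# are placeholders, not real ontology entries.
measure_annotations = {
    "BDI-II": {
        "item_02_pessimism":      "https://example.org/ontology/hopelessness",
        "item_15_loss_of_energy": "https://example.org/ontology/anergia",
        "item_16_sleep_changes":  "https://example.org/ontology/sleep_disturbance",
    },
    "PHQ-9": {
        "item_3_sleep":  "https://example.org/ontology/sleep_disturbance",
        "item_4_energy": "https://example.org/ontology/anergia",
    },
}

def shared_constructs(measure_a: str, measure_b: str) -> set[str]:
    """Constructs that two nominally different instruments both map onto."""
    uris_a = set(measure_annotations[measure_a].values())
    uris_b = set(measure_annotations[measure_b].values())
    return uris_a & uris_b

# Overlapping URIs flag potential jangle: different labels, shared constructs.
print(shared_constructs("BDI-II", "PHQ-9"))
```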
Third, all data adopt tidy trial-level formats that capture the full experimental context, linking each response to its stimulus, participant characteristics, and trial conditions using standardized, inclusive demographic coding. This operationalizes the FAIR (Findable, Accessible, Interoperable, and Reusable) principles52 by sharing data at the most granular level in standardized formats (e.g., Psych-DS for behavioral data, BIDS for neuroimaging) to maximize value for secondary analysis and reuse53.
This restructuring enables queries that are impossible with current summary-statistics approaches: “Show effects for women over 60” or “Exclude WEIRD-dominated samples” become computational operations rather than manual exclusions. Readers can instantly examine how effects vary across age, education, or cultural context without requesting raw data or running new studies. They can query the executable environment to assess heterogeneity across stimuli and analytic choices—“How sensitive are results to different stimuli or preprocessing decisions?”—and manipulate interactive visualizations, adjust parameters, and explore alternative presentations in real time.
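A minimal sketch of how such queries become one-line operations on tidy trial-level data; the column names, codes, and values below are hypothetical stand-ins for a Psych-DS-style data dictionary shipped with the paper.

```python
import pandas as pd

# Illustrative tidy trial-level table; every value is invented for the example.
trials = pd.DataFrame({
    "participant_id": ["p01", "p01", "p02", "p03"],
    "age":            [63,     63,    24,    71],
    "gender":         ["woman", "woman", "man", "woman"],
    "sample_context": ["community", "community", "university_US", "community"],
    "stimulus_id":    ["stim:0001", "stim:0003", "stim:0001", "stim:0002"],
    "condition":      ["reappraise", "view", "reappraise", "reappraise"],
    "rating":         [3.0, 6.5, 4.0, 2.5],
})

# "Show effects for women over 60" as a computational operation:
subset = trials[(trials["gender"] == "woman") & (trials["age"] > 60)]
print(subset.groupby("condition")["rating"].mean())

# "Exclude WEIRD-dominated samples" (here: drop US university convenience samples):
non_weird = trials[trials["sample_context"] != "university_US"]
```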
This transforms multiverse analysis—the systematic exploration of how results vary across reasonable data-processing and analytical choices54—from a reporting burden into a native capability. Instead of cramming robustness checks into static appendices, researchers embed alternative scripts within the executable environment, allowing readers to probe a finding’s stability computationally. The appendix becomes a space where readers can extend analyses, test alternative hypotheses, and build directly on existing work without first deciphering the authors’ original narrative.
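A toy multiverse sketch on simulated data illustrates the idea, looping over a small grid of outlier-trimming and transformation choices; the specifications shown are illustrative, not a recommended set.

```python
import itertools
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Simulated trial-level data: two conditions with reaction times to be cleaned.
df = pd.DataFrame({
    "condition": np.repeat(["treatment", "control"], 200),
    "rt": np.concatenate([rng.normal(640, 120, 200), rng.normal(600, 120, 200)]),
})

# Each specification is one reasonable combination of processing choices.
outlier_cutoffs = [None, 2.5, 3.0]                    # SD-based trimming
transforms = {"raw": lambda x: x, "log": np.log}      # outcome transformation

results = []
for cutoff, (tname, tfun) in itertools.product(outlier_cutoffs, transforms.items()):
    d = df.copy()
    if cutoff is not None:
        z = (d["rt"] - d["rt"].mean()) / d["rt"].std()
        d = d[z.abs() < cutoff]
    a = tfun(d.loc[d["condition"] == "treatment", "rt"])
    b = tfun(d.loc[d["condition"] == "control", "rt"])
    t, p = stats.ttest_ind(a, b)
    results.append({"cutoff": cutoff, "transform": tname,
                    "t": round(float(t), 2), "p": round(float(p), 4)})

# One row per specification: how stable is the effect across reasonable choices?
print(pd.DataFrame(results))
```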
The framework can be adopted in stages, with a minimum viable version that most labs can implement now and a full version that depends on emerging infrastructure. A simple four‑tier roadmap is:
- Tier 0: Data, code, and provenance. Authors share the analytic dataset and scripts used to generate the main results in a stable repository, with persistent identifiers, clear licenses, and a brief provenance note describing recruitment, inclusion criteria, and key preprocessing steps.
- Tier 1: Tidy trial‑level data and basic metadata. Shared data are restructured into trial- or observation‑level tidy format and accompanied by a simple machine‑readable schema (e.g., JSON or YAML) that defines variable names, units, coding, and links between stimuli, participants, and conditions (see the sketch after this list).
- Tier 2: Executable environment and automated checks. The analysis is wrapped in a containerized environment (e.g., Docker or Singularity) together with a lightweight continuous‑integration script that reruns the main analyses and regenerates figures whenever code or data change, flagging breakage early.
- Tier 3: Ontology linking and evidence‑network integration. Measures, tasks, and stimuli are linked to shared ontologies and persistent identifiers; effect estimates and study‑level metadata are exported in standardized form suitable for automatic ingestion into domain‑specific repositories and living meta‑analyses.
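As a concrete illustration of Tiers 1 and 2, the sketch below pairs a machine-readable variable schema with the kind of automated check a continuous-integration job could run; every field name, path, and vocabulary is a hypothetical assumption rather than a proposed standard.

```python
import json
import pandas as pd

# Illustrative Tier 1 schema; fields and vocabularies are placeholders.
schema = {
    "dataset": "emotion_regulation_trials",
    "variables": {
        "participant_id": {"type": "string", "role": "identifier"},
        "age":            {"type": "integer", "units": "years"},
        "condition":      {"type": "string", "levels": ["reappraise", "view"]},
        "stimulus_id":    {"type": "string", "links_to": "stimulus_catalog"},
        "rating":         {"type": "float", "units": "1-9 scale"},
    },
}

def validate(data_path: str, schema: dict) -> list[str]:
    """Tier 2-style automated check: does the shared file match its schema?"""
    df = pd.read_csv(data_path)
    problems = []
    for name, spec in schema["variables"].items():
        if name not in df.columns:
            problems.append(f"missing column: {name}")
        elif "levels" in spec:
            undeclared = set(df[name].dropna()) - set(spec["levels"])
            if undeclared:
                problems.append(f"{name}: undeclared levels {undeclared}")
    return problems

# Export the schema as a machine-readable file alongside the data.
with open("schema.json", "w") as f:
    json.dump(schema, f, indent=2)

# In a continuous-integration job this would run on every commit, e.g.:
# print(validate("data/trials.csv", schema) or "schema check passed")
```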
In this scheme, a minimum viable dual‑audience paper corresponds to Tiers 0–1—open data and code plus tidy trial‑level data with basic metadata. Tiers 2–3 realize the full framework: executable environments, automated robustness checks, semantic linking, and native participation in living evidence networks. Journals and funders can ratchet expectations gradually, first normalizing Tiers 0–1 and treating higher tiers as aspirational targets for consortia and well‑resourced teams, rather than insisting that the world jump directly to Tier 3.
Box 1 | Automating the Jingle–Jangle Audit
The jingle fallacy conflates distinct phenomena under identical labels; the jangle fallacy fragments identical constructs across different names. Our framework embeds a dual-layer automated audit to detect both.
The first layer employs semantic and ontological analysis. A jingle detector cross-references construct names against shared ontologies (e.g., Cognitive Atlas), using graph algorithms to identify identical labels that occupy distinct semantic neighborhoods. In parallel, a jangle detector applies LLM embeddings to cluster scale items, flagging nominally different instruments whose item content shows high semantic convergence30,55.
The second layer provides empirical validation via statistical analysis56 of structured-appendix data and the living evidence network. For jangle detection, it tests extrinsic convergent validity by comparing correlation patterns between putatively different measures and external criteria—statistically indistinguishable patterns indicate empirical redundancy. For jingle detection, it tests discriminant validity by asking whether identically labeled measures show divergent correlation with theoretically distinct criteria—systematic differences reveal that a single label masks multiple constructs.
This dual-validation approach—semantic analysis for large-scale screening, statistical comparison for empirical confirmation—builds construct hygiene into the publication infrastructure. Continuous-integration scripts run these audits automatically, generating curator reports that flag redundancies and collisions before they propagate. Construct validation thus shifts from an occasional, labor‑intensive exercise to an automated quality-control mechanism operating at the scale and speed of modern scientific publishing.
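The first (semantic) layer described in Box 1 could be prototyped with off-the-shelf embedding models. The sketch below uses the sentence-transformers library as a stand-in for an LLM embedding service; the scale names, item wordings, and the 0.8 screening threshold are all illustrative assumptions.

```python
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative items from nominally different scales (hypothetical wording).
scales = {
    "Grit-Short":        ["I finish whatever I begin.",
                          "Setbacks don't discourage me."],
    "Conscientiousness": ["I complete tasks I have started.",
                          "I keep working despite obstacles."],
    "State Anxiety":     ["I feel tense right now.",
                          "I am worried at this moment."],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for any embedding model

def scale_vector(items: list[str]) -> np.ndarray:
    """Average item embeddings into one unit-length vector per instrument."""
    emb = model.encode(items, normalize_embeddings=True)
    v = emb.mean(axis=0)
    return v / np.linalg.norm(v)

vectors = {name: scale_vector(items) for name, items in scales.items()}

# Flag candidate jangle pairs: different labels, highly similar item content.
for a, b in combinations(vectors, 2):
    sim = float(vectors[a] @ vectors[b])
    if sim > 0.8:   # screening threshold; the empirical layer would then test redundancy
        print(f"possible jangle: {a} vs {b} (cosine similarity {sim:.2f})")
```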
Transforming AI-assisted peer review
The dual‑audience framework also restructures peer review. A front‑loaded narrative lets reviewers judge novelty and significance within minutes, rather than spending hours parsing methods only to discover that a study is flawed or incremental.
Moreover, the structured appendix facilitates precise auditing and more rigorous evaluation. Current peer review rarely scrutinizes data and code: a recent large-scale attempt to rerun published code found a reproducibility rate below 6%, suggesting that reviewers largely take methodological claims on trust57. With structured appendices, reviewers can inspect configuration files and data, execute code, and verify that figures regenerate from the underlying data. This reduces the frustrating back-and-forth that plagues current review cycles.
Critically, this structure provides the essential infrastructure for responsible AI-assisted review (Fig. 2). Static, print-replica PDFs are poorly suited to AI systems that require structured access to tables and figures.
For example, autonomous AI agents relying only on prose in methods sections have failed to reproduce nearly half of published findings—applying incorrect or incomplete statistical methods when text is ambiguous or omits essential details58. Providing structured, machine-friendly content (e.g., CSV, Markdown, Git) would unlock new opportunities for AI-driven quality control: validating references, auditing logical consistency, checking mathematical and statistical accuracy, and systematically verifying the structured appendix. Verification can then move beyond confirming that code runs to automated multiverse analyses that vary data‑processing choices and analytic parameters to map the fragility of a study’s conclusions.
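As one illustration of such automated quality control, the sketch below recomputes a reported test statistic from trial-level data and flags discrepancies; the file path, column names, and reported values are invented for the example rather than drawn from any real submission.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical values lifted from a paper's machine-readable results table.
reported = {"test": "ttest_ind", "t": 2.41, "df": 98, "p": 0.018}

# Recompute from the tidy trial-level data shipped in the structured appendix.
trials = pd.read_csv("data/trials.csv")          # hypothetical path
a = trials.loc[trials["condition"] == "reappraise", "rating"]
b = trials.loc[trials["condition"] == "view", "rating"]
t, p = stats.ttest_ind(a, b)

# Flag discrepancies beyond rounding error for a human editor to inspect.
if not np.isclose(t, reported["t"], atol=0.01) or not np.isclose(p, reported["p"], atol=0.001):
    print(f"MISMATCH: recomputed t={t:.2f}, p={p:.4f} vs reported "
          f"t={reported['t']}, p={reported['p']}")
else:
    print("reported statistic reproduced from trial-level data")
```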
Beyond computational verification, the structured format also improves semantic clarity. Formal ontologies impose precise, unambiguous definitions on core constructs, resolving where theoretical disagreements genuinely lie59. Review shifts from semantic excavation to focused scientific dialogue. Human reviewers receive both the manuscript and an AI‑generated audit report, freeing them to concentrate on interpretive claims, novelty, and overall significance.

Figure 2. Hybrid AI–human peer review enabled by structured appendices. Authors submit a package containing a front‑loaded manuscript and structured appendix (code, data, containers, ontologies). AI review agents run reproducibility, multiverse‑robustness, and citation‑integrity checks to generate an audit summarizing status, warnings, and fragility. Human reviewers and editors use this audit alongside the narrative to request targeted author responses, assess conceptual novelty and theoretical contribution, adjudicate interpretation versus evidence, and make final publication decisions.
From isolated papers to networked knowledge
Beyond improving communication and quality assurance, the dual-audience framework creates the technical foundation for systematic knowledge integration. Papers can be linked into living evidence networks—extensions of the manually intensive “living systematic review” model60—in which each publication becomes a node that automatically contributes to evolving synthesis.
Consider how systematic reviews and meta-analyses currently work: researchers manually search databases, screen thousands of abstracts, extract data from hundreds of PDFs, code effect sizes, and produce static summary estimates—a process that typically occupies about five researchers for a year61. Yet by publication, many meta-analyses are already outdated: median “shelf life” is 5.5 years, with 23% requiring updates within two years and 7% being obsolete on arrival62. New studies accumulate in limbo until another large manual effort is mounted. This process is further compromised by systematic publication bias: null or inconvenient results remain buried in file drawers while false positives proliferate63. Fig. 3 contrasts this static, labor-intensive pipeline with the living evidence network enabled when structured papers feed domain-specific repositories and continuously updated meta-analyses.
Living evidence networks invert this workflow. When researchers publish under this framework, whether in journals or public registries64, their structured appendices automatically populate theme-specific evidence repositories. A new report on mindfulness and anxiety immediately contributes its effect sizes, sample characteristics, and methodological features to the running meta-analysis on that topic, regardless of whether results are significant, null, or contradictory. Pooled estimates update in real time. When a paper is retracted, its data points are removed from downstream syntheses within hours rather than years, automating and accelerating the update process currently managed by teams conducting living systematic reviews65.
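A simplified sketch of such an automatic update, using standard DerSimonian-Laird random-effects pooling, is shown below; the repository contents and effect sizes are invented, and a production system would add quality weights, moderators, and richer metadata.

```python
import numpy as np

def pooled_effect(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate (simplified sketch)."""
    effects, variances = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / variances                           # fixed-effect weights
    fixed = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - fixed) ** 2)        # heterogeneity statistic Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) # between-study variance
    w_star = 1.0 / (variances + tau2)             # could also be scaled by quality scores
    est = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return est, se

# Hypothetical running repository for one topic: (effect size, sampling variance).
repository = [(0.35, 0.02), (0.22, 0.03), (0.48, 0.05)]
print("current pooled effect: %.2f (SE %.2f)" % pooled_effect(*zip(*repository)))

# A newly published structured appendix deposits its effect automatically...
repository.append((0.05, 0.04))                   # a null result joins immediately
print("updated pooled effect: %.2f (SE %.2f)" % pooled_effect(*zip(*repository)))
```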
In addition, the same structured knowledge base becomes training data for AI models that predict outcomes for novel interventions or specific populations38. The system can forecast the likely efficacy of a hypothetical intervention for a given demographic, turning evidence synthesis from retrospective summary into a tool for direct, actionable guidance.

Figure 3. From static meta-analyses to living evidence networks. (A) Traditional workflow in which scattered PDFs feed a one-off meta-analysis requiring months or years of manual screening and coding, yielding a synthesis that is already outdated at publication. (B) Structured papers with appendices support automatic extraction of effect sizes, moderators, and quality indicators into domain-specific repositories. (C) Living evidence network in which domain-specific repositories continuously update living meta-analyses and cross-domain syntheses; new studies, retractions, and registered null results dynamically alter study weights, providing real-time evidence for clinical guidelines, policy briefs, and AI prediction models.
This framework requires two additional infrastructure components beyond the semantic anchoring and ontological mapping already described. First, research artefacts—datasets, code, preregistrations, and derived effect-size tables—must be treated as versioned, citable objects. Version-control systems such as Git provide transparent audit trails, but propagating corrections requires additional layers: persistent identifiers, standardized cross‑reference metadata (e.g., DataCite‑style schemas), explicit dependency graphs linking studies to syntheses, and registries that index these links. When a dataset or analysis is corrected or retracted, systems built on this graph can automatically flag and update downstream meta‑analyses and evidence syntheses, instead of relying on formal errata and retractions that are slow and cumbersome to issue66 and often remain invisible for years67.
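A toy illustration of such a dependency graph, using the networkx library, is given below; the identifiers are placeholders, not real DOIs or registry entries.

```python
import networkx as nx

# Toy provenance graph: edges point from an artefact to things built on it.
G = nx.DiGraph()
G.add_edge("doi:10.xxxx/study-A.dataset.v2", "doi:10.xxxx/study-A.paper")
G.add_edge("doi:10.xxxx/study-A.paper", "meta:mindfulness-anxiety.living")
G.add_edge("doi:10.xxxx/study-B.paper", "meta:mindfulness-anxiety.living")
G.add_edge("meta:mindfulness-anxiety.living", "guideline:anxiety-2026")

def flag_downstream(retracted: str) -> set[str]:
    """Everything that transitively depends on a corrected or retracted artefact."""
    return nx.descendants(G, retracted)

# A correction to study A's dataset flags its paper, the living meta-analysis,
# and the clinical guideline for automatic re-evaluation.
print(flag_downstream("doi:10.xxxx/study-A.dataset.v2"))
```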
Second, quality indicators—sample size, pre-registration status, replication attempts, methodological rigor scores—must dynamically weight each study’s contribution to pooled estimates. High-quality, preregistered studies with large samples should carry more influence than exploratory work with convenience samples, so that evidence synthesis reflects both the quantity and quality of available research.
Barriers and failure modes
The framework outlined above is aspirational; without attention to implementation constraints, it could reinforce existing inequities. Building executable containers, tidy trial‑level datasets, and ontological mappings requires technical capacity that many small labs, non‑elite institutions, and practitioner settings do not yet have. If structured appendices become de facto requirements without parallel investments in infrastructure, training, and credit, adoption will be slow and skewed toward well‑resourced groups. Standardizing shared ontologies and demographic vocabularies also entails substantial coordination costs and risks freezing contested constructs or marginalizing alternative theoretical traditions unless governance is explicitly pluralistic, revisable, and transparent.
Sensitive and confidential data pose a different failure mode. In clinical, educational, and small or marginalized populations, naive mandates for fully open trial‑level data collide with privacy protections, data‑sovereignty claims, and legal or ethical constraints. The dual‑audience architecture must therefore support tiered access, secure data enclaves, and remote‑execution or synthetic‑data solutions so that code and metadata remain reusable even when raw data cannot be widely shared.
Finally, lowering the friction for reanalysis also lowers the friction for motivated misuse. Interactive appendices can make it easier for ideological actors to cherry‑pick specifications, ignore multiverse fragility or quality weights, and promote “do your own research” narratives that overstate the certainty of convenient results. Design choices can mitigate these risks by foregrounding robustness summaries rather than single estimates, making departures from preregistered analyses and default pipelines explicitly visible, and tying reanalyses back into version‑controlled living evidence networks where idiosyncratic claims are evaluated against the full corpus rather than in isolation. These barriers and failure modes are not reasons to abandon the framework but constraints that should shape governance, incentive design, and support from publishers and institutions.
Implementation and governance options
Publishers face an existential choice: continue distributing static PDFs as their gatekeeping role erodes under funder and institutional open‑access mandates, or help build the infrastructure that turns those same papers into queryable research environments. Scholarly societies, disciplinary and institutional repositories, and research libraries likewise must decide whether to merely mirror static PDFs or adopt shared formats for structured, executable paper packages that any interface can use.
For any host, two implementation paths emerge. One is to develop native AI systems integrated directly into their platforms—tools trained on scientific content with domain expertise in methodology, statistics, and interpretation. The other is to leverage browser‑ or operating‑system–level AI via secure APIs that let tools such as Chrome’s Gemini68 access full paper contexts, including structured appendices and executable code, rather than only rendered text.
Either approach enables text‑based services such as on‑demand translation that preserves technical precision, personalized summarization tailored to reader expertise, and advanced Q&A that can execute code for reanalysis or generate new visualizations. Audio and video services could automatically produce conversational podcasts or video summaries, making research accessible across formats and languages.
To remain viable as AI systems and interfaces evolve, these implementations should rest on open, versioned standards. The archival “paper package”—identifiers, metadata, data, and executable code—must remain portable across hosts and decoupled from any single AI model or user interface, so that the scientific record outlives particular vendors and tools.
These services imply a shift from charging primarily for document access toward supporting authenticated analytical capability. Funding models may range from institutional support and public infrastructure to subscription‑based access to advanced tools, while the underlying narrative text and structured appendices remain findable and portable even when specific interfaces are restricted. For sensitive data, access would operate through institutional agreements that create a trusted “data commons” of authorized researchers.
Proof-of-concept already exists. The Resource Identification Initiative standardizes identifiers for research materials46. Databrary provides infrastructure for sharing research data, including sensitive data53. EBRAINS “Live Papers” bundle code, data, and computational models for interactive simulation within neuroscience publications69. eLife’s Executable Research Articles let readers inspect, modify, and rerun the code that generates figures and tables in the browser. Schol-AR embeds manipulable visualizations into articles70. Code Ocean packages complete executable environments71. These pioneers demonstrate technical feasibility, but the transition from static document to dynamic resource remains fragmented across isolated initiatives.
The next evolution demands clear governance as well as technical innovation. Three questions are central. First, platforms hosting AI–paper interactions must protect the privacy of user queries and interaction data. Second, they must distinguish policies for confidential peer-review materials from those for published content and avoid sending unpublished manuscripts to external commercial AI systems that may retain proprietary data72. Third, they must state explicitly whether and how published content or interaction logs are used for model training, and on what legal and ethical basis. Any organization implementing AI solutions should guarantee privacy-first architectures in which user interactions are secure and encrypted, and in which data uses—including for training—are transparent and subject to meaningful consent and oversight.
Role of institutions
Universities and research institutions control the most powerful lever for adoption: career incentives. Promotion and tenure committees have historically undervalued digital research assets, creating a systemic disincentive to produce the high-quality data and code essential for a cumulative science73.
Reform therefore must be explicit and immediate. Institutions should revise hiring, promotion, and tenure criteria to prioritize work that advances long-term scientific progress, such as structured appendices and systematic consensus building—time-consuming but essential work that yields shared terminologies and methodological standards, making structured data interoperable and meaningful50. This aligns with the Coalition for Advancing Research Assessment (CoARA), which treats diverse contributions—datasets, software, code, and protocols—as legitimate scholarly products beyond journal impact factors.
But recognition alone is insufficient. Institutions must provide funding, computational resources, version-control systems, and technical training that make comprehensive documentation feasible rather than burdensome. Research libraries should expand from literature access to data curation, helping faculty turn messy research outputs into structured, queryable packages.
Early adopters gain competitive advantages. As AI-driven discovery tools emerge, institutions producing structured, machine-readable research will see their faculty become more visible and influential—if improved citation of open data is any guide74—creating systematic advantages in knowledge synthesis and collaboration.
Toward dynamic scientific authority
This transformation alters the epistemological relationship between author and audience. Traditionally, authority rests with the author’s narrative in the main text. Once raw data become directly accessible through AI queries, authority shifts toward machine-mediated analysis of evidence. Readers can pose counterfactual questions—“Re-run the analysis excluding subjects over 65” or “Plot the data using logarithmic scaling”—as computational operations rather than requests to authors. This moves beyond static replication toward dynamic exploration, making robustness checks, alternative specifications, and hypothesis generation cheap. Such interactive reanalyses remain exploratory; unbiased confirmatory tests still require prespecified design and analyses (ideally preregistered) evaluated on fresh or held‑out data not used to generate the hypotheses or analytic decisions.
This shift poses new questions for scientific practice: how theoretical synthesis should guide and interpret increasingly discoverable empirical patterns; how authority is distributed when any reader can interrogate data via AI interfaces; and how peer review should balance algorithmic validation of methods with human judgment about theory and significance.
This framework creates discipline‑agnostic infrastructure that both addresses current crises and positions science for AI‑enabled discovery: persistent identifiers, standardized metadata and provenance, executable analysis environments, and versioned audit trails. Psychology serves here as a stress test because its core objects—stimulus sets, tasks, and jingle–jangle‑prone constructs—are unusually tangled; other domains can substitute their own reagents, instruments, specimens, or datasets on the same scaffold. Widespread adoption would generate high‑quality, structured research artefacts that surpass the unstructured text currently training most models, providing a foundation for AI systems in which causal models, statistical engines, and verification tools operate on verifiable inputs. Rather than passively accepting commercial AI tools, the academic community must define how these pipelines integrate with scientific values of transparency and rigor.
References
1 Bornmann, L., Haunschild, R. & Mutz, R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications 8, 224 (2021). https://doi.org/10.1057/s41599-021-00903-w
2 National Science Board. Publications output: U.S. trends and international comparisons. (National Science Foundation, 2024).
3 Tenopir, C., King, D. W., Christian, L. & Volentine, R. Scholarly article seeking, reading, and use: A continuing evolution from print to electronic in the sciences and social sciences. Learned Publishing 28, 93-105 (2015). https://doi.org/10.1087/20150203
4 Lin, Z. Why and how to embrace AI such as ChatGPT in your academic life. R. Soc. Open Sci. 10, 230658 (2023). https://doi.org/10.1098/rsos.230658
5 Flake, J. K. & Fried, E. I. Measurement schmeasurement: Questionable measurement practices and how to avoid them. Adv. Meth. Pract. Psychol. Sci. 3, 456-465 (2020). https://doi.org/10.1177/2515245920952393
6 Clark, H. H. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior 12, 335-359 (1973). https://doi.org/10.1016/S0022-5371(73)80014-3
7 Yarkoni, T. The generalizability crisis. Behav. Brain Sci. 45, e1 (2020). https://doi.org/10.1017/s0140525x20001685
8 Sterling, E., Pearl, H., Liu, Z., Allen, J. W. & Fleischer, C. C. Demographic reporting across a decade of neuroimaging: A systematic review. Brain Imaging and Behavior 16, 2785-2796 (2022). https://doi.org/10.1007/s11682-022-00724-8
9 Tedersoo, L. et al. Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data 8, 192 (2021). https://doi.org/10.1038/s41597-021-00981-0
10 Hardwicke, T. E. et al. Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspect. Psychol. Sci. 17, 239-251 (2021). https://doi.org/10.1177/1745691620979806
11 National Academies of Sciences, Engineering, and Medicine, Division of Behavioral and Social Sciences and Education, Board on Behavioral, Cognitive, and Sensory Sciences, Committee on Accelerating Behavioral Science through Ontology Development and Use. Ontologies in the Behavioral Sciences: Accelerating Research and the Spread of Knowledge (eds A. S. Beatty & R. M. Kaplan) (National Academies Press (US), 2022).
12 Hanson, M. A., Barreiro, P. G., Crosetto, P. & Brockington, D. The strain on scientific publishing. Quantitative Science Studies 5, 823-843 (2024). https://doi.org/10.1162/qss_a_00327
13 Diconne, K., Kountouriotis, G. K., Paltoglou, A. E., Parker, A. & Hostler, T. J. Presenting KAPODI—the searchable database of emotional stimuli sets. Emotion Review 14, 84-95 (2022). https://doi.org/10.1177/17540739211072803
14 Lin, Z., Ma, Q., Huang, X., Wu, X. & Zhang, Y. Pervasive failure to report properties of visual stimuli in experimental research in psychology and neuroscience: Two metascientific studies. Psychol. Bull. 149, 487-505 (2023).
15 Lang, P. J., Bradley, M. M. & Cuthbert, B. N. International affective picture system (IAPS): Affective ratings of pictures and instruction manual. (NIMH, Center for the Study of Emotion & Attention, 2005).
16 Kurdi, B., Lozano, S. & Banaji, M. R. Introducing the Open Affective Standardized Image Set (OASIS). Behav. Res. Methods 49, 457-470 (2017). https://doi.org/10.3758/s13428-016-0715-3
17 Dan-Glauser, E. S. & Scherer, K. R. The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behav. Res. Methods 43, 468-477 (2011). https://doi.org/10.3758/s13428-011-0064-1
18 Marchewka, A., Żurawski, Ł., Jednoróg, K. & Grabowska, A. The Nencki Affective Picture System (NAPS): Introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behav. Res. Methods 46, 596-610 (2014). https://doi.org/10.3758/s13428-013-0379-1
19 Weierich, M. R., Kleshchova, O., Rieder, J. K. & Reilly, D. M. The Complex Affective Scene Set (COMPASS): Solving the social content problem in affective visual stimulus sets. Collabra: Psychology 5, 53 (2019). https://doi.org/10.1525/collabra.256
20 Mancuso, V. et al. IAVRS—International Affective Virtual Reality System: Psychometric assessment of 360° images by using psychophysiological data. Sensors 24 (2024).
21 Balsamo, M., Carlucci, L., Padulo, C., Perfetti, B. & Fairfield, B. A bottom-up validation of the IAPS, GAPED, and NAPS affective picture databases: Differential effects on behavioral performance. Front Psychol 11 (2020). https://doi.org/10.3389/fpsyg.2020.02187
22 Michelson, M. & Reuter, K. The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials. Contemporary Clinical Trials Communications 16, 100443 (2019). https://doi.org/10.1016/j.conctc.2019.100443
23 Thorndike, E. L. Theory of mental and social measurements. (The Science Press, 1904).
24 Kelley, T. L. Interpretation of educational measurements. (World Book Company, 1927).
25 van Zyl, L. E. & Rothmann, S. Grand challenges for positive psychology: Future perspectives and opportunities. Front Psychol 13, 833057 (2022). https://doi.org/10.3389/fpsyg.2022.833057
26 Baggetta, P. & Alexander, P. A. Conceptualization and operationalization of executive function. Mind, Brain, and Education 10, 10-33 (2016). https://doi.org/10.1111/mbe.12100
27 Karr, J. E. et al. The unity and diversity of executive functions: A systematic review and re-analysis of latent variable studies. Psychol. Bull. 144, 1147-1185 (2018). https://doi.org/10.1037/bul0000160
28 Anvari, F. et al. Defragmenting psychology. Nat. Hum. Behav. 9, 836-839 (2025). https://doi.org/10.1038/s41562-025-02138-0
29 Ponnock, A. et al. Grit and conscientiousness: Another jangle fallacy. J Res Pers 89, 104021 (2020). https://doi.org/10.1016/j.jrp.2020.104021
30 Wulff, D. U. & Mata, R. Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nat. Hum. Behav. 9, 944-954 (2025). https://doi.org/10.1038/s41562-024-02089-y
31 von Hippel, P. T. & Schuetze, B. A. How not to fool ourselves about heterogeneity of treatment effects. Adv. Meth. Pract. Psychol. Sci. 8, 25152459241304347 (2025). https://doi.org/10.1177/25152459241304347
32 Call, C. C. et al. An ethics and social-justice approach to collecting and using demographic data for psychological researchers. Perspect. Psychol. Sci. 18, 979-995 (2022). https://doi.org/10.1177/17456916221137350
33 Danchev, V., Min, Y., Borghi, J., Baiocchi, M. & Ioannidis, J. P. A. Evaluation of data sharing after implementation of the International Committee of Medical Journal Editors data sharing statement requirement. JAMA Network Open 4, e2033972-e2033972 (2021). https://doi.org/10.1001/jamanetworkopen.2020.33972
34 Wicherts, J. M., Borsboom, D., Kats, J. & Molenaar, D. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726-728 (2006). https://doi.org/10.1037/0003-066X.61.7.726
35 Vines, Timothy H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94-97 (2014). https://doi.org/10.1016/j.cub.2013.11.014
36 Hardwicke, T. E. & Ioannidis, J. P. A. Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLOS ONE 13, e0201856 (2018). https://doi.org/10.1371/journal.pone.0201856
37 Tierney, J. F., Stewart, L. A., Clarke, M. & on behalf of the Cochrane Individual Participant Data Meta-analysis Methods Group. in Cochrane Handbook for Systematic Reviews of Interventions 643-658 (2019).
38 Castro, O., Mair, J., von Wangenheim, F. & Kowatsch, T. in Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies – HEALTHINF. 671-678 (SciTePress).
39 Kobak, D., González-Márquez, R., Horvát, E.-Á. & Lause, J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances 11, eadt3813 (2025). https://doi.org/10.1126/sciadv.adt3813
40 Liang, W. et al. Quantifying large language model usage in scientific papers. Nat. Hum. Behav. (2025). https://doi.org/10.1038/s41562-025-02273-8
41 Walters, W. H. & Wilder, E. I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5
42 Boettiger, C. An introduction to Docker for reproducible research. SIGOPS Oper. Syst. Rev. 49, 71–79 (2015). https://doi.org/10.1145/2723872.2723882
43 Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: Scientific containers for mobility of compute. PLOS ONE 12, e0177459 (2017). https://doi.org/10.1371/journal.pone.0177459
44 Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316-319 (2017). https://doi.org/10.1038/nbt.3820
45 Moreau, D., Wiebels, K. & Boettiger, C. Containers for computational reproducibility. Nature Reviews Methods Primers 3, 50 (2023). https://doi.org/10.1038/s43586-023-00236-9
46 Bandrowski, A. et al. The Resource Identification Initiative: A cultural shift in publishing. J. Comp. Neurol. 524, 8-22 (2016). https://doi.org/10.1002/cne.23913
47 Denissen, M., Pöll, B., Robbins, K., Makeig, S. & Hutzler, F. HED LANG – A Hierarchical Event Descriptors library extension for annotation of language cognition experiments. Scientific Data 11, 1428 (2024). https://doi.org/10.1038/s41597-024-04282-0
48 Sharp, C., Kaplan, R. M. & Strauman, T. J. The use of ontologies to accelerate the behavioral sciences: Promises and challenges. Curr. Dir. Psychol. Sci. 32, 418-426 (2023). https://doi.org/10.1177/09637214231183917
49 Poldrack, R. A. et al. The Cognitive Atlas: Toward a knowledge foundation for cognitive neuroscience. Frontiers in Neuroinformatics 5 (2011). https://doi.org/10.3389/fninf.2011.00017
50 Leising, D., Liesefeld, H., Buecker, S., Glöckner, A. & Lortsch, S. A tentative roadmap for consensus building processes. Personality Science 5, 27000710241298610 (2024). https://doi.org/10.1177/27000710241298610
51 Schenk, P. et al. An ontological framework for organising and describing behaviours: The Human Behaviour Ontology. Wellcome Open Research 9 (2025). https://doi.org/10.12688/wellcomeopenres.21252.2
52 Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
53 Gilmore, R. O., Kennedy, J. L. & Adolph, K. E. Practical solutions for sharing data and materials from psychological research. Adv. Meth. Pract. Psychol. Sci. 1, 121-130 (2018). https://doi.org/10.1177/2515245917746500
54 Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11, 702-712 (2016). https://doi.org/10.1177/1745691616658637
55 Huang, Z., Long, Y., Peng, K. & Tong, S. An embedding-based semantic analysis approach: A preliminary study on redundancy detection in psychological concepts operationalized by scales. Journal of Intelligence 13 (2025). https://doi.org/10.3390/jintelligence13010011
56 Gonzalez, O., MacKinnon, D. P. & Muniz, F. B. Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research 56, 3-19 (2021). https://doi.org/10.1080/00273171.2019.1707061
57 Samuel, S. & Mietchen, D. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience 13, giad113 (2024). https://doi.org/10.1093/gigascience/giad113
58 Dobbins, N., Xiong, C., Lan, K. & Yetisgen, M. Large language model-based agents for automated research reproducibility: An exploratory study in Alzheimer’s disease. arXiv:2505.23852 (2025). https://doi.org/10.48550/arXiv.2505.23852
59 Michie, S. et al. Developing and using ontologies in behavioural science: addressing issues raised. Wellcome Open Research 7 (2023). https://doi.org/10.12688/wellcomeopenres.18211.2
60 Elliott, J. H. et al. Living systematic reviews: An emerging opportunity to narrow the evidence-practice gap. PLoS Med. 11, e1001603 (2014). https://doi.org/10.1371/journal.pmed.1001603
61 Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7, e012545 (2017). https://doi.org/10.1136/bmjopen-2016-012545
62 Shojania, K. G. et al. How quickly do systematic reviews go out of date? A survival analysis. Ann. Intern. Med. 147, 224-233 (2007). https://doi.org/10.7326/0003-4819-147-4-200708210-00179
63 Franco, A., Malhotra, N. & Simonovits, G. Underreporting in psychology experiments: Evidence from a study registry. Social Psychological and Personality Science 7, 8-12 (2016). https://doi.org/10.1177/1948550615598377
64 Laitin, D. D. et al. Reporting all results efficiently: A RARE proposal to open up the file drawer. Proc. Natl. Acad. Sci. U. S. A. 118, e2106178118 (2021). https://doi.org/10.1073/pnas.2106178118
65 Butler, A. R., Hartmann-Boyce, J., Livingstone-Banks, J., Turner, T. & Lindson, N. Optimizing process and methods for a living systematic review: 30 search updates and three review updates later. J. Clin. Epidemiol. 166, 111231 (2024). https://doi.org/10.1016/j.jclinepi.2023.111231
66 Kane, A. & Amin, B. Amending the literature through version control. Biol. Lett. 19, 20220463 (2023). https://doi.org/10.1098/rsbl.2022.0463
67 Budd, J. M., Sievert, M., Schultz, T. R. & Scoville, C. Effects of article retraction on citation and practice in medicine. Bull. Med. Libr. Assoc. 87, 437-443 (1999).
68 Lin, Z. FOCUS: an AI-assisted reading workflow for information overload. Nat. Biotechnol. 43, 2070-2075 (2025). https://doi.org/10.1038/s41587-025-02947-8
69 Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience. Neuroinformatics 21, 101-113 (2023). https://doi.org/10.1007/s12021-022-09598-z
70 Ard, T. et al. Integrating data directly into publications with augmented reality and web-based technologies – Schol-AR. Scientific Data 9, 298 (2022). https://doi.org/10.1038/s41597-022-01426-y
71 Perkel, J. M. Make code accessible with these cloud services. Nature 575, 247-248 (2019). https://doi.org/10.1038/d41586-019-03366-x
72 Lin, Z. Towards an AI policy framework in scholarly publishing. Trends Cogn. Sci. 28, 85-88 (2024). https://doi.org/10.1016/j.tics.2023.12.002
73 Puebla, I. et al. Ten simple rules for recognizing data and software contributions in hiring, promotion, and tenure. PLoS Comput. Biol. 20, e1012296 (2024). https://doi.org/10.1371/journal.pcbi.1012296
74 Piwowar, H. A. & Vision, T. J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013). https://doi.org/10.7717/peerj.175