Restructuring scientific papers for human and AI readers

Zhicheng Lin¹

¹ Department of Psychology, Yonsei University

Originally published on January 21, 2026 at:

https://doi.org/10.31234/osf.io/c46hs_v5

Abstract

Scientific communication faces a dual crisis: exponential publication growth overwhelms human readers, and fragmented research practices block automated synthesis. AI-assisted writing exacerbates the volume problem, producing papers faster than they can be read. Behavioral and social sciences in particular suffer from incomparable stimulus databases, jingle–jangle measurement fallacies, and demographic blindness that conceals effect heterogeneity. Current AI tools aid comprehension and summarization yet cannot aggregate findings from incommensurable studies and risk amplifying biases when trained on unstructured, unverified text. We propose restructuring scientific papers for dual audiences: front-loaded narratives for time-pressed human readers, paired with machine-readable appendices containing executable code, standardized metadata, and ontologically mapped constructs. This design turns papers into queryable research environments where readers can interrogate data and rerun analyses, and where structured appendices enable automated verification of statistical methods and AI-assisted peer review grounded in executable rather than narrative claims. Such papers become nodes in continuously updated evidence networks: each publication automatically contributes effect sizes to real-time meta-analyses, with corrections and retractions propagating through dependent analyses. Widespread adoption will require institutional recognition of structured documentation as essential scholarly output and computational infrastructure that serves both human comprehension and machine analysis.

Full text

Publication output doubles roughly every 17 years¹, reaching 3.3 million articles in 2022², while researchers spend ever less time on each paper³. AI tools aid summarization and question-answering⁴ but cannot solve the deeper challenge of knowledge integration when the underlying literature is incoherent and inaccessible. In the behavioral and social sciences, findings are fragmented by incomparable measures⁵, bespoke materials and stimuli⁶,⁷, and poorly documented participant demographics⁸. Stimuli, code, and data⁹ are often unavailable, and even shared data are typically limited to summary statistics rather than standardized trial- or event-level observations¹⁰. Knowledge cannot accumulate from such incommensurable fragments.

To address this dual crisis of volume and fragmentation, and in response to a recent U.S. National Academies call for infrastructure to unify scientific knowledge¹¹, we argue that the scientific paper must be rebuilt for two audiences: human readers and AI systems. For humans, this requires front-loading key findings in accessible prose for time-pressed researchers and turning papers into layered knowledge bases where readers can interrogate data, rerun analyses, and access technical details without wading through dense text. For AI systems, the main text, data, code, and protocols must be structured in machine-readable formats that enable automated analysis, comparison, and synthesis. In this Perspective, “AI system” denotes an LLM‑based pipeline that combines generative models with tools, external memory, and deterministic analysis modules, rather than a standalone language model.

Our Perspective builds on FAIR, open‑science, and reproducible‑workflow initiatives but makes four contributions. First, we propose a dual‑audience paper architecture that couples a front‑loaded narrative for human readers with a machine‑actionable structured appendix, rather than treating “supplementary materials” as an afterthought. Second, we specify a minimum viable structured appendix: not just a mandate to “share data” but a bundle of executable containers, standardized stimulus and measure identifiers, ontological mappings of constructs, and tidy trial‑level data with inclusive demographic coding. Third, we introduce a publication‑layer jingle–jangle audit, in which semantic and statistical checks on construct labels become routine infrastructure rather than occasional manual critiques. Fourth, we show how these elements together support living evidence networks, where individual papers become version‑controlled nodes feeding continuously updated, quality‑weighted syntheses.

Some journals already require authors to share datasets and scripts, typically in loosely documented repositories. These resources are idiosyncratic across studies, lack persistent identifiers for stimuli and measures, are not mapped onto shared ontologies, and rarely include trial‑level data with full provenance. They are human‑downloadable but not readily interoperable, queryable, or auditable at scale. By contrast, the dual‑audience paper and structured appendix proposed here treat interoperability, automation, and verification as first‑class design goals: standardized metadata, identifiers, and containerized workflows enable both human analysts and AI systems to rerun analyses, audit construct usage, and feed living evidence networks, rather than merely attaching files to a PDF.

Together, these components define a framework for turning individual papers into queryable research environments and inputs for AI pipelines in which synthesis and verification modules operate on verified, structured knowledge rather than unvetted prose. Our goal is not to introduce yet another platform or standard, but to specify how existing tools—containers, ontologies, data standards, and registries—can be assembled into a publication format and incentive structure that serves both human comprehension and machine analysis in psychology and other data‑rich sciences.

Crises in scientific practice and communication

The volume–fragmentation spiral has produced a cascade of institutional failures, beginning with quality control. Exponential growth in publications and preprints has overwhelmed traditional peer review¹², which no longer provides a reliable signal of quality and relevance. Researchers skim abstracts, abandon papers mid-reading, and retreat to secondary summaries. Meanwhile, the scientific paper—designed for a print era of information scarcity—has become a bottleneck in an AI-rich era: critical data are buried in unstructured prose, trapped behind paywalls, and encoded in formats that resist computation. This friction may incidentally limit large-scale automated reuse of errors and sensitive findings, but it is a crude safeguard. Structured, machine-readable archives could instead pair easier access with explicit governance and quality checks, enabling more reliable synthesis.

These institutional failures both reflect and deepen epistemic fragmentation. Experimental psychologists, for example, routinely deploy proprietary or poorly documented stimulus sets (images, videos, vignettes, auditory clips) that differ in valence, arousal, cultural reference, and perceptual salience¹³,¹⁴. Even well-validated sets are seldom cross-validated against one another, including the many affective picture databases: the International Affective Picture System (IAPS)¹⁵, Open Affective Standardized Image Set (OASIS)¹⁶, Geneva Affective Picture Database (GAPED)¹⁷, Nencki Affective Pictures System (NAPS)¹⁸, Complex Affective Scene Set (COMPASS)¹⁹, International Affective Virtual Reality System (IAVRS)²⁰, and dozens of culturally specific successors²¹. Effects may then be driven by stimulus-specific confounds rather than the intended construct. Without standardized, cross-validated stimulus sets, studies using different databases or materials become incomparable, forcing meta-analysts to hand-code or exclude findings, a laborious, error-prone process that scales exponentially with literature growth²².

The measurement landscape is equally fragmented. Psychology and related behavioral and social sciences are rife with “jingle–jangle” fallacies: identical names for fundamentally different phenomena (the jingle fallacy²³) and different names for the same constructs (the jangle fallacy²⁴). “Flourishing,” for instance, bundles conflicting theoretical approaches while preserving nominal unity²⁵. “Executive function” spans working memory, cognitive flexibility, and inhibitory control²⁶, with research oscillating among one-, two-, three-, and nested-factor models without converging on a stable structure²⁷. Similar taxonomic confusions arise in economics (“poverty”), medicine (“autism”), and computer science (“artificial intelligence”), but they are especially pernicious in the behavioral sciences. Aggregating findings across such disparate conceptualizations yields statistically significant but scientifically questionable results.

Jangle problems extend to measurement proliferation. A large-scale analysis of APA databases shows that thousands of new measures are published annually, yet over 70% are reused at most once, so the literature grows more fragmented over time²⁸. This proliferation creates redundant research silos: “grit,” for example, shares most of its reliable variance with conscientiousness²⁹; “psychological capital” often repackages existing well-being measures²⁵. Semantic embeddings further suggest that the 277 distinct construct labels in the International Personality Item Pool could be collapsed into a more parsimonious taxonomy of just 68 clusters³⁰.

A third failure compounds these problems: demographic blindness conceals effect heterogeneity³¹. Many studies report age and sex while omitting race/ethnicity and socioeconomic status, making it impossible to determine for whom effects actually hold³². Findings robust in U.S. college samples may shrink or reverse in other cultural contexts, yet current reporting practices render such moderation invisible until replication failures accumulate. This is not merely a representational concern but a threat to validity: unreported demographic moderators masquerade as statistical noise, obscuring the very patterns researchers seek to understand.

A fourth failure, data inaccessibility and impoverishment, renders many findings functionally unverifiable and unsynthesizable. Most research data remain unavailable upon request⁹,³³⁻³⁶, and even when shared, they typically appear as summary statistics rather than the individual participant data (IPD) required for rigorous verification and reuse³⁷. This practice blocks robust forms of evidence synthesis, which depend on IPD to standardize outcomes, conduct proper intention-to-treat analyses, and examine effect heterogeneity across participant characteristics. The very existence of IPD meta-analysis, a methodological gold standard that requires manual collection of raw data from original authors³⁷, indicts the standard scientific paper as a knowledge-delivery mechanism.

These four failures interact to create formidable barriers to knowledge accumulation. A study of “emotion regulation” (construct problem) using different picture sets (stimulus problem) across varied populations (demographic problem) with only summary statistics available (data problem) poses a computational impossibility: disentangling these sources of heterogeneity requires precisely the structured information that current publication practices withhold.

This fragmentation renders traditional evidence-synthesis methods inadequate for modern science. Systematic reviews are too slow to keep pace with literature growth, prone to error, and often yield inconclusive findings from incommensurable studies³⁸. With more than 70,000 unique measures already documented in psychology²⁸, manual curation has become practically intractable.

Generative AI promises to automate knowledge synthesis at scale, yet this potential remains largely unrealized. LLMs now affect all stages of the writing process—with at least 13.5% of 2024 PubMed-indexed biomedical abstracts bearing AI-linked style markers³⁹ and 22.5% of sentences in arXiv computer science abstracts estimated to be LLM-modified by September 2024⁴⁰. Yet they cannot synthesize knowledge from fragmented, incomparable studies. Worse, LLMs’ documented tendency to hallucinate citations and perpetuate training biases⁴¹ means that any AI-assisted synthesis must be grounded in verified, structured knowledge rather than unvetted prose.

Writing for two audiences

Addressing these crises requires rethinking the scientific paper’s structure to defragment the behavioral sciences and restart cumulative progress. We propose a dual-audience framework that serves time-pressed human readers and emerging AI systems by pairing a responsible front-loaded narrative and interactive knowledge layers with machine-readable appendices (Fig. 1).

Figure 1. Publication crisis and layered solution for human and AI readers. (A) Structural problems in behavioral-science publishing: stimulus and measurement fragmentation, demographic blindness, and inaccessible or impoverished data. (B) Human-optimized layers: a front-loaded narrative and interactive knowledge layer that give readers rapid access to key findings, methods, and materials. (C) Machine-optimized layer: a machine-readable appendix package with executable code, semantic annotations, and tidy trial-level data that powers interactive reanalysis for human readers and automated reproducibility checks, construct audits, and living meta-analysis for AI agents.

Front-loading for human readers. A front‑loaded paper inverts traditional structure: instead of shambling through a literature review before revealing findings, it begins with explicitly situated answers. The opening paragraphs should address, in order: What did you discover? Why does it matter? How does it change our understanding? They should simultaneously signal the strength of evidence, scope conditions, and how readers can verify or challenge the claims (e.g., via links to the specific scripts, containers, and data tables in the structured appendix that reproduce the headline numbers). This adapts journalism’s inverted pyramid with scientific guardrails so that readers immediately see the contribution, its limits, and the paths for inspection.

Traditional abstracts nominally front‑load some information, but severe space limits and conventions that prioritize technical precision over clarity often produce text that serves gatekeepers rather than readers.

Structured appendices for human interaction and machine processing. Front-loading alone, however, is insufficient. Papers must also become layered knowledge bases that let humans and machines directly interrogate underlying data. This requires structured appendices: a standardized knowledge package that transforms supplementary materials from passive dumping grounds into active, queryable computational infrastructure.

Each appendix is built around computational reproducibility. Instead of scattered code files and static data dumps, researchers provide executable analysis environments using containerization technologies such as Docker⁴² or Singularity⁴³. These containers package code, dependencies, and configurations so that anyone can reproduce the analytical pipeline from raw data to final figures.

This approach is already standard in much of scientific computing: workflow systems such as Nextflow in bioinformatics rely on containers⁴⁴, and projects such as fMRIPrep (neuroimaging) and BioContainers (genomics), together with container use in ecology (QGIS), astronomy (Astropy), and physics (CERN), demonstrate broad disciplinary adoption⁴⁵.

On this foundation sit three layers of semantic structure. First, every stimulus—whether drawn from existing databases or created anew—receives a persistent identifier with standardized metadata for modality, normative ratings (valence, arousal, dominance), cultural validation samples, and licensing. This mirrors the Resource Identification Initiative, which assigns persistent identifiers to biological reagents and software to improve tracking and identifiability⁴⁶.

Researchers creating novel stimuli document them in the same framework, contributing to an expanding queryable ecosystem. A study might reuse validated stimuli—for example, “all positive faces with arousal > 7 validated in East Asian samples”—or introduce new ones such as “custom workplace scenarios rated for stress and cultural relevance”; in both cases, structured metadata enables future discovery and comparison. In language‑cognition experiments, the HED LANG framework already provides a standardized, machine‑readable vocabulary for annotating stimuli⁴⁷.
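To make the stimulus layer concrete, the following is a minimal sketch of what a structured stimulus record and a catalog query might look like. Every field name, the `stim:` identifier scheme, the 1 to 9 rating scale, and the "valence above the scale midpoint counts as positive" rule are assumptions made for illustration; they are not part of any existing registry or of the HED LANG vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stimulus:
    """One stimulus record. All field names are illustrative assumptions."""
    stimulus_id: str           # persistent identifier (hypothetical scheme)
    modality: str              # e.g. "image", "audio", "vignette"
    category: str              # e.g. "face", "scene"
    valence: float             # normative rating, assumed 1-9 scale
    arousal: float             # normative rating, assumed 1-9 scale
    validation_samples: tuple  # regions in which the norms were collected
    license: str


# Toy catalog with made-up ratings.
CATALOG = [
    Stimulus("stim:0001", "image", "face", 7.8, 7.4, ("East Asia", "Europe"), "CC-BY-4.0"),
    Stimulus("stim:0002", "image", "face", 7.5, 6.1, ("East Asia",), "CC-BY-4.0"),
    Stimulus("stim:0003", "image", "face", 2.1, 7.9, ("East Asia",), "CC-BY-4.0"),
    Stimulus("stim:0004", "image", "scene", 8.0, 7.6, ("North America",), "CC0-1.0"),
]


def query(catalog, *, category, min_valence, min_arousal, region):
    """Return the IDs of stimuli that satisfy every criterion."""
    return [
        s.stimulus_id
        for s in catalog
        if s.category == category
        and s.valence > min_valence
        and s.arousal > min_arousal
        and region in s.validation_samples
    ]


# "All positive faces with arousal > 7 validated in East Asian samples"
positive_faces = query(
    CATALOG, category="face", min_valence=5.0, min_arousal=7.0, region="East Asia"
)
print(positive_faces)
```

Because novel stimuli would be described with the same fields as reused ones, the same query would apply to both without special handling.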

Second, measurement instruments are mapped to shared conceptual spaces using ontological systems, from controlled vocabularies to formal logic-based, machine-readable ontologies⁴⁸. When a study uses the Beck Depression Inventory-II, for example, items are linked to standardized depression subdimensions using Uniform Resource Identifiers (URIs) from repositories such as the Cognitive Atlas⁴⁹. This semantic mapping enables AI systems to detect when ostensibly different measures assess the same construct, or when identical labels mask different phenomena, addressing the jingle–jangle problem algorithmically rather than through laborious manual coding (Box 1).
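As a rough illustration of how such mappings could support an algorithmic audit, the sketch below groups construct labels by URI and URIs by label. The studies, instrument names, and `example.org` URIs are invented placeholders, not real Cognitive Atlas entries, and the label-to-URI assignments are toy data rather than claims about the constructs themselves.

```python
from collections import defaultdict

# (study, label used in the paper, instrument, construct URI). All values are toy data.
MAPPINGS = [
    ("study_A", "grit", "Scale G", "https://example.org/construct/perseverance"),
    ("study_B", "conscientiousness", "Scale C", "https://example.org/construct/perseverance"),
    ("study_C", "flourishing", "Scale X", "https://example.org/construct/hedonic_wellbeing"),
    ("study_D", "flourishing", "Scale Y", "https://example.org/construct/eudaimonic_wellbeing"),
]


def audit(mappings):
    """Return (jingle, jangle) candidates from label-to-URI mappings.

    jingle: one label mapped to several construct URIs.
    jangle: several labels mapped to one construct URI.
    """
    uris_by_label = defaultdict(set)
    labels_by_uri = defaultdict(set)
    for _study, label, _instrument, uri in mappings:
        uris_by_label[label].add(uri)
        labels_by_uri[uri].add(label)
    jingle = {label: sorted(uris) for label, uris in uris_by_label.items() if len(uris) > 1}
    jangle = {uri: sorted(labels) for uri, labels in labels_by_uri.items() if len(labels) > 1}
    return jingle, jangle


jingle, jangle = audit(MAPPINGS)
print("Possible jingle:", jingle)
print("Possible jangle:", jangle)
```

A lookup of this kind only surfaces candidates; whether two labels truly denote one construct still depends on the quality of the underlying ontology mapping.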

This framework makes explicit four measurement questions: What construct does this instrument measure? Why was it chosen? How are responses quantified? What study‑specific modifications were made?⁵ Yet infrastructure alone is insufficient; meaningful interoperability requires communities to forge consensus on shared definitions and standards⁵⁰. The Human Behaviour Ontology illustrates this approach, systematically defining and relating thousands of behavioral concepts to impose coherence on fragmented research domains⁵¹.

Third, all data adopt tidy trial-level formats that capture the full experimental context, linking each response to its stimulus, participant characteristics, and trial conditions using standardized, inclusive demographic coding. This operationalizes the FAIR (Findable, Accessible, Interoperable, and Reusable) principles⁵² by sharing data at the most granular level in standardized formats (e.g., Psych-DS for behavioral data, BIDS for neuroimaging) to maximize value for secondary analysis and reuse⁵³.

This restructuring enables queries that are impossible with current summary-statistics approaches: “Show effects for women over 60” or “Exclude WEIRD-dominated samples” become computational operations rather than manual exclusions. Readers can instantly examine how effects vary across age, education, or cultural context without requesting raw data or running new studies. They can query the executable environment to assess heterogeneity across stimuli and analytic choices—“How sensitive are results to different stimuli or preprocessing decisions?”—and manipulate interactive visualizations, adjust parameters, and explore alternative presentations in real time.
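A subgroup query of this kind can be sketched in a few lines once data are in tidy trial-level form. The column names, the reaction-time outcome, and all values below are made up for illustration, and the raw mean difference stands in for whatever effect estimate a real appendix would define.

```python
from statistics import mean

# One row per trial; every column name and value is illustrative.
TRIALS = [
    {"participant": "p01", "gender": "woman", "age": 64, "condition": "control", "stimulus_id": "stim:0001", "rt_ms": 640.0},
    {"participant": "p01", "gender": "woman", "age": 64, "condition": "treatment", "stimulus_id": "stim:0002", "rt_ms": 600.0},
    {"participant": "p02", "gender": "woman", "age": 71, "condition": "control", "stimulus_id": "stim:0001", "rt_ms": 700.0},
    {"participant": "p02", "gender": "woman", "age": 71, "condition": "treatment", "stimulus_id": "stim:0002", "rt_ms": 640.0},
    {"participant": "p03", "gender": "man", "age": 66, "condition": "control", "stimulus_id": "stim:0001", "rt_ms": 620.0},
    {"participant": "p03", "gender": "man", "age": 66, "condition": "treatment", "stimulus_id": "stim:0002", "rt_ms": 615.0},
    {"participant": "p04", "gender": "woman", "age": 25, "condition": "control", "stimulus_id": "stim:0001", "rt_ms": 500.0},
    {"participant": "p04", "gender": "woman", "age": 25, "condition": "treatment", "stimulus_id": "stim:0002", "rt_ms": 495.0},
]


def condition_difference(trials, predicate=lambda row: True):
    """Mean RT in treatment minus mean RT in control, among rows matching predicate.

    Returns None when either condition has no matching rows.
    """
    subset = [row for row in trials if predicate(row)]
    treatment = [row["rt_ms"] for row in subset if row["condition"] == "treatment"]
    control = [row["rt_ms"] for row in subset if row["condition"] == "control"]
    if not treatment or not control:
        return None
    return mean(treatment) - mean(control)


overall = condition_difference(TRIALS)
women_over_60 = condition_difference(
    TRIALS, lambda row: row["gender"] == "woman" and row["age"] > 60
)
print(overall, women_over_60)
```

With summary statistics alone, the second call would be impossible; with trial-level rows it is a one-line filter.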

This transforms multiverse analysis—the systematic exploration of how results vary across reasonable data-processing and analytical choices⁵⁴—from a reporting burden into a native capability. Instead of cramming robustness checks into static appendices, researchers embed alternative scripts within the executable environment, allowing readers to probe a finding’s stability computationally. The appendix becomes a space where readers can extend analyses, test alternative hypotheses, and build directly on existing work without first deciphering the authors’ original narrative.
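The idea can be sketched as a loop over analytic choices. In the toy example below, the two choices (a reaction-time cutoff and a minimum-accuracy exclusion rule), their candidate values, and the participant-level data are all assumptions chosen to show how one specification can reverse the sign of an effect.

```python
from itertools import product
from statistics import mean

# One row per participant: (condition, mean RT in ms, accuracy). Toy values.
DATA = [
    ("treatment", 520.0, 0.95),
    ("treatment", 540.0, 0.90),
    ("treatment", 1600.0, 0.70),
    ("control", 600.0, 0.92),
    ("control", 620.0, 0.85),
    ("control", 580.0, 0.60),
]

RT_CUTOFFS = [1000.0, 1500.0, None]  # None means "no RT exclusion"
MIN_ACCURACY = [0.0, 0.8]            # 0.0 means "no accuracy exclusion"


def estimate(rows, rt_cutoff, min_accuracy):
    """Treatment-minus-control difference in mean RT under one specification."""
    kept = [
        (cond, rt)
        for cond, rt, acc in rows
        if (rt_cutoff is None or rt <= rt_cutoff) and acc >= min_accuracy
    ]
    treatment = [rt for cond, rt in kept if cond == "treatment"]
    control = [rt for cond, rt in kept if cond == "control"]
    return mean(treatment) - mean(control)


def multiverse(rows):
    """Evaluate every combination of the analytic choices."""
    return [
        {"rt_cutoff": cutoff, "min_accuracy": acc, "effect": estimate(rows, cutoff, acc)}
        for cutoff, acc in product(RT_CUTOFFS, MIN_ACCURACY)
    ]


results = multiverse(DATA)
for spec in results:
    print(spec)
share_negative = sum(r["effect"] < 0 for r in results) / len(results)
print(f"Share of specifications with a negative effect: {share_negative:.2f}")
```

In this toy dataset five of the six specifications yield a negative effect, and the one without any exclusion rule flips the sign, which is the kind of fragility a reader could surface directly from the executable environment.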

The framework can be adopted in stages, with a minimum viable version that most labs can implement now and a full version that depends on emerging infrastructure. A simple four‑tier roadmap is:

Tier 0: Data, code, and provenance. Authors share the analytic dataset and scripts used to generate the main results in a stable repository, with persistent identifiers, clear licenses, and a brief provenance note describing recruitment, inclusion criteria, and key preprocessing steps.
Tier 1: Tidy trial‑level data and basic metadata. Shared data are restructured into trial- or observation‑level tidy format and accompanied by a simple machine‑readable schema (e.g., JSON or YAML) that defines variable names, units, coding, and links between stimuli, participants, and conditions.
Tier 2: Executable environment and automated checks. The analysis is wrapped in a containerized environment (e.g., Docker or Singularity) together with a lightweight continuous‑integration script that reruns the main analyses and regenerates figures whenever code or data change, flagging breakage early.
Tier 3: Ontology linking and evidence‑network integration. Measures, tasks, and stimuli are linked to shared ontologies and persistent identifiers; effect estimates and study‑level metadata are exported in standardized form suitable for automatic ingestion into domain‑specific repositories and living meta‑analyses.

In this scheme, a minimum viable dual‑audience paper corresponds to Tiers 0–1—open data and code plus tidy trial‑level data with basic metadata. Tiers 2–3 realize the full framework: executable environments, automated robustness checks, semantic linking, and native participation in living evidence networks. Journals and funders can ratchet expectations gradually, first normalizing Tiers 0–1 and treating higher tiers as aspirational targets for consortia and well‑resourced teams, rather than insisting that the world jump directly to Tier 3.
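For Tier 1, a "simple machine-readable schema" might be as small as the JSON document below, paired with a check that each data row conforms to it. The layout of the schema (keys such as `variables`, `levels`, and `links_to`) is an ad hoc assumption for illustration; it is neither the Psych-DS format nor JSON Schema.

```python
import json

# Hypothetical Tier 1 schema. The key names are illustrative, not a standard.
SCHEMA_JSON = """
{
  "dataset": "example_study_trials",
  "variables": {
    "participant": {"type": "string"},
    "age": {"type": "integer", "unit": "years"},
    "condition": {"type": "string", "levels": ["control", "treatment"]},
    "stimulus_id": {"type": "string", "links_to": "stimuli.stimulus_id"},
    "rt_ms": {"type": "number", "unit": "milliseconds"}
  }
}
"""

PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def validate(row, schema):
    """Return a list of human-readable problems for one data row."""
    problems = []
    variables = schema["variables"]
    for name, spec in variables.items():
        if name not in row:
            problems.append(f"missing variable: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
        elif "levels" in spec and value not in spec["levels"]:
            problems.append(f"{name}: unexpected level {value!r}")
    for name in row:
        if name not in variables:
            problems.append(f"undocumented variable: {name}")
    return problems


schema = json.loads(SCHEMA_JSON)
good_row = {"participant": "p01", "age": 64, "condition": "control",
            "stimulus_id": "stim:0001", "rt_ms": 640.0}
bad_row = {"participant": "p02", "age": 71, "condition": "placebo",
           "stimulus_id": "stim:0001"}
print(validate(good_row, schema))
print(validate(bad_row, schema))
```

A check like this is cheap enough to run on every commit, which is what allows the Tier 2 continuous-integration step to build on Tier 1 rather than replace it.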

**Box 1 | Automating the Jingle–Jangle Audit**
The jingle fallacy conflates distinct phenomena under identical labels; the jangle fallacy fragments identical constructs across different names. Our framework embeds a dual-layer automated audit to detect both. The first layer employs semantic and ontological analysis. A jingle detector cross-references construct names against shared ontologies (e.g., Cognitive Atlas), using graph algorithms to identify identical labels that occupy distinct semantic neighborhoods. In parallel, a jangle detector applies LLM embeddings to cluster scale items, flagging nominally different instruments whose item content shows high semantic convergence³⁰,⁵⁵. The second layer provides empirical validation via statistical analysis⁵⁶ of structured-appendix data and the living evidence network. For jangle detection, it tests extrinsic convergent validity by comparing correlation patterns between putatively different measures and external criteria; statistically indistinguishable patterns indicate empirical redundancy. For jingle detection, it tests discriminant validity by asking whether identically labeled measures show divergent correlations with theoretically distinct criteria; systematic differences reveal that a single label masks multiple constructs. This dual-validation approach (semantic analysis for large-scale screening, statistical comparison for empirical confirmation) builds construct hygiene into the publication infrastructure. Continuous-integration scripts run these audits automatically, generating curator reports that flag redundancies and collisions before they propagate. Construct validation thus shifts from an occasional, labor‑intensive exercise to an automated quality-control mechanism operating at the scale and speed of modern scientific publishing.
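The semantic screening step of the jangle detector can be illustrated with a deliberately simplified stand-in. The sketch below replaces LLM embeddings with bag-of-words counts and cosine similarity so that it runs without external models; the scale names, item wordings, stopword list, and the 0.6 threshold are all invented for illustration and would need empirical calibration in practice.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Toy instruments with invented items.
SCALES = {
    "scale_a": ["I finish whatever I begin", "I work hard to reach my goals"],
    "scale_b": ["I work hard to finish what I begin", "I keep working until I reach my goals"],
    "scale_c": ["I enjoy meeting new people", "I feel comfortable at large parties"],
}

STOPWORDS = {"i", "to", "my", "at"}


def embed(items):
    """Bag-of-words vector for a scale (a crude stand-in for an LLM embedding)."""
    words = re.findall(r"[a-z]+", " ".join(items).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def flag_jangle(scales, threshold=0.6):
    """Return pairs of differently named scales whose item content converges."""
    vectors = {name: embed(items) for name, items in scales.items()}
    flagged = []
    for x, y in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[x], vectors[y])
        if similarity >= threshold:
            flagged.append((x, y, similarity))
    return flagged


for x, y, similarity in flag_jangle(SCALES):
    print(f"{x} and {y} may measure the same construct (similarity {similarity:.2f})")
```

Pairs flagged at this stage would then pass to the second, statistical layer for confirmation rather than being treated as redundant on semantic grounds alone.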

Transforming AI-assisted peer review

The dual‑audience framework also restructures peer review. A front‑loaded narrative lets reviewers judge novelty and significance within minutes, rather than spending hours parsing methods only to discover that a study is flawed or incremental.

Moreover, the structured appendix facilitates precise auditing and more rigorous evaluation. Current peer review rarely scrutinizes data and code: a recent large-scale attempt to rerun published code found a reproducibility rate below 6%, suggesting that reviewers largely take methodological claims on trust⁵⁷. With structured appendices, reviewers can inspect configuration files and data, execute code, and verify that figures regenerate from the underlying data. This reduces the frustrating back-and-forth that plagues current review cycles.

Critically, this structure provides the essential infrastructure for responsible AI-assisted review (Fig. 2). Static, print-replica PDFs are poorly suited to AI systems that require structured access to tables and figures.

For example, autonomous AI agents relying only on prose in methods sections have failed to reproduce nearly half of published findings—applying incorrect or incomplete statistical methods when text is ambiguous or omits essential details⁵⁸. Providing structured, machine-friendly content (e.g., CSV, Markdown, Git) would unlock new opportunities for AI-driven quality control: validating references, auditing logical consistency, checking mathematical and statistical accuracy, and systematically verifying the structured appendix. Verification can then move beyond confirming that code runs to automated multiverse analyses that vary data‑processing choices and analytic parameters to map the fragility of a study’s conclusions.
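One of the simplest statistical-accuracy checks is to recompute a reported p-value from its test statistic. The sketch below does this for z statistics only, because the Python standard library has no t distribution; the record layout, the example values, and the 0.005 tolerance are assumptions for illustration rather than features of any existing checking tool.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical results as they might be exported from a structured appendix.
REPORTED = [
    {"id": "H1", "z": 1.96, "reported_p": 0.05},
    {"id": "H2", "z": 1.50, "reported_p": 0.03},
]


def audit_p_values(results, tolerance=0.005):
    """Flag results whose reported p differs from the recomputed p by more than tolerance."""
    flags = []
    for result in results:
        recomputed = two_sided_p_from_z(result["z"])
        if abs(recomputed - result["reported_p"]) > tolerance:
            flags.append((result["id"], result["reported_p"], round(recomputed, 3)))
    return flags


for result_id, reported, recomputed in audit_p_values(REPORTED):
    print(f"{result_id}: reported p = {reported}, recomputed p = {recomputed}")
```

Such a check depends on test statistics being available as structured fields; extracting them from prose is exactly the step that ambiguous methods sections make unreliable.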

Beyond computational verification, the structured format also improves semantic clarity. Formal ontologies impose precise, unambiguous definitions on core constructs, resolving where theoretical disagreements genuinely lie⁵⁹. Review shifts from semantic excavation to focused scientific dialogue. Human reviewers receive both the manuscript and an AI‑generated audit report, freeing them to concentrate on interpretive claims, novelty, and overall significance.

Figure 2. Hybrid AI–human peer review enabled by structured appendices. Authors submit a package containing a front‑loaded manuscript and structured appendix (code, data, containers, ontologies). AI review agents run reproducibility, multiverse‑robustness, and citation‑integrity checks to generate an audit summarizing status, warnings, and fragility. Human reviewers and editors use this audit alongside the narrative to request targeted author responses, assess conceptual novelty and theoretical contribution, adjudicate interpretation versus evidence, and make final publication decisions.

From isolated papers to networked knowledge

Beyond improving communication and quality assurance, the dual-audience framework creates the technical foundation for systematic knowledge integration. Papers can be linked into living evidence networks—extensions of the manually intensive “living systematic review” model⁶⁰—in which each publication becomes a node that automatically contributes to evolving synthesis.

Consider how systematic reviews and meta-analyses currently work: researchers manually search databases, screen thousands of abstracts, extract data from hundreds of PDFs, code effect sizes, and produce static summary estimates—a process that typically occupies about five researchers for a year⁶¹. Yet by publication, many meta-analyses are already outdated: median “shelf life” is 5.5 years, with 23% requiring updates within two years and 7% being obsolete on arrival⁶². New studies accumulate in limbo until another large manual effort is mounted. This process is further compromised by systematic publication bias: null or inconvenient results remain buried in file drawers while false positives proliferate⁶³. Fig. 3 contrasts this static, labor-intensive pipeline with the living evidence network enabled when structured papers feed domain-specific repositories and continuously updated meta-analyses.

Living evidence networks invert this workflow. When researchers publish under this framework, whether in journals or public registries⁶⁴, their structured appendices automatically populate theme-specific evidence repositories. A new report on mindfulness and anxiety immediately contributes its effect sizes, sample characteristics, and methodological features to the running meta-analysis on that topic, regardless of whether results are significant, null, or contradictory. Pooled estimates update in real time. When a paper is retracted, its data points are removed from downstream syntheses within hours rather than years, automating and accelerating the update process currently managed by teams conducting living systematic reviews⁶⁵.
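The update-on-publication and remove-on-retraction behavior can be sketched with a small running meta-analysis. The example assumes a fixed-effect, inverse-variance model purely for brevity (a real network would more plausibly use random-effects or multilevel models), and the study identifiers and effect sizes are invented.

```python
import math


class LivingMetaAnalysis:
    """Running fixed-effect meta-analysis (inverse-variance weights).

    The fixed-effect model is a simplifying assumption for this sketch.
    """

    def __init__(self):
        self.studies = {}  # study_id -> (effect, variance)

    def add(self, study_id, effect, variance):
        """Register or correct a study; null and contradictory results count too."""
        if variance <= 0:
            raise ValueError("variance must be positive")
        self.studies[study_id] = (effect, variance)

    def retract(self, study_id):
        """Drop a retracted study from the synthesis."""
        self.studies.pop(study_id, None)

    def pooled(self):
        """Return (estimate, standard error), or None when no studies remain."""
        if not self.studies:
            return None
        weights = [1.0 / variance for _, variance in self.studies.values()]
        effects = [effect for effect, _ in self.studies.values()]
        total = sum(weights)
        estimate = sum(w * e for w, e in zip(weights, effects)) / total
        return estimate, math.sqrt(1.0 / total)


network = LivingMetaAnalysis()
network.add("study_1", effect=0.5, variance=0.04)
network.add("study_2", effect=0.1, variance=0.04)
print("After two studies:", network.pooled())

network.add("study_3", effect=-0.1, variance=0.02)  # a contradictory result still counts
print("After a new study:", network.pooled())

network.retract("study_1")
print("After a retraction:", network.pooled())
```

Calling `add` again with an existing identifier overwrites the earlier entry, which is one simple way a corrected effect size could replace the original.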

In addition, the same structured knowledge base becomes training data for AI models that predict outcomes for novel interventions or specific populations³⁸. The system can forecast the likely efficacy of a hypothetical intervention for a given demographic, turning evidence synthesis from retrospective summary into a tool for direct, actionable guidance.

Figure 3. From static meta-analyses to living evidence networks. (A) Traditional workflow in which scattered PDFs feed a one-off meta-analysis requiring months or years of manual screening and coding, yielding a synthesis that is already outdated at publication. (B) Structured papers with appendices support automatic extraction of effect sizes, moderators, and quality indicators into domain-specific repositories. (C) Living evidence network in which domain-specific repositories continuously update living meta-analyses and cross-domain syntheses; new studies, retractions, and registered null results dynamically alter study weights, providing real-time evidence for clinical guidelines, policy briefs, and AI prediction models.

This framework requires two additional infrastructure components beyond the semantic anchoring and ontological mapping already described. First, research artefacts—datasets, code, preregistrations, and derived effect-size tables—must be treated as versioned, citable objects. Version-control systems such as Git provide transparent audit trails, but propagating corrections requires additional layers: persistent identifiers, standardized cross‑reference metadata (e.g., DataCite‑style schemas), explicit dependency graphs linking studies to syntheses, and registries that index these links. When a dataset or analysis is corrected or retracted, systems built on this graph can automatically flag and update downstream meta‑analyses and evidence syntheses, instead of relying on formal errata and retractions that are slow and cumbersome to issue⁶⁶ and often remain invisible for years⁶⁷.
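An explicit dependency graph makes correction propagation a graph traversal. In the sketch below, the artifact identifiers and the `DEPENDS_ON` structure are hypothetical; a real system would read these links from registry metadata rather than from a hard-coded dictionary.

```python
from collections import deque

# artifact -> artifacts it was derived from. All identifiers are made up.
DEPENDS_ON = {
    "effects:study_A": ["dataset:study_A_v2"],
    "effects:study_B": ["dataset:study_B_v1"],
    "meta:anxiety_v3": ["effects:study_A", "effects:study_B"],
    "guideline:2026": ["meta:anxiety_v3"],
}


def downstream(artifact, depends_on):
    """List every artifact that directly or indirectly depends on `artifact`.

    Uses breadth-first search over the inverted dependency graph.
    """
    dependents = {}
    for child, parents in depends_on.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(child)

    affected = []
    visited = {artifact}
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in visited:
                visited.add(dependent)
                affected.append(dependent)
                queue.append(dependent)
    return affected


# A correction to one dataset flags everything built on it.
print(downstream("dataset:study_A_v2", DEPENDS_ON))
```

The traversal only identifies what needs attention; whether each flagged synthesis is recomputed automatically or queued for human review is a governance decision.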

Second, quality indicators—sample size, pre-registration status, replication attempts, methodological rigor scores—must dynamically weight each study’s contribution to pooled estimates. High-quality, preregistered studies with large samples should carry more influence than exploratory work with convenience samples, so that evidence synthesis reflects both the quantity and quality of available research.
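One way to picture quality weighting is as a multiplier applied to each study's inverse-variance weight. The sketch below uses a single indicator (preregistration) and an arbitrary multiplier of 2.0; both the multiplicative form and the value are assumptions for illustration, and any real scheme would need to be justified and agreed by the relevant community.

```python
# Toy studies with invented effects. Sample size enters through the variance.
STUDIES = [
    {"id": "A", "effect": 0.6, "variance": 0.04, "preregistered": False},
    {"id": "B", "effect": 0.2, "variance": 0.04, "preregistered": True},
]

PREREG_MULTIPLIER = 2.0  # arbitrary illustrative value, not an established convention


def pooled_estimate(studies, use_quality=True):
    """Weighted mean effect: inverse-variance weights, optionally scaled by quality."""
    numerator = 0.0
    denominator = 0.0
    for study in studies:
        weight = 1.0 / study["variance"]
        if use_quality and study["preregistered"]:
            weight *= PREREG_MULTIPLIER
        numerator += weight * study["effect"]
        denominator += weight
    return numerator / denominator


print("Precision weights only:", round(pooled_estimate(STUDIES, use_quality=False), 3))
print("With quality weights:  ", round(pooled_estimate(STUDIES, use_quality=True), 3))
```

With equal variances the unadjusted estimate sits midway between the two studies, and the quality multiplier pulls it toward the preregistered one.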

Barriers and failure modes

The framework outlined above is aspirational; without attention to implementation constraints, it could reinforce existing inequities. Building executable containers, tidy trial‑level datasets, and ontological mappings requires technical capacity that many small labs, non‑elite institutions, and practitioner settings do not yet have. If structured appendices become de facto requirements without parallel investments in infrastructure, training, and credit, adoption will be slow and skewed toward well‑resourced groups. Standardization for shared ontologies and demographic vocabularies also entails substantial coordination costs and risks freezing contested constructs or marginalizing alternative theoretical traditions unless governance is explicitly pluralistic, revisable, and transparent.

Sensitive and confidential data pose a different failure mode. In clinical, educational, and small or marginalized populations, naive mandates for fully open trial‑level data collide with privacy protections, data‑sovereignty claims, and legal or ethical constraints. The dual‑audience architecture must therefore support tiered access, secure data enclaves, and remote‑execution or synthetic‑data solutions so that code and metadata remain reusable even when raw data cannot be widely shared.

Finally, lowering the friction for reanalysis also lowers the friction for motivated misuse. Interactive appendices can make it easier for ideological actors to cherry‑pick specifications, ignore multiverse fragility or quality weights, and promote “do your own research” narratives that overstate the certainty of convenient results. Design choices can mitigate these risks by foregrounding robustness summaries rather than single estimates, making departures from preregistered analyses and default pipelines explicitly visible, and tying reanalyses back into version‑controlled living evidence networks where idiosyncratic claims are evaluated against the full corpus rather than in isolation. These barriers and failure modes are not reasons to abandon the framework but constraints that should shape governance, incentive design, and support from publishers and institutions.

Implementation and governance options

Publishers face an existential choice: continue distributing static PDFs as their gatekeeping role erodes under funder and institutional open‑access mandates, or help build the infrastructure that turns those same papers into queryable research environments. Scholarly societies, disciplinary and institutional repositories, and research libraries likewise must decide whether to merely mirror static PDFs or adopt shared formats for structured, executable paper packages that any interface can use.

For any host, two implementation paths emerge. One is to develop native AI systems integrated directly into their platforms—tools trained on scientific content with domain expertise in methodology, statistics, and interpretation. The other is to leverage browser‑ or operating‑system–level AI via secure APIs that let tools such as Chrome’s Gemini⁶⁸ access full paper contexts, including structured appendices and executable code, rather than only rendered text.

Either approach enables text‑based services such as on‑demand translation that preserves technical precision, personalized summarization tailored to reader expertise, and advanced Q&A that can execute code for reanalysis or generate new visualizations. Audio and video services could automatically produce conversational podcasts or video summaries, making research accessible across formats and languages.

To remain viable as AI systems and interfaces evolve, these implementations should rest on open, versioned standards. The archival “paper package”—identifiers, metadata, data, and executable code—must remain portable across hosts and decoupled from any single AI model or user interface, so that the scientific record outlives particular vendors and tools.
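As a rough illustration of what a portable package could look like, the sketch below serializes a manifest to plain JSON so that it does not depend on any host or AI interface. The class name `PaperPackage`, its fields, the placeholder identifier, and the file paths are all hypothetical; no existing standard is implied.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PaperPackage:
    """Hypothetical manifest for an archival paper package (illustrative only)."""

    identifier: str  # persistent ID; the value used below is a placeholder
    version: str  # version of the package itself
    metadata: dict = field(default_factory=dict)
    data_files: list = field(default_factory=list)
    code_entry_point: str = ""

    def to_json(self) -> str:
        # Plain JSON with sorted keys keeps the manifest diff-friendly and tool-agnostic.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "PaperPackage":
        return cls(**json.loads(text))


package = PaperPackage(
    identifier="doi:10.0000/example",  # placeholder, not a real DOI
    version="1.0.0",
    metadata={"title": "Example study", "license": "CC-BY-4.0"},
    data_files=["data/trials.csv"],
    code_entry_point="analysis/run.py",
)

# Round trip: any host that can read JSON can reconstruct the same manifest.
restored = PaperPackage.from_json(package.to_json())
print(restored == package)
```

Keeping the manifest in a plain, versioned text format is one way to decouple the archival record from whichever interface happens to read it.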

These services imply a shift from charging primarily for document access toward supporting authenticated analytical capability. Funding models may range from institutional support and public infrastructure to subscription‑based access to advanced tools, while the underlying narrative text and structured appendices remain findable and portable even when specific interfaces are restricted. For sensitive data, access would operate through institutional agreements that create a trusted “data commons” of authorized researchers.

Proofs of concept already exist. The Resource Identification Initiative standardizes identifiers for research materials⁴⁶. Databrary provides infrastructure for sharing research data, including sensitive data⁵³. EBRAINS “Live Papers” bundle code, data, and computational models for interactive simulation within neuroscience publications⁶⁹. eLife’s Executable Research Articles let readers inspect, modify, and rerun the code that generates figures and tables in the browser. Schol-AR embeds manipulable visualizations into articles⁷⁰. Code Ocean packages complete executable environments⁷¹. These pioneers demonstrate technical feasibility, but the transition from static document to dynamic resource remains fragmented across isolated initiatives.

The next evolution demands clear governance as well as technical innovation. Three questions are central. First, platforms hosting AI–paper interactions must protect the privacy of user queries and interaction data. Second, they must distinguish policies for confidential peer-review materials from those for published content and avoid sending unpublished manuscripts to external commercial AI systems that may retain proprietary data⁷². Third, they must state explicitly whether and how published content or interaction logs are used for model training, and on what legal and ethical basis. Any organization implementing AI solutions should guarantee privacy-first architectures in which user interactions are secure and encrypted, and in which data uses—including for training—are transparent and subject to meaningful consent and oversight.

Role of institutions

Universities and research institutions control the most powerful lever for adoption: career incentives. Promotion and tenure committees have historically undervalued digital research assets, creating a systemic disincentive to produce the high-quality data and code essential for a cumulative science⁷³.

Reform therefore must be explicit and immediate. Institutions should revise hiring, promotion, and tenure criteria to prioritize work that advances long-term scientific progress, such as structured appendices and systematic consensus building—time-consuming but essential work that yields shared terminologies and methodological standards, making structured data interoperable and meaningful⁵⁰. This aligns with the Coalition for Advancing Research Assessment (CoARA), which treats diverse contributions—datasets, software, code, and protocols—as legitimate scholarly products beyond journal impact factors.

But recognition alone is insufficient. Institutions must provide funding, computational resources, version-control systems, and technical training that make comprehensive documentation feasible rather than burdensome. Research libraries should expand from literature access to data curation, helping faculty turn messy research outputs into structured, queryable packages.

Early adopters gain competitive advantages. As AI-driven discovery tools emerge, institutions producing structured, machine-readable research will see their faculty become more visible and influential—if improved citation of open data is any guide⁷⁴—creating systematic advantages in knowledge synthesis and collaboration.

Toward dynamic scientific authority

This transformation alters the epistemological relationship between author and audience. Traditionally, authority rests with the author’s narrative in the main text. Once raw data become directly accessible through AI queries, authority shifts toward machine-mediated analysis of evidence. Readers can pose counterfactual questions—“Re-run the analysis excluding subjects over 65” or “Plot the data using logarithmic scaling”—as computational operations rather than requests to authors. This moves beyond static replication toward dynamic exploration, making robustness checks, alternative specifications, and hypothesis generation cheap. Such interactive reanalyses remain exploratory; unbiased confirmatory tests still require prespecified design and analyses (ideally preregistered) evaluated on fresh or held‑out data not used to generate the hypotheses or analytic decisions.
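A reader's request such as "Re-run the analysis excluding subjects over 65" can be pictured as the same analysis function called with a different inclusion rule. The sketch below is a toy under stated assumptions: the records are made up, and `mean_difference`, `keep`, and the field names are hypothetical rather than part of any described system.

```python
from statistics import mean


def mean_difference(records, keep=lambda r: True):
    """Difference in mean outcome (treatment minus control) among kept records."""
    kept = [r for r in records if keep(r)]
    treated = [r["outcome"] for r in kept if r["group"] == "treatment"]
    control = [r["outcome"] for r in kept if r["group"] == "control"]
    return mean(treated) - mean(control)


# Made-up toy records (assumption), standing in for shared trial-level data.
RECORDS = [
    {"age": 30, "group": "treatment", "outcome": 6.0},
    {"age": 45, "group": "treatment", "outcome": 8.0},
    {"age": 70, "group": "treatment", "outcome": 13.0},
    {"age": 28, "group": "control", "outcome": 5.0},
    {"age": 50, "group": "control", "outcome": 5.0},
    {"age": 72, "group": "control", "outcome": 5.0},
]

# Original analysis, then the reader's counterfactual query as a filter.
full = mean_difference(RECORDS)
age_65_or_under = mean_difference(RECORDS, keep=lambda r: r["age"] <= 65)
print(full, age_65_or_under)
```

In this toy data the estimate shrinks once older participants are excluded, which is the kind of pattern that calls for a prespecified confirmatory test rather than a conclusion drawn from the exploratory query itself.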

This shift poses new questions for scientific practice: how theoretical synthesis should guide and interpret increasingly discoverable empirical patterns; how authority is distributed when any reader can interrogate data via AI interfaces; and how peer review should balance algorithmic validation of methods with human judgment about theory and significance.

This framework creates discipline‑agnostic infrastructure that both addresses current crises and positions science for AI‑enabled discovery: persistent identifiers, standardized metadata and provenance, executable analysis environments, and versioned audit trails. Psychology serves here as a stress test because its core objects—stimulus sets, tasks, and jingle–jangle‑prone constructs—are unusually tangled; other domains can substitute their own reagents, instruments, specimens, or datasets on the same scaffold. Widespread adoption would generate high‑quality, structured research artefacts that surpass the unstructured text currently training most models, providing a foundation for AI systems in which causal models, statistical engines, and verification tools operate on verifiable inputs. Rather than passively accepting commercial AI tools, the academic community must define how these pipelines integrate with scientific values of transparency and rigor.

References

1 Bornmann, L., Haunschild, R. & Mutz, R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications 8, 224 (2021). https://doi.org/10.1057/s41599-021-00903-w

2 National Science Board. Publications output: U.S. trends and international comparisons. (National Science Foundation, 2024).

3 Tenopir, C., King, D. W., Christian, L. & Volentine, R. Scholarly article seeking, reading, and use: A continuing evolution from print to electronic in the sciences and social sciences. Learned Publishing 28, 93-105 (2015). https://doi.org/10.1087/20150203

4 Lin, Z. Why and how to embrace AI such as ChatGPT in your academic life. R. Soc. Open Sci. 10, 230658 (2023). https://doi.org/10.1098/rsos.230658

5 Flake, J. K. & Fried, E. I. Measurement schmeasurement: Questionable measurement practices and how to avoid them. Adv. Meth. Pract. Psychol. Sci. 3, 456-465 (2020). https://doi.org/10.1177/2515245920952393

6 Clark, H. H. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior 12, 335-359 (1973). https://doi.org/10.1016/S0022-5371(73)80014-3

7 Yarkoni, T. The generalizability crisis. Behav. Brain Sci. 45, e1 (2020). https://doi.org/10.1017/s0140525x20001685

8 Sterling, E., Pearl, H., Liu, Z., Allen, J. W. & Fleischer, C. C. Demographic reporting across a decade of neuroimaging: A systematic review. Brain Imaging and Behavior 16, 2785-2796 (2022). https://doi.org/10.1007/s11682-022-00724-8

9 Tedersoo, L. et al. Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data 8, 192 (2021). https://doi.org/10.1038/s41597-021-00981-0

10 Hardwicke, T. E. et al. Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspect. Psychol. Sci. 17, 239-251 (2021). https://doi.org/10.1177/1745691620979806

11 National Academies of Sciences, Engineering, and Medicine; Division of Behavioral and Social Sciences and Education; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on Accelerating Behavioral Science through Ontology Development and Use. in Ontologies in the Behavioral Sciences: Accelerating Research and the Spread of Knowledge (eds A. S. Beatty & R. M. Kaplan) (National Academies Press (US), 2022).

12 Hanson, M. A., Barreiro, P. G., Crosetto, P. & Brockington, D. The strain on scientific publishing. Quantitative Science Studies 5, 823-843 (2024). https://doi.org/10.1162/qss_a_00327

13 Diconne, K., Kountouriotis, G. K., Paltoglou, A. E., Parker, A. & Hostler, T. J. Presenting KAPODI—the searchable database of emotional stimuli sets. Emotion Review 14, 84-95 (2022). https://doi.org/10.1177/17540739211072803

14 Lin, Z., Ma, Q., Huang, X., Wu, X. & Zhang, Y. Pervasive failure to report properties of visual stimuli in experimental research in psychology and neuroscience: Two metascientific studies. Psychol. Bull. 149, 487-505 (2023).

15 Lang, P. J., Bradley, M. M. & Cuthbert, B. N. International affective picture system (IAPS): Affective ratings of pictures and instruction manual. (NIMH, Center for the Study of Emotion & Attention, 2005).

16 Kurdi, B., Lozano, S. & Banaji, M. R. Introducing the Open Affective Standardized Image Set (OASIS). Behav. Res. Methods 49, 457-470 (2017). https://doi.org/10.3758/s13428-016-0715-3

17 Dan-Glauser, E. S. & Scherer, K. R. The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behav. Res. Methods 43, 468-477 (2011). https://doi.org/10.3758/s13428-011-0064-1

18 Marchewka, A., Żurawski, Ł., Jednoróg, K. & Grabowska, A. The Nencki Affective Picture System (NAPS): Introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behav. Res. Methods 46, 596-610 (2014). https://doi.org/10.3758/s13428-013-0379-1

19 Weierich, M. R., Kleshchova, O., Rieder, J. K. & Reilly, D. M. The Complex Affective Scene Set (COMPASS): Solving the social content problem in affective visual stimulus sets. Collabra: Psychology 5, 53 (2019). https://doi.org/10.1525/collabra.256

20 Mancuso, V. et al. IAVRS—International Affective Virtual Reality System: Psychometric assessment of 360° images by using psychophysiological data. Sensors 24 (2024).

21 Balsamo, M., Carlucci, L., Padulo, C., Perfetti, B. & Fairfield, B. A bottom-up validation of the IAPS, GAPED, and NAPS affective picture databases: Differential effects on behavioral performance. Front Psychol 11 (2020). https://doi.org/10.3389/fpsyg.2020.02187

22 Michelson, M. & Reuter, K. The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials. Contemporary Clinical Trials Communications 16, 100443 (2019). https://doi.org/10.1016/j.conctc.2019.100443

23 Thorndike, E. L. Theory of mental and social measurements. (The Science Press, 1904).

24 Kelley, T. L. Interpretation of educational measurements. (World Book Company, 1927).

25 van Zyl, L. E. & Rothmann, S. Grand challenges for positive psychology: Future perspectives and opportunities. Front Psychol 13, 833057 (2022). https://doi.org/10.3389/fpsyg.2022.833057

26 Baggetta, P. & Alexander, P. A. Conceptualization and operationalization of executive function. Mind, Brain, and Education 10, 10-33 (2016). https://doi.org/10.1111/mbe.12100

27 Karr, J. E. et al. The unity and diversity of executive functions: A systematic review and re-analysis of latent variable studies. Psychol. Bull. 144, 1147-1185 (2018). https://doi.org/10.1037/bul0000160

28 Anvari, F. et al. Defragmenting psychology. Nat. Hum. Behav. 9, 836-839 (2025). https://doi.org/10.1038/s41562-025-02138-0

29 Ponnock, A. et al. Grit and conscientiousness: Another jangle fallacy. J Res Pers 89, 104021 (2020). https://doi.org/10.1016/j.jrp.2020.104021

30 Wulff, D. U. & Mata, R. Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nat. Hum. Behav. 9, 944-954 (2025). https://doi.org/10.1038/s41562-024-02089-y

31 von Hippel, P. T. & Schuetze, B. A. How not to fool ourselves about heterogeneity of treatment effects. Adv. Meth. Pract. Psychol. Sci. 8, 25152459241304347 (2025). https://doi.org/10.1177/25152459241304347

32 Call, C. C. et al. An ethics and social-justice approach to collecting and using demographic data for psychological researchers. Perspect. Psychol. Sci. 18, 979-995 (2022). https://doi.org/10.1177/17456916221137350

33 Danchev, V., Min, Y., Borghi, J., Baiocchi, M. & Ioannidis, J. P. A. Evaluation of data sharing after implementation of the International Committee of Medical Journal Editors data sharing statement requirement. JAMA Network Open 4, e2033972-e2033972 (2021). https://doi.org/10.1001/jamanetworkopen.2020.33972

34 Wicherts, J. M., Borsboom, D., Kats, J. & Molenaar, D. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726-728 (2006). https://doi.org/10.1037/0003-066X.61.7.726

35 Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94-97 (2014). https://doi.org/10.1016/j.cub.2013.11.014

36 Hardwicke, T. E. & Ioannidis, J. P. A. Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLOS ONE 13, e0201856 (2018). https://doi.org/10.1371/journal.pone.0201856

37 Tierney, J. F., Stewart, L. A., Clarke, M. & on behalf of the Cochrane Individual Participant Data Meta-analysis Methods Group. in Cochrane Handbook for Systematic Reviews of Interventions 643-658 (2019).

38 Castro, O., Mair, J., von Wangenheim, F. & Kowatsch, T. in Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies – HEALTHINF. 671-678 (SciTePress).

39 Kobak, D., González-Márquez, R., Horvát, E.-Á. & Lause, J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances 11, eadt3813 (2025). https://doi.org/10.1126/sciadv.adt3813

40 Liang, W. et al. Quantifying large language model usage in scientific papers. Nat. Hum. Behav. (2025). https://doi.org/10.1038/s41562-025-02273-8

41 Walters, W. H. & Wilder, E. I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5

42 Boettiger, C. An introduction to Docker for reproducible research. SIGOPS Oper. Syst. Rev. 49, 71–79 (2015). https://doi.org/10.1145/2723872.2723882

43 Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: Scientific containers for mobility of compute. PLOS ONE 12, e0177459 (2017). https://doi.org/10.1371/journal.pone.0177459

44 Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316-319 (2017). https://doi.org/10.1038/nbt.3820

45 Moreau, D., Wiebels, K. & Boettiger, C. Containers for computational reproducibility. Nature Reviews Methods Primers 3, 50 (2023). https://doi.org/10.1038/s43586-023-00236-9

46 Bandrowski, A. et al. The Resource Identification Initiative: A cultural shift in publishing. J. Comp. Neurol. 524, 8-22 (2016). https://doi.org/10.1002/cne.23913

47 Denissen, M., Pöll, B., Robbins, K., Makeig, S. & Hutzler, F. HED LANG – A Hierarchical Event Descriptors library extension for annotation of language cognition experiments. Scientific Data 11, 1428 (2024). https://doi.org/10.1038/s41597-024-04282-0

48 Sharp, C., Kaplan, R. M. & Strauman, T. J. The use of ontologies to accelerate the behavioral sciences: Promises and challenges. Curr. Dir. Psychol. Sci. 32, 418-426 (2023). https://doi.org/10.1177/09637214231183917

49 Poldrack, R. A. et al. The Cognitive Atlas: Toward a knowledge foundation for cognitive neuroscience. Frontiers in Neuroinformatics 5 (2011). https://doi.org/10.3389/fninf.2011.00017

50 Leising, D., Liesefeld, H., Buecker, S., Glöckner, A. & Lortsch, S. A tentative roadmap for consensus building processes. Personality Science 5, 27000710241298610 (2024). https://doi.org/10.1177/27000710241298610

51 Schenk, P. et al. An ontological framework for organising and describing behaviours: The Human Behaviour Ontology. Wellcome Open Research 9 (2025). https://doi.org/10.12688/wellcomeopenres.21252.2

52 Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

53 Gilmore, R. O., Kennedy, J. L. & Adolph, K. E. Practical solutions for sharing data and materials from psychological research. Adv. Meth. Pract. Psychol. Sci. 1, 121-130 (2018). https://doi.org/10.1177/2515245917746500

54 Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11, 702-712 (2016). https://doi.org/10.1177/1745691616658637

55 Huang, Z., Long, Y., Peng, K. & Tong, S. An embedding-based semantic analysis approach: A preliminary study on redundancy detection in psychological concepts operationalized by scales. Journal of Intelligence 13 (2025). https://doi.org/10.3390/jintelligence13010011

56 Gonzalez, O., MacKinnon, D. P. & Muniz, F. B. Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research 56, 3-19 (2021). https://doi.org/10.1080/00273171.2019.1707061

57 Samuel, S. & Mietchen, D. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience 13, giad113 (2024). https://doi.org/10.1093/gigascience/giad113

58 Dobbins, N., Xiong, C., Lan, K. & Yetisgen, M. Large language model-based agents for automated research reproducibility: An exploratory study in Alzheimer’s disease. arXiv:2505.23852 (2025). https://doi.org/10.48550/arXiv.2505.23852

59 Michie, S. et al. Developing and using ontologies in behavioural science: addressing issues raised. Wellcome Open Research 7 (2023). https://doi.org/10.12688/wellcomeopenres.18211.2

60 Elliott, J. H. et al. Living systematic reviews: An emerging opportunity to narrow the evidence-practice gap. PLoS Med. 11, e1001603 (2014). https://doi.org/10.1371/journal.pmed.1001603

61 Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7, e012545 (2017). https://doi.org/10.1136/bmjopen-2016-012545

62 Shojania, K. G. et al. How quickly do systematic reviews go out of date? A survival analysis. Ann. Intern. Med. 147, 224-233 (2007). https://doi.org/10.7326/0003-4819-147-4-200708210-00179

63 Franco, A., Malhotra, N. & Simonovits, G. Underreporting in psychology experiments: Evidence from a study registry. Social Psychological and Personality Science 7, 8-12 (2016). https://doi.org/10.1177/1948550615598377

64 Laitin, D. D. et al. Reporting all results efficiently: A RARE proposal to open up the file drawer. Proc. Natl. Acad. Sci. U. S. A. 118, e2106178118 (2021). https://doi.org/10.1073/pnas.2106178118

65 Butler, A. R., Hartmann-Boyce, J., Livingstone-Banks, J., Turner, T. & Lindson, N. Optimizing process and methods for a living systematic review: 30 search updates and three review updates later. J. Clin. Epidemiol. 166, 111231 (2024). https://doi.org/10.1016/j.jclinepi.2023.111231

66 Kane, A. & Amin, B. Amending the literature through version control. Biol. Lett. 19, 20220463 (2023). https://doi.org/10.1098/rsbl.2022.0463

67 Budd, J. M., Sievert, M., Schultz, T. R. & Scoville, C. Effects of article retraction on citation and practice in medicine. Bull. Med. Libr. Assoc. 87, 437-443 (1999).

68 Lin, Z. FOCUS: an AI-assisted reading workflow for information overload. Nat. Biotechnol. 43, 2070-2075 (2025). https://doi.org/10.1038/s41587-025-02947-8

69 Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience. Neuroinformatics 21, 101-113 (2023). https://doi.org/10.1007/s12021-022-09598-z

70 Ard, T. et al. Integrating data directly into publications with augmented reality and web-based technologies – Schol-AR. Scientific Data 9, 298 (2022). https://doi.org/10.1038/s41597-022-01426-y

71 Perkel, J. M. Make code accessible with these cloud services. Nature 575, 247-248 (2019). https://doi.org/10.1038/d41586-019-03366-x

72 Lin, Z. Towards an AI policy framework in scholarly publishing. Trends Cogn. Sci. 28, 85-88 (2024). https://doi.org/10.1016/j.tics.2023.12.002

73 Puebla, I. et al. Ten simple rules for recognizing data and software contributions in hiring, promotion, and tenure. PLoS Comput. Biol. 20, e1012296 (2024). https://doi.org/10.1371/journal.pcbi.1012296

74 Piwowar, H. A. & Vision, T. J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013). https://doi.org/10.7717/peerj.175

Editors

Kathryn Zeiler
Editor-in-Chief

Alex Holcombe
Handling Editor

Editorial assessment

by Alex Holcombe

DOI: 10.70744/MetaROR.312.1.ea

Reviewers found the manuscript’s motivation, to restructure scientific publishing to serve both human and AI readers, worthwhile and timely. The reviewers suggested that the manuscript would be improved by engaging more with previous work, including initiatives related to different elements of the proposal such as the FAIR principles, the Research Object concept that included the idea of machine-readable appendices, the Force11 manifesto, and machine-actionable publishing efforts at some journals, each of which has arguably contributed to making scientific articles more machine-readable. Additionally, the manuscript’s treatment of how AI systems actually process scientific literature was seen by the reviewers as needing updating, and would benefit from discussion of risks such as AI hallucination in proposed automated auditing functions.

Reviewers also raised some concerns about practical implementation, including a need for evidence and/or a stronger argument to back the proposed tiered adoption roadmap. It would also be worthwhile for the framework to make explicit its apparent assumption of a certain level of data and code sharing, which remains uncommon in many disciplines. One reviewer, an ethicist, raised concerns about accountability if significant portions of a paper are written for AI consumption, as that could make them less directly interpretable by humans.

Recommendations for enhanced transparency

Add author ORCID iD.
Add a competing interest statement. Authors should report all competing interests, including not only financial interests, but any role, relationship, or commitment of an author that presents an actual or perceived threat to the integrity or independence of the research presented in the article. If no competing interests exist, authors should explicitly state this.
Add a funding source statement. Authors should report all funding in support of the research presented in the article. Grant reference numbers should be included. If no funding sources exist, explicitly state this in the article.

For more information on these recommendations, please refer to our author guidelines.

Competing interests: None.

Peer review 1

David Resnik

DOI: 10.70744/MetaROR.312.1.rv1

This paper argues for a new bifurcated model of scientific papers: a human-readable section and an AI-readable section. The rationale is to better enable AI systems to synthesize the scientific literature, to facilitate peer review, and to promote more rigor in science. The idea is original, important, and merits further discussion and debate. I would like to raise some philosophical and ethical concerns with the idea that the authors do not adequately address.

1. Who would take responsibility for the paper? The human? The AI? It seems that humans could not be responsible for large sections of the paper, because those sections are written by AIs to be understandable to AIs. But if humans can’t really take responsibility for what has been done, this creates a very dangerous situation. The AI sections of the paper could contain dangerous information (for example, how to make bioweapons) that is understandable to AIs but not to humans, and the humans could do nothing about it. “Human in the loop” is a big theme in AI ethics, but what the authors have proposed seems to take humans dangerously out of the loop.

2. Since authorship and responsibility go hand in hand, this of course raises major authorship questions.

3. It seems it also raises epistemological issues about knowledge that transcends human understanding (can this even be considered knowledge?) and the democratization of science. Making science even more technical than it already is seems to make it even less democratic.

4. It also seems that this approach might be more applicable to some fields than others. For example, it may suit computer science and other highly technical disciplines but not the humanities (philosophy? law?) and maybe not even social science. This needs to be addressed.

5. Deskilling of humans is an issue too. If we get the dumbed-down version, we get dumber.

See Resnik, DB, Hosseini, M & Hauswald, R. Autonomous artificial intelligence, scientific research, and human values. AI Ethics 6, 141 (2026). https://doi.org/10.1007/s43681-025-00908-0. This article touches on human in the loop issues and related issues with respect to AI agents, which raise similar concerns.

Competing interests: None.

Peer review 2

Iratxe Puebla

DOI: 10.70744/MetaROR.312.1.rv2

The manuscript proposes a new format for research articles that includes front-loading summaries of the work for human readers and executable appendices that include the data, code and other research artefacts necessary so that machines can execute the analyses reported in the article.

There are a number of things that I like about the proposal such as leveraging new digital technologies for article publishing, and the focus on maximizing the reproducibility and openness of scholarly work. I like the idea of front-loading the article with information to make it easier for readers to grasp content and decide whether it is relevant to them. I could imagine something similar to the eLife summary, that includes structured designations of rigor and novelty to assist with consistent assessment across articles. I liked the mention in the perspective about a clear signal about the limitations of the work – I view this as a key trust signal valuable to readers and missed some further elaboration of what that would look like.

I also like the idea of an AI‑generated audit report for peer reviewers. Many journals already apply AI-based checks on papers, so expanding that and making it available for papers that proceed to peer review would add transparency on journal processes – and may prevent instances of reviewers adding the papers to an AI tool (against journal policy) to generate summaries.

At the same time, I have some questions and concerns as to how the implementation of the proposed format would work in practice. The proposal appears to focus on technological opportunity without accounting for the level of adoption of certain practices needed for implementation. The machine-actionable package described requires a foundation of practices toward data and code sharing and detailed methodological reporting. This is not commonplace across papers and disciplines, and there is no discussion about the challenges that would arise from an implementation that is only applicable to articles where all associated research objects are shared and the full methodology reported.

Conceptually, I also have a concern about perpetuating a framing where data, code and other open objects are presented as ‘appendices’ or corollary to the ‘article’ rather than as research contributions on their own merit. I would argue that given the current digital platforms available, the argument for appendices or supplementary materials is weak. Objects originating from a research project can be deposited in repositories or other platforms and given persistent identifiers and associated metadata. These can then be linked to the article. On this basis, the option of having those materials already exists and the current need relates to better systems on the journal side to link to other objects, make those connections visible in the research information ecosystem, and potentially, as discussed in the perspective, bring those into the article environment to enable greater scrutiny and re-analysis. Figure 3B points articles -> repositories in relation to information flow; I would be interested in a flow that leverages open outputs shared in repositories where the direction is repository -> article to enrich the information provided in the article narrative.

The text refers to articles several times as PDFs; this does not account for the fact that many journals use formats such as XML that are machine readable. I acknowledge that important contents of an article are not machine readable, but it’d be worth noting that there are already formats in place for articles that are machine readable.

In the discussion of risks, it’d be worth noting the risk of the executable article leading to a proliferation of yet more articles, given the low bar to create aggregate datasets and analyses, for example in the form of irrelevant meta-analyses. There have been examples of this, e.g. from the large-scale reuse of the NHANES database: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152

With regard to the aggregation of information across articles, aggregators exist that index content from different journals and other platforms (e.g. Google Scholar). Admittedly this only covers a portion of the information about articles and does not provide executable options, but one challenge relates to the availability/openness of metadata provided for articles. This is something to consider for a system where the potential for large-scale analysis relies on information flows from journals.

I felt that the section on incentives is underdeveloped. The section mentions that CoARA advocates for recognition for different research objects, but this appears at odds with the suggestion to place associated data and code as appendices within an article. There is also no discussion on how the proposed article format aligns with research assessment reform efforts, or how it would facilitate recognition for a greater diversity of research contributions as part of assessment processes.

Competing interests: I work for the Make Data Count initiative, which promotes recognition of data as important research objects.

Peer review 3

Anonymous User

DOI: 10.70744/MetaROR.312.1.rv3

Note for the Author’s manuscript: this review is based on the version V4.

The article proposes a dual-audience framework for restructuring scientific papers so that they serve both human readers and AI systems. The author claims that behavioral and social sciences face a crisis of volume overload and epistemic fragmentation, caused by incomparable stimulus databases, jingle-jangle measurement errors, demographic blindness, and inaccessible raw data, and that current AI tools, while useful for summarization, cannot resolve these problems. The proposed solution consists of two main components: a narrative layer for human readers, organized around the key findings, and a machine-readable structured appendix containing executable containerized environments, ontologically mapped constructs, persistent stimulus identifiers, and data. The article describes how these elements could support automated jingle-jangle auditing, AI-assisted peer review, and continuously updated “living evidence networks” and systematic reviews/meta-analyses.

However, as currently written, the manuscript seems to be more a Perspective article than a research article. The contribution is difficult to isolate from what is already proposed by prior FAIR, open-science, and Barcelona Declaration initiatives. The four original contributions are restatements of existing proposals (or are underdeveloped in the manuscript itself). Methodologically, the engagement with AI systems is not well developed and does not reflect the current state of LLM-based research pipelines. There are also several terminological and practical problems. These issues require revision before the manuscript can be considered for publication.

Major issues

1. The manuscript opens by claiming four original contributions that distinguish it from FAIR, open-science, and reproducibility initiatives. This distinction is unconvincing. The dual-audience paper architecture is described as pairing a front-loaded narrative with a machine-actionable structured appendix to achieve fully reproducible research “rather than treating supplementary materials as an afterthought”. Yet this is the design motivation of declarations and initiatives such as the Barcelona Declaration. The manuscript needs to demonstrate how its proposal differs from, or improves on, these other initiatives and frameworks in any technically or conceptually meaningful way.

2. The minimum viable structured appendix (Contribution 2) is presented as a novel specification, but its four elements (executable containers, standardized identifiers, ontological mappings, and data) are very similar to existing requirements in some highly reputable journals. Personally, I find the fourth contribution (the living systematic reviews idea) very interesting and important. However, on this point, I think that the LLMs, the ontological map, and the data can be used in two ways: the text (i.e., the manuscript) and the ontological map can feed an AI agent that builds a topic-specific RAG, Knowledge-RAG, or the recently proposed LLM wiki, while the data can be used to update the analysis. I kindly ask the author whether this is the direction the proposed framework intends to take and, if so, to expand this part. At the same time, this kind of future raises another very important question (perhaps beyond the author’s central focus): who maintains the “Living Evidence Networks”?

3. The manuscript’s central motivation is that manuscripts must be restructured for “AI systems”, but the treatment of those systems is thin and does not reflect current AI developments. The AI field moves quickly, so the architecture and proposal must also be adaptable and flexible. Specifically, the manuscript does not consider that modern document-ingestion pipelines for LLM-based research agents do not use appendices or manuscript text as described; they typically work through Markdown conversion, chunked embeddings, and/or retrieval-augmented generation over parsed text. Also, AI agents could in future develop scripts that fully reproduce the code behind an article’s methodology, provided it is well described. The AI bottleneck for literature analysis is not only the absence of structured appendices (excluding data) but also PDF-to-text parsing failures, inconsistent table formats, images (which, to date, are the most difficult content for a classical LLM to analyze), and citation disambiguation. I think that publishing articles in Markdown format could be a way for the proposed architecture (and for journal publishers) to really support AI research pipelines.

4. The article proposes that AI perform automatic jingle-jangle audits (Box 1). However, it does not discuss the risk that the AI might hallucinate incorrect ontological relationships between constructs, creating a scientific “false truth” that is even harder to eradicate because it is “validated by the system”. A manual validation step, or a human-in-the-loop approach, can help to avoid this problem.

5. The manuscript is single-authored, but the author employs first-person plural: “we propose”, “we argue”, “we introduce”, “our Perspective”, “our framework”, “our goal”. Revise to singular first person (“I propose”, “I argue”) or use the objective/passive voice.

6. The article identifies privacy concerns as a “failure mode” but treats them as a constraint to be noted rather than a challenge to be addressed. Privacy is arguably the most significant barrier to widespread adoption of individual participant data sharing, particularly in clinical, educational, and cross-national research contexts. The article itself raises not only privacy issues (as described in Section Implementation and governance options) but also copyright (and economic) issues. How the article is treated or used by LLMs must be disclosed by the publisher and shared with the original author. Not all authors will agree to let their article be ingested by LLMs for future training.

7. The implementation roadmap (Tiers 0-3) is presented without evidence that the proposed tiers are calibrated to actual barriers to adoption. The claim that “most labs can implement Tiers 0-1 now” is asserted rather than documented. Managerial and technological barriers are arguably the main challenges to adoption that we can find in almost all the new proposals. Empirical literature on the determinants of open data adoption, including training barriers, time costs, incentive misalignment (I really suggest highlighting this aspect), and institutional risk aversion, is not cited.

8. Finally, in Figure 1, the block on “demographic blindness” is very important for the analysis of primary data collected via questionnaires and for the field of psychology, but it is not always applicable to other types of analysis. Every type of research presents similar distortions depending on the context. For example, in economic analysis we may encounter the same problem if we do not specify the size of companies in terms of number of employees or revenues. The same applies to comparisons between universities using enrolments or academic staff. Therefore, I believe that the main issue is not solely linked to demographic data (which may perhaps represent the main problem in psychology), but to the lack of contextual data. I suggest updating Figure 1 with a section relating to this concept (perhaps something like ‘Insufficiency of contextual data’) and, particularly for work on primary data/questionnaires, focusing attention on demographic data. This may contribute to the generalizability of the proposed framework.
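
As an illustration of the ingestion pipeline mentioned in major issue 3 (Markdown conversion, chunking, and retrieval over parsed text), the following is a minimal sketch and not a description of any particular tool. It assumes the article is already available as Markdown, splits it at headings, and ranks chunks against a query. The word-count `embed` function is a toy stand-in for a learned embedding model, and all function names and the sample text are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_markdown(text):
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def embed(text):
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical article text, already converted to Markdown.
paper = """# Methods
Participants completed a Stroop task with 120 trials.

# Results
Reaction times were slower on incongruent trials.
"""

chunks = chunk_markdown(paper)
top = retrieve("How many trials were in the Stroop task?", chunks)
```

In this sketch the retrieved chunk would then be passed to a language model as context. Splitting at headings only works if the conversion step preserves them, which is where the parsing failures listed in issue 3 would bite.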

Minor issues

The manuscript uses “AI system”, “AI tools”, “AI agents”, “LLM-based pipeline”, and “generative AI” in ways that are not always consistent or clearly distinguished. A brief terminological table or definitional paragraph at the outset would reduce ambiguity.

Give all sections conventional research-article section names (e.g., the first section does not have an “Introduction” heading). Clarify whether this is a Perspective or a Research/Review article. The current section division does not help in this respect, since there is no Method section and the Framework is not preceded by a “Literature Review” or “Background” section. Please improve the article structure.

Two DOI links are not currently working (although I checked manually and the DOIs are correct). Please fix the link errors:

Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience.
Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., … & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific data, 8(1), 192.

Competing interests: None.

Peer review 4

Stian Soiland-Reyes

DOI: 10.70744/MetaROR.312.1.rv4

This article proposes a framework for publishing academic articles with the dual purpose of reaching both human and AI readers.

The ideas and motivation are in principle well-reasoned; however, this work is hampered by a lack of background research. The article does not have a good notion of Background or Existing Work; these are mainly mentioned in passing and not contrasted against the proposed framework. Notably, the paper claims the perspective builds on FAIR and open science principles, but these are mainly ignored for the rest of the article.

For instance, the concept of Research Object (https://doi.org/10.1016/j.future.2011.08.004) introduced the idea of machine-readable appendices from 2009 onwards, but this does not seem to be acknowledged by this manuscript. There have been whole conferences named “Beyond The PDF” by initiatives like Force11. The Force11 manifesto is recommended reading. The Research Data Alliance (RDA) has worked on open research practices and FAIR principles since 2013. The GO FAIR initiative is backed by several government initiatives.

Likewise, the FAIR principles have argued for machine-actionable metadata and data for two decades. Many research domains, such as biodiversity, life sciences and biomedicine, are well advanced in their use of FAIR: persistent identifiers, repositories, ontologies etc. are established best practice as part of publication processes, although arguably not consistently referenced from the corresponding academic articles. Psychology was one of the first fields to encourage the use of reproducible code and pre-registrations (see for instance https://doi.org/10.1177/21582440231205390). Several journals, like GigaScience or RIO Journal, have machine-actionable measures such as embedding computational workflows and nanopublications.

Overall, the article presents its framework as a new proposal, but I feel that by ignoring all previous work on making scholarly communication machine-readable, it undermines the genuinely useful proposals from the framework (such as embedding machine-actionable reproducibility checks and audit reports into the publication pipeline). A major revision of the article would need to put the framework in the context of the existing work, and suggest how it can be (or already is) implemented.

There is no evaluation provided of the proposed framework, or any suggestion of how its realization could be evaluated. Notably, the manuscript itself is submitted only as a PDF and does not follow its own mantra: there is no machine-readable appendix package attached. Following a review of existing methodologies and background, a revised manuscript could attempt to show the capabilities of the existing techniques, e.g. it could include any of Frictionless Data package, RO-Crate, or Croissant-ML appendix packages. It is not clear from the article why LLMs, which are primarily fed natural language text, would be better served by structured machine-readable appendices. For instance, StructGPT (https://doi.org/10.48550/arXiv.2305.09645) makes this point.
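
To make the suggestion of an attached appendix package concrete, the following is a minimal sketch of what the metadata file of an RO-Crate could look like, built here with Python's standard `json` module. It follows the general shape of an RO-Crate 1.1 `ro-crate-metadata.json` (a metadata descriptor, a root dataset, and file entities) as an assumption about a plausible minimal layout. The title, date, license, and file paths are hypothetical placeholders and do not describe any package that accompanies the manuscript.

```python
import json


def build_ro_crate_metadata(title, description, date_published, license_url, files):
    """Return a dict shaped like a minimal RO-Crate 1.1 metadata document.

    `files` is a list of (path, name, mime_type) tuples; all values passed
    in below are hypothetical placeholders.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": name, "encodingFormat": mime}
        for path, name, mime in files
    ]
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root_dataset = {
        "@id": "./",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        # The root dataset lists its files by reference.
        "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root_dataset] + file_entities,
    }


metadata = build_ro_crate_metadata(
    title="Appendix package (placeholder title)",
    description="Trial-level data and analysis script (hypothetical contents).",
    date_published="2026-01-01",  # placeholder date
    license_url="https://creativecommons.org/licenses/by/4.0/",  # placeholder license
    files=[
        ("data/trials.csv", "Trial-level observations", "text/csv"),
        ("code/analysis.py", "Analysis script", "text/x-python"),
    ],
)
crate_json = json.dumps(metadata, indent=2)
```

Writing `crate_json` to a file named `ro-crate-metadata.json` alongside the listed files would give a package that crate-aware tools could inspect. A real submission would need richer metadata (authors, identifiers, provenance) than this sketch carries.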

The writing needs significant improvement; for instance, “jingle-jangle” is mentioned twice before it is explained on page 3. While MetaROR does not require any particular article format, the sections are not structured enough for an academic article, and the text reads more like a blog post. As there is a lack of implementation, it can perhaps be improved to become an Opinion piece, but it would still need to relate its proposal to existing work.

Competing interests: My research group eScience Lab at The University of Manchester first published on the Research Object ideas in: Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Phillip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble (2013): Why Linked Data is Not Enough for Scientists. Future Generation Computer Systems 29(2) https://doi.org/10.1016/j.future.2011.08.004. I am the co-lead of the RO-Crate community.
