Published at MetaROR

May 6, 2026

Cite this article as:

Lin, Z. (2026, January 21). Restructuring scientific papers for human and AI readers. Retrieved from osf.io/preprints/psyarxiv/c46hs_v5

Restructuring scientific papers for human and AI readers

Zhicheng Lin1

1 Department of Psychology, Yonsei University

Originally published on January 21, 2026 at: 

Abstract

Scientific communication faces a dual crisis: exponential publication growth overwhelms human readers, and fragmented research practices block automated synthesis. AI-assisted writing exacerbates the volume problem, producing papers faster than they can be read. Behavioral and social sciences in particular suffer from incomparable stimulus databases, jingle–jangle measurement fallacies, and demographic blindness that conceals effect heterogeneity. Current AI tools aid comprehension and summarization yet cannot aggregate findings from incommensurable studies and risk amplifying biases when trained on unstructured, unverified text. We propose restructuring scientific papers for dual audiences: front-loaded narratives for time-pressed human readers, paired with machine-readable appendices containing executable code, standardized metadata, and ontologically mapped constructs. This design turns papers into queryable research environments where readers can interrogate data and rerun analyses, and where structured appendices enable automated verification of statistical methods and AI-assisted peer review grounded in executable rather than narrative claims. Such papers become nodes in continuously updated evidence networks: each publication automatically contributes effect sizes to real-time meta-analyses, with corrections and retractions propagating through dependent analyses. Widespread adoption will require institutional recognition of structured documentation as essential scholarly output and computational infrastructure that serves both human comprehension and machine analysis.

Publication output doubles roughly every 17 years1, reaching 3.3 million articles in 20222, while researchers spend ever less time on each paper3. AI tools aid summarization and question-answering4 but cannot solve the deeper challenge of knowledge integration when the underlying literature is incoherent and inaccessible. In the behavioral and social sciences, findings are fragmented by incomparable measures5, bespoke materials and stimuli6,7, and poorly documented participant demographics8. Stimuli, code, and data9 are often unavailable, and even shared data are typically limited to summary statistics rather than standardized trial- or event-level observations10. Knowledge cannot accumulate from such incommensurable fragments.

To address this dual crisis of volume and fragmentation, and in response to a recent U.S. National Academies call for infrastructure to unify scientific knowledge11, we argue that the scientific paper must be rebuilt for two audiences: human readers and AI systems. For humans, this requires front-loading key findings in accessible prose for time-pressed researchers and turning papers into layered knowledge bases where readers can interrogate data, rerun analyses, and access technical details without wading through dense text. For AI systems, the main text, data, code, and protocols must be structured in machine-readable formats that enable automated analysis, comparison, and synthesis. In this Perspective, “AI system” denotes an LLM‑based pipeline that combines generative models with tools, external memory, and deterministic analysis modules, rather than a standalone language model.

Our Perspective builds on FAIR, open‑science, and reproducible‑workflow initiatives but makes four contributions. First, we propose a dual‑audience paper architecture that couples a front‑loaded narrative for human readers with a machine‑actionable structured appendix, rather than treating “supplementary materials” as an afterthought. Second, we specify a minimum viable structured appendix: not just a mandate to “share data” but a bundle of executable containers, standardized stimulus and measure identifiers, ontological mappings of constructs, and tidy trial‑level data with inclusive demographic coding. Third, we introduce a publication‑layer jingle–jangle audit, in which semantic and statistical checks on construct labels become routine infrastructure rather than occasional manual critiques. Fourth, we show how these elements together support living evidence networks, where individual papers become version‑controlled nodes feeding continuously updated, quality‑weighted syntheses.

Some journals already require authors to share datasets and scripts, typically in loosely documented repositories. These resources are idiosyncratic across studies, lack persistent identifiers for stimuli and measures, are not mapped onto shared ontologies, and rarely include trial‑level data with full provenance. They are human‑downloadable but not readily interoperable, queryable, or auditable at scale. By contrast, the dual‑audience paper and structured appendix proposed here treat interoperability, automation, and verification as first‑class design goals: standardized metadata, identifiers, and containerized workflows enable both human analysts and AI systems to rerun analyses, audit construct usage, and feed living evidence networks, rather than merely attaching files to a PDF.

Together, these components define a framework for turning individual papers into queryable research environments and inputs for AI pipelines in which synthesis and verification modules operate on verified, structured knowledge rather than unvetted prose. Our goal is not to introduce yet another platform or standard, but to specify how existing tools—containers, ontologies, data standards, and registries—can be assembled into a publication format and incentive structure that serves both human comprehension and machine analysis in psychology and other data‑rich sciences.

Crises in scientific practice and communication

The volume–fragmentation spiral has produced a cascade of institutional failures, beginning with quality control. Exponential growth in publications and preprints has overwhelmed traditional peer review12, which no longer provides a reliable signal of quality and relevance. Researchers skim abstracts, abandon papers mid-reading, and retreat to secondary summaries. Meanwhile, the scientific paper—designed for a print era of information scarcity—has become a bottleneck in an AI-rich era: critical data are buried in unstructured prose, trapped behind paywalls, and encoded in formats that resist computation. This friction may incidentally limit large-scale automated reuse of errors and sensitive findings, but it is a crude safeguard. Structured, machine-readable archives could instead pair easier access with explicit governance and quality checks, enabling more reliable synthesis.

These institutional failures both reflect and deepen epistemic fragmentation. Experimental psychologists, for example, routinely deploy proprietary or poorly documented stimulus sets—images, videos, vignettes, auditory clips—that differ in valence, arousal, cultural reference, and perceptual salience13,14. Even well-validated sets are seldom cross-validated against one another, including the many affective picture databases: the International Affective Picture System (IAPS)15, Open Affective Standardized Image Set (OASIS)16, Geneva Affective Picture Database (GAPED)17, Nencki Affective Pictures System (NAPS)18, Complex Affective Scene Set (COMPASS)19, International Affective Virtual Reality System (IAVRS)20, and dozens of culturally specific successors21. Effects may then be driven by stimulus-specific confounds rather than the intended construct. Without standardized, cross-validated stimulus sets, studies using different databases or materials become incomparable, forcing meta-analysts to hand-code or exclude findings—a laborious, error-prone process that scales exponentially with literature growth22.

The measurement landscape is equally fragmented. Psychology and related behavioral and social sciences are rife with “jingle–jangle” fallacies: identical names for fundamentally different phenomena (the jingle fallacy23) and different names for the same constructs (the jangle fallacy24). “Flourishing,” for instance, bundles conflicting theoretical approaches while preserving nominal unity25. “Executive function” spans working memory, cognitive flexibility, and inhibitory control26, with research oscillating among one-, two-, three-, and nested-factor models without converging on a stable structure27. Similar taxonomic confusions arise in economics (“poverty”), medicine (“autism”), and computer science (“artificial intelligence”), but they are especially pernicious in the behavioral sciences. Aggregating findings across such disparate conceptualizations yields statistically significant but scientifically questionable results.

Jangle problems extend to measurement proliferation. A large-scale analysis of APA databases shows that thousands of new measures are published annually, yet over 70% are reused at most once, so the literature grows more fragmented over time28. This proliferation creates redundant research silos: “grit,” for example, shares most of its reliable variance with conscientiousness29; “psychological capital” often repackages existing well-being measures25. Semantic embeddings further suggest that the 277 distinct construct labels in the International Personality Item Pool could be collapsed into a more parsimonious taxonomy of just 68 clusters30.

A third failure compounds these problems: demographic blindness conceals effect heterogeneity31. Many studies report age and sex while omitting race/ethnicity and socioeconomic status, making it impossible to determine for whom effects actually hold32. Findings robust in U.S. college samples may shrink or reverse in other cultural contexts, yet current reporting practices render such moderation invisible until replication failures accumulate. This is not merely a representational concern but a threat to validity: unreported demographic moderators masquerade as statistical noise, obscuring the very patterns researchers seek to understand.

A fourth failure—data inaccessibility and impoverishment—renders many findings functionally unverifiable and unsynthesizable. Most research data remain unavailable upon request9,33-36, and even when shared, they typically appear as summary statistics rather than the individual participant data (IPD) required for rigorous verification and reuse37. This practice blocks robust forms of evidence synthesis, which depend on IPD to standardize outcomes, conduct proper intention-to-treat analyses, and examine effect heterogeneity across participant characteristics. The very existence of IPD meta-analysis—a methodological gold standard that requires manual collection of raw data from original authors37—indicts the standard scientific paper’s failure as a knowledge-delivery mechanism.

These four failures interact to create formidable barriers to knowledge accumulation. A study of “emotion regulation” (construct problem) using different picture sets (stimulus problem) across varied populations (demographic problem) with only summary statistics available (data problem) poses a computational impossibility: disentangling these sources of heterogeneity requires precisely the structured information that current publication practices withhold.

This fragmentation renders traditional evidence-synthesis methods inadequate for modern science. Systematic reviews are too slow to keep pace with literature growth, prone to error, and often yield inconclusive findings from incommensurable studies38. With more than 70,000 unique measures already documented in psychology28, manual curation is no longer feasible at this scale.

Generative AI promises to automate knowledge synthesis at scale, yet this potential remains largely unrealized. LLMs now affect all stages of the writing process—with at least 13.5% of 2024 PubMed-indexed biomedical abstracts bearing AI-linked style markers39 and 22.5% of sentences in arXiv computer science abstracts estimated to be LLM-modified by September 202440. Yet they cannot synthesize knowledge from fragmented, incomparable studies. Worse, LLMs’ documented tendency to hallucinate citations and perpetuate training biases41 means that any AI-assisted synthesis must be grounded in verified, structured knowledge rather than unvetted prose.

Writing for two audiences

Addressing these crises requires rethinking the scientific paper’s structure to defragment the behavioral sciences and restart cumulative progress. We propose a dual-audience framework that serves time-pressed human readers and emerging AI systems by pairing a responsible front-loaded narrative and interactive knowledge layers with machine-readable appendices (Fig. 1).

Figure 1. Publication crisis and layered solution for human and AI readers. (A) Structural problems in behavioral-science publishing: stimulus and measurement fragmentation, demographic blindness, and inaccessible or impoverished data. (B) Human-optimized layers: a front-loaded narrative and interactive knowledge layer that give readers rapid access to key findings, methods, and materials. (C) Machine-optimized layer: a machine-readable appendix package with executable code, semantic annotations, and tidy trial-level data that powers interactive reanalysis for human readers and automated reproducibility checks, construct audits, and living meta-analysis for AI agents.

Front-loading for human readers. A front‑loaded paper inverts traditional structure: instead of shambling through a literature review before revealing findings, it begins with explicitly situated answers. The opening paragraphs should address, in order: What did you discover? Why does it matter? How does it change our understanding? They should simultaneously signal the strength of evidence, scope conditions, and how readers can verify or challenge the claims (e.g., via links to the specific scripts, containers, and data tables in the structured appendix that reproduce the headline numbers). This adapts journalism’s inverted pyramid with scientific guardrails so that readers immediately see the contribution, its limits, and the paths for inspection.

Traditional abstracts nominally front‑load some information, but severe space limits and conventions that prioritize technical precision over clarity often produce text that serves gatekeepers rather than readers.

Structured appendices for human interaction and machine processing. Front-loading alone, however, is insufficient. Papers must also become layered knowledge bases that let humans and machines directly interrogate underlying data. This requires structured appendices: a standardized knowledge package that transforms supplementary materials from passive dumping grounds into active, queryable computational infrastructure.

Each appendix is built around computational reproducibility. Instead of scattered code files and static data dumps, researchers provide executable analysis environments using containerization technologies such as Docker42 or Singularity43. These containers package code, dependencies, and configurations so that anyone can reproduce the analytical pipeline from raw data to final figures.

This approach is already standard in much of scientific computing: workflow systems such as Nextflow in bioinformatics rely on containers44, and pipelines such as fMRIPrep (neuroimaging), BioContainers (genomics), and tools in ecology (QGIS), astronomy (Astropy), and physics (CERN) demonstrate broad disciplinary adoption45.

On this foundation sit three layers of semantic structure. First, every stimulus—whether drawn from existing databases or created anew—receives a persistent identifier with standardized metadata for modality, normative ratings (valence, arousal, dominance), cultural validation samples, and licensing. This mirrors the Resource Identification Initiative, which assigns persistent identifiers to biological reagents and software to improve tracking and identifiability46.

Researchers creating novel stimuli document them in the same framework, contributing to an expanding queryable ecosystem. A study might reuse validated stimuli—for example, “all positive faces with arousal > 7 validated in East Asian samples”—or introduce new ones such as “custom workplace scenarios rated for stress and cultural relevance”; in both cases, structured metadata enables future discovery and comparison. In language‑cognition experiments, the HED LANG framework already provides a standardized, machine‑readable vocabulary for annotating stimuli47.
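To make the idea of a queryable stimulus ecosystem concrete, the following is a minimal sketch rather than a proposed standard. The record fields, the `stim:demo-…` identifier format, the 1-9 rating scale, and the rule that valence above 5 counts as "positive" are all assumptions introduced for illustration, and the ratings are fabricated.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stimulus:
    """One stimulus record; every field name here is an illustrative assumption."""
    stimulus_id: str          # persistent identifier (hypothetical format)
    modality: str             # e.g. "image", "audio", "vignette"
    category: str             # e.g. "face", "scene"
    valence: float            # normative rating, assumed 1-9 scale
    arousal: float            # normative rating, assumed 1-9 scale
    validated_in: frozenset = field(default_factory=frozenset)  # validation samples
    license: str = "unspecified"


# Fabricated registry entries, for illustration only.
REGISTRY = [
    Stimulus("stim:demo-0001", "image", "face", 7.8, 7.4, frozenset({"East Asian", "European"}), "CC-BY-4.0"),
    Stimulus("stim:demo-0002", "image", "face", 7.1, 5.2, frozenset({"East Asian"}), "CC-BY-4.0"),
    Stimulus("stim:demo-0003", "image", "scene", 8.0, 7.9, frozenset({"East Asian"}), "CC-BY-4.0"),
    Stimulus("stim:demo-0004", "image", "face", 2.3, 7.6, frozenset({"East Asian"}), "CC-BY-4.0"),
    Stimulus("stim:demo-0005", "image", "face", 7.5, 7.2, frozenset({"North American"}), "CC0"),
]


def find_stimuli(registry, category, min_valence, min_arousal, sample):
    """Return IDs of stimuli matching category, rating thresholds, and validation sample."""
    return [
        s.stimulus_id
        for s in registry
        if s.category == category
        and s.valence > min_valence
        and s.arousal > min_arousal
        and sample in s.validated_in
    ]


# "All positive faces with arousal > 7 validated in East Asian samples",
# treating valence > 5 (assumed scale midpoint) as "positive".
matches = find_stimuli(REGISTRY, "face", min_valence=5, min_arousal=7, sample="East Asian")
print(matches)
```

Because the same record structure would describe both reused and newly created stimuli, a later study could run the identical query against an expanded registry without hand-coding stimulus properties.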

Second, measurement instruments are mapped to shared conceptual spaces using ontological systems, from controlled vocabularies to formal logic-based, machine-readable ontologies48. When a study uses the Beck Depression Inventory-II, for example, items are linked to standardized depression subdimensions using Uniform Resource Identifiers (URIs) from repositories such as the Cognitive Atlas49. This semantic mapping enables AI systems to detect when ostensibly different measures assess the same construct, or when identical labels mask different phenomena, addressing the jingle–jangle problem algorithmically rather than through laborious manual coding (Box 1).
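Once instruments carry ontology URIs, the simplest form of this check reduces to grouping annotations by label and by URI. The sketch below assumes a flat list of annotation records; the study IDs, instrument names, `example.org` URIs, and the assignment of particular labels to particular URIs are invented placeholders and do not reflect actual Cognitive Atlas entries.

```python
from collections import defaultdict

# Each record links a construct label used in a paper to an ontology URI.
# All values below are invented placeholders for illustration.
ANNOTATIONS = [
    {"study": "S1", "label": "conscientiousness", "instrument": "Scale A", "uri": "https://example.org/onto/C001"},
    {"study": "S2", "label": "grit", "instrument": "Scale B", "uri": "https://example.org/onto/C001"},
    {"study": "S3", "label": "flourishing", "instrument": "Scale C", "uri": "https://example.org/onto/C010"},
    {"study": "S4", "label": "flourishing", "instrument": "Scale D", "uri": "https://example.org/onto/C011"},
    {"study": "S5", "label": "working memory", "instrument": "Task E", "uri": "https://example.org/onto/C020"},
]


def audit(annotations):
    """Flag jingle (one label, several URIs) and jangle (one URI, several labels)."""
    uris_by_label = defaultdict(set)
    labels_by_uri = defaultdict(set)
    for a in annotations:
        uris_by_label[a["label"]].add(a["uri"])
        labels_by_uri[a["uri"]].add(a["label"])
    jingle = {label: sorted(uris) for label, uris in uris_by_label.items() if len(uris) > 1}
    jangle = {uri: sorted(labels) for uri, labels in labels_by_uri.items() if len(labels) > 1}
    return jingle, jangle


jingle, jangle = audit(ANNOTATIONS)
print("jingle candidates:", jingle)
print("jangle candidates:", jangle)
```

A check of this kind only surfaces candidates: whether two labels truly denote one construct still depends on how carefully the original mappings were made.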

This framework makes explicit four measurement questions: What construct does this instrument measure? Why was it chosen? How are responses quantified? What study‑specific modifications were made?5 Yet infrastructure alone is insufficient; meaningful interoperability requires communities to forge consensus on shared definitions and standards50. The Human Behaviour Ontology illustrates this approach, systematically defining and relating thousands of behavioral concepts to impose coherence on fragmented research domains51.

Third, all data adopt tidy trial-level formats that capture the full experimental context, linking each response to its stimulus, participant characteristics, and trial conditions using standardized, inclusive demographic coding. This operationalizes the FAIR (Findable, Accessible, Interoperable, and Reusable) principles52 by sharing data at the most granular level in standardized formats (e.g., Psych-DS for behavioral data, BIDS for neuroimaging) to maximize value for secondary analysis and reuse53.

This restructuring enables queries that are impossible with current summary-statistics approaches: “Show effects for women over 60” or “Exclude WEIRD-dominated samples” become computational operations rather than manual exclusions. Readers can instantly examine how effects vary across age, education, or cultural context without requesting raw data or running new studies. They can query the executable environment to assess heterogeneity across stimuli and analytic choices—“How sensitive are results to different stimuli or preprocessing decisions?”—and manipulate interactive visualizations, adjust parameters, and explore alternative presentations in real time.
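As a rough illustration of how a subgroup question becomes a computational operation, the sketch below filters a toy tidy table and recomputes a raw treatment-minus-control difference. The column names and all values are fabricated, and a real reanalysis would presumably rerun the authors' statistical model inside the container rather than compare simple means.

```python
from statistics import mean

# Toy tidy trial-level table: one row per observation. All values are fabricated.
TRIALS = [
    {"participant": "p1", "gender": "woman", "age": 64, "condition": "treatment", "score": 12.0},
    {"participant": "p1", "gender": "woman", "age": 64, "condition": "control", "score": 9.0},
    {"participant": "p2", "gender": "woman", "age": 71, "condition": "treatment", "score": 14.0},
    {"participant": "p2", "gender": "woman", "age": 71, "condition": "control", "score": 10.0},
    {"participant": "p3", "gender": "man", "age": 66, "condition": "treatment", "score": 11.0},
    {"participant": "p3", "gender": "man", "age": 66, "condition": "control", "score": 11.0},
    {"participant": "p4", "gender": "woman", "age": 25, "condition": "treatment", "score": 10.0},
    {"participant": "p4", "gender": "woman", "age": 25, "condition": "control", "score": 9.0},
]


def condition_effect(rows, keep):
    """Mean treatment-minus-control difference among rows passing the `keep` filter."""
    subset = [r for r in rows if keep(r)]
    treated = [r["score"] for r in subset if r["condition"] == "treatment"]
    control = [r["score"] for r in subset if r["condition"] == "control"]
    if not treated or not control:
        raise ValueError("filter left no observations in one of the conditions")
    return mean(treated) - mean(control)


effect_all = condition_effect(TRIALS, lambda r: True)
# "Show effects for women over 60" expressed as a row filter.
effect_women_over_60 = condition_effect(
    TRIALS, lambda r: r["gender"] == "woman" and r["age"] > 60
)
print(effect_all, effect_women_over_60)
```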

This transforms multiverse analysis—the systematic exploration of how results vary across reasonable data-processing and analytical choices54—from a reporting burden into a native capability. Instead of cramming robustness checks into static appendices, researchers embed alternative scripts within the executable environment, allowing readers to probe a finding’s stability computationally. The appendix becomes a space where readers can extend analyses, test alternative hypotheses, and build directly on existing work without first deciphering the authors’ original narrative.
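A minimal sketch of such an embedded multiverse, assuming only two analytic choices (lower and upper reaction-time cutoffs), is given below. The data, the cutoff values, and the summary statistic (share of specifications with a positive effect) are illustrative assumptions; a real appendix would enumerate the choices the authors consider defensible.

```python
from itertools import product
from statistics import mean

# Fabricated reaction times (ms) per condition, for illustration only.
DATA = {
    "congruent": [420, 450, 480, 510, 1500],
    "incongruent": [500, 530, 560, 590, 200],
}

# Two analytic choices, each with alternative settings (assumed values).
LOWER_CUTOFFS = [0, 250]       # drop trials faster than this
UPPER_CUTOFFS = [1000, 2000]   # drop trials slower than this


def effect(data, lower, upper):
    """Incongruent-minus-congruent mean RT after trimming to [lower, upper]."""
    trimmed = {c: [rt for rt in rts if lower <= rt <= upper] for c, rts in data.items()}
    return mean(trimmed["incongruent"]) - mean(trimmed["congruent"])


# One estimate per combination of analytic choices.
multiverse = {
    (lower, upper): effect(DATA, lower, upper)
    for lower, upper in product(LOWER_CUTOFFS, UPPER_CUTOFFS)
}
for spec, estimate in sorted(multiverse.items()):
    print(spec, estimate)

# Fraction of specifications in which the effect is positive.
share_positive = sum(e > 0 for e in multiverse.values()) / len(multiverse)
print("share of specifications with a positive effect:", share_positive)
```

In this toy case the sign of the effect flips with the outlier rule, which is exactly the kind of fragility a reader could see at a glance when every specification is reported together.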

The framework can be adopted in stages, with a minimum viable version that most labs can implement now and a full version that depends on emerging infrastructure. A simple four‑tier roadmap is:

  • Tier 0: Data, code, and provenance. Authors share the analytic dataset and scripts used to generate the main results in a stable repository, with persistent identifiers, clear licenses, and a brief provenance note describing recruitment, inclusion criteria, and key preprocessing steps.
  • Tier 1: Tidy trial‑level data and basic metadata. Shared data are restructured into trial- or observation‑level tidy format and accompanied by a simple machine‑readable schema (e.g., JSON or YAML) that defines variable names, units, coding, and links between stimuli, participants, and conditions.
  • Tier 2: Executable environment and automated checks. The analysis is wrapped in a containerized environment (e.g., Docker or Singularity) together with a lightweight continuous‑integration script that reruns the main analyses and regenerates figures whenever code or data change, flagging breakage early.
  • Tier 3: Ontology linking and evidence‑network integration. Measures, tasks, and stimuli are linked to shared ontologies and persistent identifiers; effect estimates and study‑level metadata are exported in standardized form suitable for automatic ingestion into domain‑specific repositories and living meta‑analyses.

In this scheme, a minimum viable dual‑audience paper corresponds to Tiers 0–1—open data and code plus tidy trial‑level data with basic metadata. Tiers 2–3 realize the full framework: executable environments, automated robustness checks, semantic linking, and native participation in living evidence networks. Journals and funders can ratchet expectations gradually, first normalizing Tiers 0–1 and treating higher tiers as aspirational targets for consortia and well‑resourced teams, rather than insisting that the world jump directly to Tier 3.
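To indicate how little machinery Tier 1 requires, the sketch below pairs a small JSON schema with a validator that checks tidy rows against it. The schema keys (`variables`, `type`, `levels`, `unit`, `links_to`) and the variable names are assumptions made for this example; they are not drawn from Psych-DS, BIDS, or any other existing standard.

```python
import json

# A minimal machine-readable schema; its keys are assumptions for this sketch.
SCHEMA_JSON = """
{
  "variables": {
    "participant_id": {"type": "string", "links_to": "participants"},
    "stimulus_id":    {"type": "string", "links_to": "stimuli"},
    "condition":      {"type": "string", "levels": ["control", "treatment"]},
    "rt":             {"type": "number", "unit": "ms"}
  }
}
"""

PY_TYPES = {"string": str, "number": (int, float)}


def validate(rows, schema):
    """Return a list of human-readable problems; an empty list means the rows conform."""
    problems = []
    variables = schema["variables"]
    for i, row in enumerate(rows):
        for name in sorted(variables.keys() - row.keys()):
            problems.append(f"row {i}: missing '{name}'")
        for name in sorted(row.keys() - variables.keys()):
            problems.append(f"row {i}: undocumented variable '{name}'")
        for name, spec in variables.items():
            if name not in row:
                continue
            value = row[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
                problems.append(f"row {i}: '{name}' should be {spec['type']}")
            elif "levels" in spec and value not in spec["levels"]:
                problems.append(f"row {i}: '{name}' has unknown level '{value}'")
    return problems


schema = json.loads(SCHEMA_JSON)
rows = [
    {"participant_id": "p1", "stimulus_id": "stim:demo-0001", "condition": "control", "rt": 512},
    {"participant_id": "p2", "stimulus_id": "stim:demo-0002", "condition": "placebo", "rt": "fast"},
]
problems = validate(rows, schema)
print("\n".join(problems))
```

The same validator could run inside a Tier 2 continuous-integration script, so that schema drift is caught whenever data or code change.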

Box 1 | Automating the Jingle–Jangle Audit
The jingle fallacy conflates distinct phenomena under identical labels; the jangle fallacy fragments identical constructs across different names. Our framework embeds a dual-layer automated audit to detect both.

The first layer employs semantic and ontological analysis. A jingle detector cross-references construct names against shared ontologies (e.g., Cognitive Atlas), using graph algorithms to identify identical labels that occupy distinct semantic neighborhoods. In parallel, a jangle detector applies LLM embeddings to cluster scale items, flagging nominally different instruments whose item content shows high semantic convergence30,55.

The second layer provides empirical validation via statistical analysis56 of structured-appendix data and the living evidence network. For jangle detection, it tests extrinsic convergent validity by comparing correlation patterns between putatively different measures and external criteria; statistically indistinguishable patterns indicate empirical redundancy. For jingle detection, it tests discriminant validity by asking whether identically labeled measures show divergent correlations with theoretically distinct criteria; systematic differences reveal that a single label masks multiple constructs.

This dual-validation approach—semantic analysis for large-scale screening, statistical comparison for empirical confirmation—builds construct hygiene into the publication infrastructure. Continuous-integration scripts run these audits automatically, generating curator reports that flag redundancies and collisions before they propagate. Construct validation thus shifts from an occasional, labor‑intensive exercise to an automated quality-control mechanism operating at the scale and speed of modern scientific publishing.
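The statistical layer of Box 1 can be approximated, very roughly, by comparing the correlation profiles of two measures against the same external criteria. The sketch below is descriptive only: it reports the largest gap between profiles and does not perform the formal significance test that a production audit would need. The scores and the criterion names (`gpa`, `stress`) are fabricated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def profile(measure, criteria):
    """Correlations of one measure with each external criterion."""
    return {name: pearson(measure, values) for name, values in criteria.items()}


def max_profile_gap(measure_a, measure_b, criteria):
    """Largest absolute difference between two measures' correlation profiles."""
    pa, pb = profile(measure_a, criteria), profile(measure_b, criteria)
    return max(abs(pa[k] - pb[k]) for k in criteria)


# Fabricated scores for six participants.
scale_a = [1, 2, 3, 4, 5, 6]
scale_b = [3, 5, 7, 9, 11, 13]   # a linear rescaling of scale_a (jangle-like case)
scale_c = [6, 1, 5, 2, 4, 3]     # a differently behaving measure
criteria = {"gpa": [2, 1, 4, 3, 6, 5], "stress": [6, 5, 4, 3, 2, 1]}

gap_ab = max_profile_gap(scale_a, scale_b, criteria)
gap_ac = max_profile_gap(scale_a, scale_c, criteria)
print(round(gap_ab, 3), round(gap_ac, 3))
```

A near-zero gap between differently named scales would be a jangle candidate; a large gap between identically named scales would be a jingle candidate. Either flag would go to a curator rather than trigger an automatic merge or split.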

Transforming AI-assisted peer review

The dual‑audience framework also restructures peer review. A front‑loaded narrative lets reviewers judge novelty and significance within minutes, rather than spending hours parsing methods only to discover that a study is flawed or incremental.

Moreover, the structured appendix facilitates precise auditing and more rigorous evaluation. Current peer review rarely scrutinizes data and code: a recent large-scale attempt to rerun published code found a reproducibility rate below 6%, suggesting that reviewers largely take methodological claims on trust57. With structured appendices, reviewers can inspect configuration files and data, execute code, and verify that figures regenerate from the underlying data. This reduces the frustrating back-and-forth that plagues current review cycles.

Critically, this structure provides the essential infrastructure for responsible AI-assisted review (Fig. 2). Static, print-replica PDFs are poorly suited to AI systems that require structured access to tables and figures.

For example, autonomous AI agents relying only on prose in methods sections have failed to reproduce nearly half of published findings—applying incorrect or incomplete statistical methods when text is ambiguous or omits essential details58. Providing structured, machine-friendly content (e.g., CSV, Markdown, Git) would unlock new opportunities for AI-driven quality control: validating references, auditing logical consistency, checking mathematical and statistical accuracy, and systematically verifying the structured appendix. Verification can then move beyond confirming that code runs to automated multiverse analyses that vary data‑processing choices and analytic parameters to map the fragility of a study’s conclusions.

Beyond computational verification, the structured format also improves semantic clarity. Formal ontologies impose precise, unambiguous definitions on core constructs, resolving where theoretical disagreements genuinely lie59. Review shifts from semantic excavation to focused scientific dialogue. Human reviewers receive both the manuscript and an AI‑generated audit report, freeing them to concentrate on interpretive claims, novelty, and overall significance.

Figure 2. Hybrid AI–human peer review enabled by structured appendices. Authors submit a package containing a front‑loaded manuscript and structured appendix (code, data, containers, ontologies). AI review agents run reproducibility, multiverse‑robustness, and citation‑integrity checks to generate an audit summarizing status, warnings, and fragility. Human reviewers and editors use this audit alongside the narrative to request targeted author responses, assess conceptual novelty and theoretical contribution, adjudicate interpretation versus evidence, and make final publication decisions.

From isolated papers to networked knowledge

Beyond improving communication and quality assurance, the dual-audience framework creates the technical foundation for systematic knowledge integration. Papers can be linked into living evidence networks—extensions of the manually intensive “living systematic review” model60—in which each publication becomes a node that automatically contributes to evolving synthesis.

Consider how systematic reviews and meta-analyses currently work: researchers manually search databases, screen thousands of abstracts, extract data from hundreds of PDFs, code effect sizes, and produce static summary estimates—a process that typically occupies about five researchers for a year61. Yet by publication, many meta-analyses are already outdated: median “shelf life” is 5.5 years, with 23% requiring updates within two years and 7% being obsolete on arrival62. New studies accumulate in limbo until another large manual effort is mounted. This process is further compromised by systematic publication bias: null or inconvenient results remain buried in file drawers while false positives proliferate63. Fig. 3 contrasts this static, labor-intensive pipeline with the living evidence network enabled when structured papers feed domain-specific repositories and continuously updated meta-analyses.

Living evidence networks invert this workflow. When researchers publish under this framework, whether in journals or public registries64, their structured appendices automatically populate theme-specific evidence repositories. A new report on mindfulness and anxiety immediately contributes its effect sizes, sample characteristics, and methodological features to the running meta-analysis on that topic, regardless of whether results are significant, null, or contradictory. Pooled estimates update in real time. When a paper is retracted, its data points are removed from downstream syntheses within hours rather than years, automating and accelerating the update process currently managed by teams conducting living systematic reviews65.
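The update logic described here can be sketched with a fixed-effect, inverse-variance pooled estimate that is recomputed whenever a study is added or retracted. The fixed-effect model is a simplifying assumption chosen for brevity (a real living meta-analysis would likely use a random-effects model), and the study identifiers, effect sizes, and variances are fabricated.

```python
import math


class LivingMetaAnalysis:
    """Fixed-effect inverse-variance pooling, re-estimated on every change."""

    def __init__(self):
        self.studies = {}  # study_id -> (effect size, sampling variance)

    def add(self, study_id, effect, variance):
        if variance <= 0:
            raise ValueError("sampling variance must be positive")
        self.studies[study_id] = (effect, variance)

    def retract(self, study_id):
        # Removing a study changes the next pooled estimate immediately.
        self.studies.pop(study_id, None)

    def pooled(self):
        """Return (pooled estimate, standard error), or None with no studies."""
        if not self.studies:
            return None
        weights = {k: 1.0 / v for k, (_, v) in self.studies.items()}
        total = sum(weights.values())
        estimate = sum(weights[k] * e for k, (e, _) in self.studies.items()) / total
        return estimate, math.sqrt(1.0 / total)


meta = LivingMetaAnalysis()
meta.add("study-A", effect=0.40, variance=0.04)
meta.add("study-B", effect=0.10, variance=0.01)
after_two = meta.pooled()[0]

meta.add("study-C", effect=-0.05, variance=0.02)  # a null-ish result still contributes
after_three = meta.pooled()[0]

meta.retract("study-A")
after_retraction = meta.pooled()[0]
print(after_two, after_three, after_retraction)
```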

In addition, the same structured knowledge base becomes training data for AI models that predict outcomes for novel interventions or specific populations38. The system can forecast the likely efficacy of a hypothetical intervention for a given demographic, turning evidence synthesis from retrospective summary into a tool for direct, actionable guidance.

Figure 3. From static meta-analyses to living evidence networks. (A) Traditional workflow in which scattered PDFs feed a one-off meta-analysis requiring months or years of manual screening and coding, yielding a synthesis that is already outdated at publication. (B) Structured papers with appendices support automatic extraction of effect sizes, moderators, and quality indicators into domain-specific repositories. (C) Living evidence network in which domain-specific repositories continuously update living meta-analyses and cross-domain syntheses; new studies, retractions, and registered null results dynamically alter study weights, providing real-time evidence for clinical guidelines, policy briefs, and AI prediction models.

This framework requires two additional infrastructure components beyond the semantic anchoring and ontological mapping already described. First, research artefacts—datasets, code, preregistrations, and derived effect-size tables—must be treated as versioned, citable objects. Version-control systems such as Git provide transparent audit trails, but propagating corrections requires additional layers: persistent identifiers, standardized cross‑reference metadata (e.g., DataCite‑style schemas), explicit dependency graphs linking studies to syntheses, and registries that index these links. When a dataset or analysis is corrected or retracted, systems built on this graph can automatically flag and update downstream meta‑analyses and evidence syntheses, instead of relying on formal errata and retractions that are slow and cumbersome to issue66 and often remain invisible for years67.
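One way to picture the dependency graph is as a mapping from each artefact to the artefacts built on it, traversed breadth-first when something is corrected or retracted. The sketch below assumes such a mapping already exists; the identifiers are hypothetical placeholders rather than real DOIs, and a real registry would derive the edges from cross-reference metadata.

```python
from collections import deque

# Edges point from an artefact to the artefacts that depend on it.
# Identifiers are hypothetical placeholders.
DEPENDENTS = {
    "dataset:A": ["effects:A"],
    "effects:A": ["meta:anxiety-v3"],
    "dataset:B": ["effects:B"],
    "effects:B": ["meta:anxiety-v3", "meta:sleep-v1"],
    "meta:anxiety-v3": ["guideline:draft-2"],
}


def affected_by(graph, source):
    """Breadth-first walk returning every downstream artefact of `source`, nearest first."""
    seen, order = {source}, []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# If dataset:B is corrected or retracted, these artefacts need re-evaluation.
flagged = affected_by(DEPENDENTS, "dataset:B")
print(flagged)
```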

Second, quality indicators—sample size, pre-registration status, replication attempts, methodological rigor scores—must dynamically weight each study’s contribution to pooled estimates. High-quality, preregistered studies with large samples should carry more influence than exploratory work with convenience samples, so that evidence synthesis reflects both the quantity and quality of available research.
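A simple, admittedly ad hoc way to express such weighting is to scale each study's inverse-variance weight by a quality multiplier. The rubric below (a base of 0.5, plus 0.25 each for preregistration and successful replication) is invented purely for illustration, as are the two studies; any real scheme would need empirical justification and community agreement.

```python
def quality_score(study):
    """Map quality indicators to a multiplier in (0, 1]; the rubric is invented."""
    score = 0.5
    if study["preregistered"]:
        score += 0.25
    if study["replicated"]:
        score += 0.25
    return score


def quality_weighted_estimate(studies):
    """Pool effects with weights = quality multiplier / sampling variance."""
    weights = [quality_score(s) / s["variance"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)


# Fabricated studies for illustration.
STUDIES = [
    {"id": "prereg-large", "effect": 0.10, "variance": 0.01,
     "preregistered": True, "replicated": True},
    {"id": "exploratory-small", "effect": 0.60, "variance": 0.04,
     "preregistered": False, "replicated": False},
]

# Plain inverse-variance pooling, for comparison.
plain = (
    sum(s["effect"] / s["variance"] for s in STUDIES)
    / sum(1 / s["variance"] for s in STUDIES)
)
weighted = quality_weighted_estimate(STUDIES)
print(round(plain, 3), round(weighted, 3))
```

Here the quality multiplier pulls the pooled estimate toward the preregistered, replicated study, beyond what sample-size-driven precision alone would do.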

Barriers and failure modes

The framework outlined above is aspirational; without attention to implementation constraints, it could reinforce existing inequities. Building executable containers, tidy trial‑level datasets, and ontological mappings requires technical capacity that many small labs, non‑elite institutions, and practitioner settings do not yet have. If structured appendices become de facto requirements without parallel investments in infrastructure, training, and credit, adoption will be slow and skewed toward well‑resourced groups. Standardization for shared ontologies and demographic vocabularies also entails substantial coordination costs and risks freezing contested constructs or marginalizing alternative theoretical traditions unless governance is explicitly pluralistic, revisable, and transparent.

Sensitive and confidential data pose a different failure mode. In clinical, educational, and small or marginalized populations, naive mandates for fully open trial‑level data collide with privacy protections, data‑sovereignty claims, and legal or ethical constraints. The dual‑audience architecture must therefore support tiered access, secure data enclaves, and remote‑execution or synthetic‑data solutions so that code and metadata remain reusable even when raw data cannot be widely shared.

Finally, lowering the friction for reanalysis also lowers the friction for motivated misuse. Interactive appendices can make it easier for ideological actors to cherry‑pick specifications, ignore multiverse fragility or quality weights, and promote “do your own research” narratives that overstate the certainty of convenient results. Design choices can mitigate these risks by foregrounding robustness summaries rather than single estimates, making departures from preregistered analyses and default pipelines explicitly visible, and tying reanalyses back into version‑controlled living evidence networks where idiosyncratic claims are evaluated against the full corpus rather than in isolation. These barriers and failure modes are not reasons to abandon the framework but constraints that should shape governance, incentive design, and support from publishers and institutions.

Implementation and governance options

Publishers face an existential choice: continue distributing static PDFs as their gatekeeping role erodes under funder and institutional open‑access mandates, or help build the infrastructure that turns those same papers into queryable research environments. Scholarly societies, disciplinary and institutional repositories, and research libraries likewise must decide whether to merely mirror static PDFs or adopt shared formats for structured, executable paper packages that any interface can use.

For any host, two implementation paths emerge. One is to develop native AI systems integrated directly into their platforms—tools trained on scientific content with domain expertise in methodology, statistics, and interpretation. The other is to leverage browser‑ or operating‑system–level AI via secure APIs that let tools such as Chrome’s Gemini68 access full paper contexts, including structured appendices and executable code, rather than only rendered text.

Either approach enables text‑based services such as on‑demand translation that preserves technical precision, personalized summarization tailored to reader expertise, and advanced Q&A that can execute code for reanalysis or generate new visualizations. Audio and video services could automatically produce conversational podcasts or video summaries, making research accessible across formats and languages.

To remain viable as AI systems and interfaces evolve, these implementations should rest on open, versioned standards. The archival “paper package”—identifiers, metadata, data, and executable code—must remain portable across hosts and decoupled from any single AI model or user interface, so that the scientific record outlives particular vendors and tools.
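To make the idea of a portable package concrete, the sketch below shows one possible manifest that records the four components named above and round-trips through plain JSON, so that it carries no dependency on a particular host, model, or interface. The `PaperPackage` class, its field names, and the placeholder identifier are assumptions made for illustration; they are not an existing standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PaperPackage:
    """Hypothetical manifest for an archival paper package."""

    identifier: str  # persistent identifier (placeholder value below)
    version: str  # version of this package, not of any tool
    metadata: dict = field(default_factory=dict)
    data_files: list = field(default_factory=list)
    code_entrypoint: str = ""

    def to_json(self):
        # Plain JSON keeps the manifest readable by any host or tool.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


package = PaperPackage(
    identifier="doi:10.0000/example",  # placeholder, not a real DOI
    version="1.0.0",
    metadata={"title": "Example study", "license": "CC-BY-4.0"},
    data_files=["data/trials.csv"],
    code_entrypoint="analysis/run.py",
)

# Simulate moving the package between hosts: serialize, then restore.
restored = PaperPackage.from_json(package.to_json())
print(restored == package)
```

The design choice illustrated is that nothing in the manifest names a vendor or interface; any AI tool or reader application is a consumer of the package rather than part of it.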

These services imply a shift from charging primarily for document access toward supporting authenticated analytical capability. Funding models may range from institutional support and public infrastructure to subscription‑based access to advanced tools, while the underlying narrative text and structured appendices remain findable and portable even when specific interfaces are restricted. For sensitive data, access would operate through institutional agreements that create a trusted “data commons” of authorized researchers.

Proofs of concept already exist. The Resource Identification Initiative standardizes identifiers for research materials46. Databrary provides infrastructure for sharing research data, including sensitive data53. EBRAINS “Live Papers” bundle code, data, and computational models for interactive simulation within neuroscience publications69. eLife’s Executable Research Articles let readers inspect, modify, and rerun the code that generates figures and tables in the browser. Schol-AR embeds manipulable visualizations into articles70. Code Ocean packages complete executable environments71. These pioneers demonstrate technical feasibility, but the transition from static document to dynamic resource remains fragmented across isolated initiatives.

The next evolution demands clear governance as well as technical innovation. Three questions are central. First, platforms hosting AI–paper interactions must protect the privacy of user queries and interaction data. Second, they must distinguish policies for confidential peer-review materials from those for published content and avoid sending unpublished manuscripts to external commercial AI systems that may retain proprietary data72. Third, they must state explicitly whether and how published content or interaction logs are used for model training, and on what legal and ethical basis. Any organization implementing AI solutions should guarantee privacy-first architectures in which user interactions are secure and encrypted, and in which data uses—including for training—are transparent and subject to meaningful consent and oversight.

Role of institutions

Universities and research institutions control the most powerful lever for adoption: career incentives. Promotion and tenure committees have historically undervalued digital research assets, creating a systemic disincentive to produce the high-quality data and code essential for a cumulative science73.

Reform therefore must be explicit and immediate. Institutions should revise hiring, promotion, and tenure criteria to prioritize work that advances long-term scientific progress, such as structured appendices and systematic consensus building—time-consuming but essential work that yields shared terminologies and methodological standards, making structured data interoperable and meaningful50. This aligns with the Coalition for Advancing Research Assessment (CoARA), which treats diverse contributions—datasets, software, code, and protocols—as legitimate scholarly products beyond journal impact factors.

But recognition alone is insufficient. Institutions must provide funding, computational resources, version-control systems, and technical training that make comprehensive documentation feasible rather than burdensome. Research libraries should expand from literature access to data curation, helping faculty turn messy research outputs into structured, queryable packages.

Early adopters gain competitive advantages. As AI-driven discovery tools emerge, institutions producing structured, machine-readable research will see their faculty become more visible and influential—if improved citation of open data is any guide74—creating systematic advantages in knowledge synthesis and collaboration.

Toward dynamic scientific authority

This transformation alters the epistemological relationship between author and audience. Traditionally, authority rests with the author’s narrative in the main text. Once raw data become directly accessible through AI queries, authority shifts toward machine-mediated analysis of evidence. Readers can pose counterfactual questions—“Re-run the analysis excluding subjects over 65” or “Plot the data using logarithmic scaling”—as computational operations rather than requests to authors. This moves beyond static replication toward dynamic exploration, making robustness checks, alternative specifications, and hypothesis generation cheap. Such interactive reanalyses remain exploratory; unbiased confirmatory tests still require prespecified design and analyses (ideally preregistered) evaluated on fresh or held‑out data not used to generate the hypotheses or analytic decisions.
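A request such as "Re-run the analysis excluding subjects over 65" reduces to a parameter change once data and code are available together. The sketch below is a deliberately simple illustration under stated assumptions: the rows are made up, the analysis is a plain difference in group means, and the function name `mean_difference` and its `max_age` parameter are hypothetical.

```python
from statistics import mean


def mean_difference(rows, max_age=None):
    """Treatment-minus-control difference in mean outcome.

    If max_age is given, participants older than max_age are excluded first.
    """
    kept = [r for r in rows if max_age is None or r["age"] <= max_age]
    treated = [r["outcome"] for r in kept if r["group"] == "treatment"]
    control = [r["outcome"] for r in kept if r["group"] == "control"]
    return mean(treated) - mean(control)


# Made-up trial-level rows, for illustration only.
rows = [
    {"age": 30, "group": "treatment", "outcome": 5.0},
    {"age": 70, "group": "treatment", "outcome": 9.0},
    {"age": 40, "group": "control", "outcome": 4.0},
    {"age": 68, "group": "control", "outcome": 4.0},
]

full_sample = mean_difference(rows)
under_65 = mean_difference(rows, max_age=65)
print(full_sample, under_65)
```

Because the exclusion is an explicit argument, an interface could log it alongside the result, which makes the departure from the original specification visible and keeps the reanalysis clearly labeled as exploratory.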

This shift poses new questions for scientific practice: how theoretical synthesis should guide and interpret increasingly discoverable empirical patterns; how authority is distributed when any reader can interrogate data via AI interfaces; and how peer review should balance algorithmic validation of methods with human judgment about theory and significance.

This framework creates discipline‑agnostic infrastructure that both addresses current crises and positions science for AI‑enabled discovery: persistent identifiers, standardized metadata and provenance, executable analysis environments, and versioned audit trails. Psychology serves here as a stress test because its core objects—stimulus sets, tasks, and jingle–jangle‑prone constructs—are unusually tangled; other domains can substitute their own reagents, instruments, specimens, or datasets on the same scaffold. Widespread adoption would generate high‑quality, structured research artefacts that surpass the unstructured text currently training most models, providing a foundation for AI systems in which causal models, statistical engines, and verification tools operate on verifiable inputs. Rather than passively accepting commercial AI tools, the academic community must define how these pipelines integrate with scientific values of transparency and rigor.

References

1          Bornmann, L., Haunschild, R. & Mutz, R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanities and Social Sciences Communications 8, 224 (2021). https://doi.org/10.1057/s41599-021-00903-w

2          National Science Board. Publications output: U.S. trends and international comparisons. (National Science Foundation, 2024).

3          Tenopir, C., King, D. W., Christian, L. & Volentine, R. Scholarly article seeking, reading, and use: A continuing evolution from print to electronic in the sciences and social sciences. Learned Publishing 28, 93-105 (2015). https://doi.org/10.1087/20150203

4          Lin, Z. Why and how to embrace AI such as ChatGPT in your academic life. R. Soc. Open Sci. 10, 230658 (2023). https://doi.org/10.1098/rsos.230658

5          Flake, J. K. & Fried, E. I. Measurement schmeasurement: Questionable measurement practices and how to avoid them. Adv. Meth. Pract. Psychol. Sci. 3, 456-465 (2020). https://doi.org/10.1177/2515245920952393

6          Clark, H. H. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior 12, 335-359 (1973). https://doi.org/10.1016/S0022-5371(73)80014-3

7          Yarkoni, T. The generalizability crisis. Behav. Brain Sci. 45, e1 (2020). https://doi.org/10.1017/s0140525x20001685

8          Sterling, E., Pearl, H., Liu, Z., Allen, J. W. & Fleischer, C. C. Demographic reporting across a decade of neuroimaging: A systematic review. Brain Imaging and Behavior 16, 2785-2796 (2022). https://doi.org/10.1007/s11682-022-00724-8

9          Tedersoo, L. et al. Data sharing practices and data availability upon request differ across scientific disciplines. Scientific Data 8, 192 (2021). https://doi.org/10.1038/s41597-021-00981-0

10        Hardwicke, T. E. et al. Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspect. Psychol. Sci. 17, 239-251 (2021). https://doi.org/10.1177/1745691620979806

11        National Academies of Sciences, Engineering, and Medicine; Division of Behavioral and Social Sciences and Education; Board on Behavioral, Cognitive, and Sensory Sciences; Committee on Accelerating Behavioral Science through Ontology Development and Use. in Ontologies in the Behavioral Sciences: Accelerating Research and the Spread of Knowledge (eds A. S. Beatty & R. M. Kaplan) (National Academies Press (US), 2022).

12        Hanson, M. A., Barreiro, P. G., Crosetto, P. & Brockington, D. The strain on scientific publishing. Quantitative Science Studies 5, 823-843 (2024). https://doi.org/10.1162/qss_a_00327

13        Diconne, K., Kountouriotis, G. K., Paltoglou, A. E., Parker, A. & Hostler, T. J. Presenting KAPODI—the searchable database of emotional stimuli sets. Emotion Review 14, 84-95 (2022). https://doi.org/10.1177/17540739211072803

14        Lin, Z., Ma, Q., Huang, X., Wu, X. & Zhang, Y. Pervasive failure to report properties of visual stimuli in experimental research in psychology and neuroscience: Two metascientific studies. Psychol. Bull. 149, 487-505 (2023).

15        Lang, P. J., Bradley, M. M. & Cuthbert, B. N. International affective picture system (IAPS): Affective ratings of pictures and instruction manual. (NIMH, Center for the Study of Emotion & Attention, 2005).

16        Kurdi, B., Lozano, S. & Banaji, M. R. Introducing the Open Affective Standardized Image Set (OASIS). Behav. Res. Methods 49, 457-470 (2017). https://doi.org/10.3758/s13428-016-0715-3

17        Dan-Glauser, E. S. & Scherer, K. R. The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behav. Res. Methods 43, 468-477 (2011). https://doi.org/10.3758/s13428-011-0064-1

18        Marchewka, A., Żurawski, Ł., Jednoróg, K. & Grabowska, A. The Nencki Affective Picture System (NAPS): Introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behav. Res. Methods 46, 596-610 (2014). https://doi.org/10.3758/s13428-013-0379-1

19        Weierich, M. R., Kleshchova, O., Rieder, J. K. & Reilly, D. M. The Complex Affective Scene Set (COMPASS): Solving the social content problem in affective visual stimulus sets. Collabra: Psychology 5, 53 (2019). https://doi.org/10.1525/collabra.256

20        Mancuso, V. et al. IAVRS—International Affective Virtual Reality System: Psychometric assessment of 360° images by using psychophysiological data. Sensors 24 (2024).

21        Balsamo, M., Carlucci, L., Padulo, C., Perfetti, B. & Fairfield, B. A bottom-up validation of the IAPS, GAPED, and NAPS affective picture databases: Differential effects on behavioral performance. Front Psychol 11 (2020). https://doi.org/10.3389/fpsyg.2020.02187

22        Michelson, M. & Reuter, K. The significant cost of systematic reviews and meta-analyses: A call for greater involvement of machine learning to assess the promise of clinical trials. Contemporary Clinical Trials Communications 16, 100443 (2019). https://doi.org/10.1016/j.conctc.2019.100443

23        Thorndike, E. L. Theory of mental and social measurements. (The Science Press, 1904).

24        Kelley, T. L. Interpretation of educational measurements. (World Book Company, 1927).

25        van Zyl, L. E. & Rothmann, S. Grand challenges for positive psychology: Future perspectives and opportunities. Front Psychol 13, 833057 (2022). https://doi.org/10.3389/fpsyg.2022.833057

26        Baggetta, P. & Alexander, P. A. Conceptualization and operationalization of executive function. Mind, Brain, and Education 10, 10-33 (2016). https://doi.org/10.1111/mbe.12100

27        Karr, J. E. et al. The unity and diversity of executive functions: A systematic review and re-analysis of latent variable studies. Psychol. Bull. 144, 1147-1185 (2018). https://doi.org/10.1037/bul0000160

28        Anvari, F. et al. Defragmenting psychology. Nat. Hum. Behav. 9, 836-839 (2025). https://doi.org/10.1038/s41562-025-02138-0

29        Ponnock, A. et al. Grit and conscientiousness: Another jangle fallacy. J Res Pers 89, 104021 (2020). https://doi.org/10.1016/j.jrp.2020.104021

30        Wulff, D. U. & Mata, R. Semantic embeddings reveal and address taxonomic incommensurability in psychological measurement. Nat. Hum. Behav. 9, 944-954 (2025). https://doi.org/10.1038/s41562-024-02089-y

31        von Hippel, P. T. & Schuetze, B. A. How not to fool ourselves about heterogeneity of treatment effects. Adv. Meth. Pract. Psychol. Sci. 8, 25152459241304347 (2025). https://doi.org/10.1177/25152459241304347

32        Call, C. C. et al. An ethics and social-justice approach to collecting and using demographic data for psychological researchers. Perspect. Psychol. Sci. 18, 979-995 (2022). https://doi.org/10.1177/17456916221137350

33        Danchev, V., Min, Y., Borghi, J., Baiocchi, M. & Ioannidis, J. P. A. Evaluation of data sharing after implementation of the International Committee of Medical Journal Editors data sharing statement requirement. JAMA Network Open 4, e2033972-e2033972 (2021). https://doi.org/10.1001/jamanetworkopen.2020.33972

34        Wicherts, J. M., Borsboom, D., Kats, J. & Molenaar, D. The poor availability of psychological research data for reanalysis. Am. Psychol. 61, 726-728 (2006). https://doi.org/10.1037/0003-066X.61.7.726

35        Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94-97 (2014). https://doi.org/10.1016/j.cub.2013.11.014

36        Hardwicke, T. E. & Ioannidis, J. P. A. Populating the Data Ark: An attempt to retrieve, preserve, and liberate data from the most highly-cited psychology and psychiatry articles. PLOS ONE 13, e0201856 (2018). https://doi.org/10.1371/journal.pone.0201856

37        Tierney, J. F., Stewart, L. A., Clarke, M. & on behalf of the Cochrane Individual Participant Data Meta-analysis Methods Group. in Cochrane Handbook for Systematic Reviews of Interventions 643-658 (2019).

38        Castro, O., Mair, J., von Wangenheim, F. & Kowatsch, T. in Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies – HEALTHINF. 671-678 (SciTePress).

39        Kobak, D., González-Márquez, R., Horvát, E.-Á. & Lause, J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances 11, eadt3813 (2025). https://doi.org/10.1126/sciadv.adt3813

40        Liang, W. et al. Quantifying large language model usage in scientific papers. Nat. Hum. Behav. (2025). https://doi.org/10.1038/s41562-025-02273-8

41        Walters, W. H. & Wilder, E. I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 13, 14045 (2023). https://doi.org/10.1038/s41598-023-41032-5

42        Boettiger, C. An introduction to Docker for reproducible research. SIGOPS Oper. Syst. Rev. 49, 71–79 (2015). https://doi.org/10.1145/2723872.2723882

43        Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: Scientific containers for mobility of compute. PLOS ONE 12, e0177459 (2017). https://doi.org/10.1371/journal.pone.0177459

44        Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316-319 (2017). https://doi.org/10.1038/nbt.3820

45        Moreau, D., Wiebels, K. & Boettiger, C. Containers for computational reproducibility. Nature Reviews Methods Primers 3, 50 (2023). https://doi.org/10.1038/s43586-023-00236-9

46        Bandrowski, A. et al. The Resource Identification Initiative: A cultural shift in publishing. J. Comp. Neurol. 524, 8-22 (2016). https://doi.org/10.1002/cne.23913

47        Denissen, M., Pöll, B., Robbins, K., Makeig, S. & Hutzler, F. HED LANG – A Hierarchical Event Descriptors library extension for annotation of language cognition experiments. Scientific Data 11, 1428 (2024). https://doi.org/10.1038/s41597-024-04282-0

48        Sharp, C., Kaplan, R. M. & Strauman, T. J. The use of ontologies to accelerate the behavioral sciences: Promises and challenges. Curr. Dir. Psychol. Sci. 32, 418-426 (2023). https://doi.org/10.1177/09637214231183917

49        Poldrack, R. A. et al. The Cognitive Atlas: Toward a knowledge foundation for cognitive neuroscience. Frontiers in Neuroinformatics 5 (2011). https://doi.org/10.3389/fninf.2011.00017

50        Leising, D., Liesefeld, H., Buecker, S., Glöckner, A. & Lortsch, S. A tentative roadmap for consensus building processes. Personality Science 5, 27000710241298610 (2024). https://doi.org/10.1177/27000710241298610

51        Schenk, P. et al. An ontological framework for organising and describing behaviours: The Human Behaviour Ontology. Wellcome Open Research 9 (2025). https://doi.org/10.12688/wellcomeopenres.21252.2

52        Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

53        Gilmore, R. O., Kennedy, J. L. & Adolph, K. E. Practical solutions for sharing data and materials from psychological research. Adv. Meth. Pract. Psychol. Sci. 1, 121-130 (2018). https://doi.org/10.1177/2515245917746500

54        Steegen, S., Tuerlinckx, F., Gelman, A. & Vanpaemel, W. Increasing transparency through a multiverse analysis. Perspect. Psychol. Sci. 11, 702-712 (2016). https://doi.org/10.1177/1745691616658637

55        Huang, Z., Long, Y., Peng, K. & Tong, S. An embedding-based semantic analysis approach: A preliminary study on redundancy detection in psychological concepts operationalized by scales. Journal of Intelligence 13 (2025). https://doi.org/10.3390/jintelligence13010011

56        Gonzalez, O., MacKinnon, D. P. & Muniz, F. B. Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research 56, 3-19 (2021). https://doi.org/10.1080/00273171.2019.1707061

57        Samuel, S. & Mietchen, D. Computational reproducibility of Jupyter notebooks from biomedical publications. GigaScience 13, giad113 (2024). https://doi.org/10.1093/gigascience/giad113

58        Dobbins, N., Xiong, C., Lan, K. & Yetisgen, M. Large language model-based agents for automated research reproducibility: An exploratory study in Alzheimer’s disease. arXiv:2505.23852 (2025). https://doi.org/10.48550/arXiv.2505.23852

59        Michie, S. et al. Developing and using ontologies in behavioural science: addressing issues raised. Wellcome Open Research 7 (2023). https://doi.org/10.12688/wellcomeopenres.18211.2

60        Elliott, J. H. et al. Living systematic reviews: An emerging opportunity to narrow the evidence-practice gap. PLoS Med. 11, e1001603 (2014). https://doi.org/10.1371/journal.pmed.1001603

61        Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7, e012545 (2017). https://doi.org/10.1136/bmjopen-2016-012545

62        Shojania, K. G. et al. How quickly do systematic reviews go out of date? A survival analysis. Ann. Intern. Med. 147, 224-233 (2007). https://doi.org/10.7326/0003-4819-147-4-200708210-00179

63        Franco, A., Malhotra, N. & Simonovits, G. Underreporting in psychology experiments: Evidence from a study registry. Social Psychological and Personality Science 7, 8-12 (2016). https://doi.org/10.1177/1948550615598377

64        Laitin, D. D. et al. Reporting all results efficiently: A RARE proposal to open up the file drawer. Proc. Natl. Acad. Sci. U. S. A. 118, e2106178118 (2021). https://doi.org/10.1073/pnas.2106178118

65        Butler, A. R., Hartmann-Boyce, J., Livingstone-Banks, J., Turner, T. & Lindson, N. Optimizing process and methods for a living systematic review: 30 search updates and three review updates later. J. Clin. Epidemiol. 166, 111231 (2024). https://doi.org/10.1016/j.jclinepi.2023.111231

66        Kane, A. & Amin, B. Amending the literature through version control. Biol. Lett. 19, 20220463 (2023). https://doi.org/10.1098/rsbl.2022.0463

67        Budd, J. M., Sievert, M., Schultz, T. R. & Scoville, C. Effects of article retraction on citation and practice in medicine. Bull. Med. Libr. Assoc. 87, 437-443 (1999).

68        Lin, Z. FOCUS: an AI-assisted reading workflow for information overload. Nat. Biotechnol. 43, 2070-2075 (2025). https://doi.org/10.1038/s41587-025-02947-8

69        Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience. Neuroinformatics 21, 101-113 (2023). https://doi.org/10.1007/s12021-022-09598-z

70        Ard, T. et al. Integrating data directly into publications with augmented reality and web-based technologies – Schol-AR. Scientific Data 9, 298 (2022). https://doi.org/10.1038/s41597-022-01426-y

71        Perkel, J. M. Make code accessible with these cloud services. Nature 575, 247-248 (2019). https://doi.org/10.1038/d41586-019-03366-x

72        Lin, Z. Towards an AI policy framework in scholarly publishing. Trends Cogn. Sci. 28, 85-88 (2024). https://doi.org/10.1016/j.tics.2023.12.002

73        Puebla, I. et al. Ten simple rules for recognizing data and software contributions in hiring, promotion, and tenure. PLoS Comput. Biol. 20, e1012296 (2024). https://doi.org/10.1371/journal.pcbi.1012296

74        Piwowar, H. A. & Vision, T. J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013). https://doi.org/10.7717/peerj.175

Editors

Kathryn Zeiler
Editor-in-Chief

Alex Holcombe
Handling Editor

Editorial assessment

by Alex Holcombe

DOI: 10.70744/MetaROR.321.1.ea

Reviewers found the manuscript’s motivation, to restructure scientific publishing to serve both human and AI readers, worthwhile and timely. The reviewers suggested that the manuscript would be improved by engaging more with previous work, including initiatives related to different elements of the proposal such as the FAIR principles, the Research Object concept that included the idea of machine-readable appendices, the Force11 manifesto, and machine-actionable publishing efforts at some journals, each of which has arguably contributed to making scientific articles more machine-readable. Additionally, the manuscript’s treatment of how AI systems actually process scientific literature was seen by the reviewers as needing updating, and would benefit from discussion of risks such as AI hallucination in proposed automated auditing functions.

Reviewers also raised some concerns about practical implementation, including a need for evidence and/or a stronger argument to back the proposed tiered adoption roadmap. It would also be worthwhile for the framework to make explicit its apparent assumption of a certain level of data and code sharing, which remains uncommon in many disciplines. One reviewer, an ethicist, raised concerns about accountability if significant portions of a paper are written for AI consumption, as that could make them less directly interpretable by humans.

Recommendations for enhanced transparency

  • Add author ORCID iD.
  • Add a competing interest statement. Authors should report all competing interests, including not only financial interests, but any role, relationship, or commitment of an author that presents an actual or perceived threat to the integrity or independence of the research presented in the article. If no competing interests exist, authors should explicitly state this.
  • Add a funding source statement. Authors should report all funding in support of the research presented in the article. Grant reference numbers should be included. If no funding sources exist, explicitly state this in the article.

For more information on these recommendations, please refer to our author guidelines.

Competing interests: None.

Peer review 1

David Resnik

DOI: 10.70744/MetaROR.321.1.rv1

This paper argues for a new, bifurcated model of scientific papers: a human-readable section and an AI-readable section. The rationale is to better enable AI systems to synthesize the scientific literature, to facilitate peer review, and to promote more rigor in science. The idea is original, important, and merits further discussion and debate. I would like to raise some philosophical and ethical concerns with the idea that the authors do not adequately address.

1. Who would take responsibility for the paper? The human? The AI? It seems that humans could not be responsible for large sections of the paper because they are written by AIs so that they will be understandable to AIs. But if humans can’t really take responsibility for what has been done, this creates a very dangerous situation. The AI sections of the paper could contain dangerous information, for example, on how to make bioweapons, that is understandable to AIs but not to humans, and the humans could do nothing about it. “Human in the loop” is a big theme in AI ethics, but what the authors have proposed seems to take humans dangerously out of the loop.

2. Since authorship and responsibility go hand in hand, this of course raises major authorship questions.

3. It also seems to raise epistemological issues about knowledge that transcends human understanding (can this even be considered knowledge?) and about the democratization of science. Making science even more technical than it already is seems to make it even less democratic.

4. It also seems that this approach might be more applicable to some fields than to others: for example, computer science and other highly technical disciplines, but not the humanities (philosophy? law?) and maybe not even the social sciences. This needs to be addressed.

5. Deskilling of humans is an issue too. If we get the dumbed-down version, we get dumber.

See Resnik, DB, Hosseini, M & Hauswald, R. Autonomous artificial intelligence, scientific research, and human values. AI Ethics 6, 141 (2026). https://doi.org/10.1007/s43681-025-00908-0. This article touches on human in the loop issues and related issues with respect to AI agents, which raise similar concerns.

Competing interests: None.

Peer review 2

Iratxe Puebla

DOI: 10.70744/MetaROR.321.1.rv2

The manuscript proposes a new format for research articles that includes front-loading summaries of the work for human readers and executable appendices that include the data, code and other research artefacts necessary so that machines can execute the analyses reported in the article.

There are a number of things that I like about the proposal such as leveraging new digital technologies for article publishing, and the focus on maximizing the reproducibility and openness of scholarly work. I like the idea of front-loading the article with information to make it easier for readers to grasp content and decide whether it is relevant to them. I could imagine something similar to the eLife summary, that includes structured designations of rigor and novelty to assist with consistent assessment across articles. I liked the mention in the perspective about a clear signal about the limitations of the work – I view this as a key trust signal valuable to readers and missed some further elaboration of what that would look like.

I also like the idea of an AI‑generated audit report for peer reviewers. Many journals already apply AI-based checks on papers, so expanding that and making it available for papers that proceed to peer review would add transparency on journal processes – and may prevent instances of reviewers adding the papers to an AI tool (against journal policy) to generate summaries.

At the same time, I have some questions and concerns as to how the implementation of the proposed format would work in practice. The proposal appears to focus on technological opportunity without accounting for the level of adoption for certain practices needed for implementation. The machine-actionable package described requires a foundation of practices toward data and code sharing and detailed methodological reporting. This is not commonplace across papers and disciplines, and there is no discussion about the challenges that would arise from an implementation that is only applicable to articles where all associated research objects are shared and the full methodology reported.

Conceptually, I also have a concern about perpetuating a framing where data, code and other open objects are presented as ‘appendices’ or corollary to the ‘article’ rather than as research contributions on their own merit. I would argue that given the current digital platforms available, the argument for appendices or supplementary materials is weak. Objects originating from a research project can be deposited in repositories or other platforms and provided persistent identifiers and associated metadata. These can then be linked to the article. On this basis, the option of having those materials already exists and the current need relates to better systems on the journal side to link to other objects, make those connections visible in the research information ecosystem, and potentially, as discussed in the perspective, bring those into the article environment to enable greater scrutiny and re-analysis. Figure 3B points articles -> repositories in relation to information flow; I would be interested in a flow that leverages open outputs shared in repositories where the direction is repository -> article to enrich the information provided in the article narrative.

The text refers to articles several times as PDFs; this does not account for the fact that many journals use machine-readable formats such as XML. I acknowledge that important contents of an article are not machine readable, but it’d be worth noting that machine-readable article formats are already in place.

In the discussion of risks, it’d be worth noting the risk of the executable article leading to a proliferation of yet more articles, given the low bar to create aggregate datasets and analyses, for example in the form of irrelevant meta-analyses. There have been examples of this, e.g. from the large-scale reuse of the NHANES database: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152

With regard to the aggregation of information across articles, aggregators exist that index content from different journals and other platforms (e.g. Google Scholar). Admittedly this only covers a portion of the information about articles and does not provide executable options, but one challenge relates to the availability/openness of metadata provided for articles. This is something to consider for a system where the potential for large-scale analysis relies on information flows from journals.

I felt that the section on incentives is underdeveloped. The section mentions that CoARA advocates for recognition for different research objects, but this appears at odds with the suggestion to place associated data and code as appendices within an article. There is also no discussion on how the proposed article format aligns with research assessment reform efforts, or how it would facilitate recognition for a greater diversity of research contributions as part of assessment processes.

Competing interests: I work for the Make Data Count initiative, which promotes recognition of data as important research objects.

Peer review 3

Anonymous reviewer

DOI: 10.70744/MetaROR.321.1.rv3

Note for the Author’s manuscript: this review is based on the version V4.

The article proposes a dual-audience framework for restructuring scientific papers so that they serve both human readers and AI systems. The author claims that behavioral and social sciences face a crisis of volume overload and epistemic fragmentation, caused by incomparable stimulus databases, jingle-jangle measurement errors, demographic blindness, and inaccessible raw data, and that current AI tools, while useful for summarization, cannot resolve these problems. The proposed solution consists of two main components: a narrative layer for human readers, organized around the key findings, and a machine-readable structured appendix containing executable containerized environments, ontologically mapped constructs, persistent stimulus identifiers, and data. The article describes how these elements could support automated jingle-jangle auditing, AI-assisted peer review, and continuously updated “living evidence networks” and systematic reviews/meta-analyses.

However, as currently written, the manuscript seems to be more a Perspective article than a research article. The contribution is difficult to isolate from what is already proposed by prior FAIR, open-science, and Barcelona Declaration initiatives. The four original contributions are restatements of existing proposals (or are underdeveloped in the manuscript itself). Methodologically, the engagement with AI systems is not well developed and does not reflect the current state of LLM-based research pipelines. There are also several terminological and practical problems. These issues require revision before the manuscript can be considered for publication.

Major issues

1. The manuscript opens by claiming four original contributions that distinguish it from FAIR, open-science, and reproducibility initiatives. This distinction is unconvincing. The dual-audience paper architecture is described as pairing a front-loaded narrative with a machine-actionable structured appendix to achieve fully reproducible research “rather than treating supplementary materials as an afterthought”. Yet this is already the design motivation of declarations and initiatives such as the Barcelona Declaration. The manuscript needs to demonstrate how its proposal differs from, or improves on, these other initiatives and frameworks in a technically or conceptually meaningful way.

2. The minimum viable structured appendix (Contribution 2) is presented as a novel specification, but its four elements (executable containers, standardized identifiers, ontological mappings, and data) are very similar to existing requirements in some highly reputable journals. Personally, I find the fourth contribution (the living systematic reviews idea) very interesting and important. In this respect, I think that LLMs, the ontological map, and the data can be combined in two ways: first, the text (i.e., the manuscript) and the ontological map can be used by an AI agent to construct a topic-specific RAG, Knowledge-RAG, or the recently proposed LLM wiki; second, the data can be used to update the analysis. I kindly ask the author whether this is the direction the proposed framework intends to take and, if so, to expand on this part. At the same time, this kind of future raises another very important question (perhaps beyond the author’s central focus): who maintains the “Living Evidence Networks”?

3. The manuscript’s central motivation is that manuscripts must be restructured for “AI systems”, but the treatment of those systems is thin and does not reflect current AI developments. The AI field moves very fast, so the architecture and proposal must also be adaptable and flexible. Specifically, the manuscript does not consider that modern document-ingestion pipelines for LLM-based research agents do not use appendices or manuscript text as described; they typically work through Markdown conversion, chunked embeddings, and/or retrieval-augmented generation over parsed text. Also, AI agents could in future theoretically develop scripts that fully reproduce the code behind an article’s methodology, provided the methodology is well described. The AI bottleneck for literature analysis is not only the absence of structured appendices (excluding data) but also PDF-to-text parsing failures, inconsistent table formats, images (which, to date, are the most difficult for a classical LLM to analyze), and citation disambiguation. I think that publishing articles in Markdown format could be a way for the proposed architecture (and for journal publishers) to really advance AI research pipelines.

4. The article proposes that AI perform automatic jingle-jangle audits (Box 1). However, it does not discuss the risk that the AI might hallucinate incorrect ontological relationships between constructs, creating a scientific “false truth” that is even harder to eradicate because it is “validated by the system”. A manual validation step, or maybe a Human in the Loop approach, can help to avoid this problem.

5. The manuscript is single-authored, but the author employs first-person plural: “we propose”, “we argue”, “we introduce”, “our Perspective”, “our framework”, “our goal”. Revise to singular first person (“I propose”, “I argue”) or use the objective/passive voice.

6. The article identifies privacy concerns as a “failure mode” but treats them as a constraint to be noted rather than a challenge to be addressed. Privacy is arguably the most significant barrier to widespread adoption of individual participant data sharing, particularly in clinical, educational, and cross-national research contexts. For the article itself, there are not only privacy issues (as described in Section Implementation and governance options) but also copyright (and economic) issues. How the article is treated or used by LLMs must be disclosed by the publisher and shared with the original author. Not all authors will agree to let their article be ingested by LLMs for future training.

7. The implementation roadmap (Tiers 0-3) is presented without evidence that the proposed tiers are calibrated to actual barriers to adoption. The claim that “most labs can implement Tiers 0-1 now” is asserted rather than documented. Managerial and technological barriers are arguably the main challenges to adoption that we can find in almost all the new proposals. Empirical literature on the determinants of open data adoption, including training barriers, time costs, incentive misalignment (I really suggest highlighting this aspect), and institutional risk aversion, is not cited.

8. Finally, in Figure 1, the block on “demographic blindness” is very important for the analysis of primary data collected via questionnaires and for the field of psychology, but it is not always applicable to other types of analysis. Every type of research presents similar distortions depending on the context. For example, in economic analysis we may encounter the same problem if we do not specify the size of companies in terms of number of employees or revenues. The same applies to comparisons between universities using enrolments or academic staff. Therefore, I believe that the main issue is not solely linked to demographic data (which may perhaps represent the main problem in psychology), but to the lack of contextual data. I suggest updating Figure 1 with a section relating to this concept (perhaps something like ‘Insufficiency of contextual data’) and, particularly for work on primary data/questionnaires, focusing attention on demographic data. This may contribute to the generalizability of the proposed framework.

Minor issues

The manuscript uses “AI system”, “AI tools”, “AI agents”, “LLM-based pipeline”, and “generative AI” in ways that are not always consistent or clearly distinguished. A brief terminological table or definitional paragraph at the outset would reduce ambiguity.

Give all sections conventional research-article section names (e.g., the first section has no “Introduction” heading). Clarify whether this is a Perspective or a Research/Review article. The current section division does not help in this respect, since there is no Method section and the Framework is not preceded by a “Literature Review” or “Background” section. Please improve the article structure.

There are two DOI links that are not currently working (even though I checked manually and the DOIs are correct). Please fix the link errors:

  • Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience.
  • Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., … & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific data, 8(1), 192.

Competing interests: None.

Peer review 4

Stian Soiland-Reyes

DOI: 10.70744/MetaROR.321.1.rv4

This article proposes a framework for publishing academic articles with the dual purpose of reaching both human and AI readers.

The ideas and motivation are in principle well-reasoned; however, this work is hampered by a lack of background research. Notably, the article does not have a good notion of Background or Existing Work; these are mainly mentioned in passing and not contrasted against the proposed framework. The paper also claims that the perspective builds on FAIR and open science principles, but these are largely ignored for the rest of the article.

For instance, the concept of Research Object (https://doi.org/10.1016/j.future.2011.08.004) introduced the idea of machine-readable appendices from 2009 onwards, but this does not seem to be acknowledged in the manuscript. There have been whole conferences named “Beyond The PDF” by initiatives like Force11. The Force11 manifesto is recommended reading. The Research Data Alliance (RDA) has worked on open research practices and FAIR principles since 2013. The GO FAIR initiative is backed by several government initiatives.

Likewise the FAIR principles have argued for machine-actionable metadata and data for two decades. Many research domains such as biodiversity, life sciences and biomedicine are well advanced in their use of FAIR: persistent identifiers, repositories, and ontologies are established best practice as part of publication processes, although arguably not consistently referenced from the corresponding academic articles. Psychology was one of the first fields to encourage reproducible code and pre-registration (see for instance https://doi.org/10.1177/21582440231205390). Several journals, such as GigaScience and RIO Journal, have machine-actionable measures like embedding computational workflows and nanopublications.

Overall the article presents its framework as a new proposal, but I feel by ignoring all previous work in this area of improving scholarly communication to be machine-readable, the genuinely useful proposals from the framework (such as embedding machine-actionable reproducibility checks and audit reports into the publication pipeline) would be undermined. A major revision of the article would need to put the framework in context of the existing work, and suggest how it can be (or already is) implemented.

There is no evaluation provided of the proposed framework, or any suggestion of how its realization could be evaluated. Notably, the manuscript itself is submitted only as a PDF and does not follow its own mantra: there is no machine-readable appendix package attached. Following a review of existing methodologies and background, a revised manuscript could attempt to show the capabilities of the existing techniques, e.g. by including a Frictionless Data package, RO-Crate, or Croissant-ML appendix package. It is not clear from the article why LLMs, which are primarily fed natural-language text, would be better served by structured machine-readable appendices. For instance, StructGPT (https://doi.org/10.48550/arXiv.2305.09645) makes this point.

The writing needs significant improvement; for instance, “jingle-jangle” is mentioned twice before it is explained on page 3. While MetaROR does not require any particular article format, the sections are not structured enough for an academic article, and the text reads more like a blog post. As there is a lack of implementation, it could perhaps be reworked into an Opinion piece, but it would still need to relate its proposal to existing work.

Competing interests: My research group eScience Lab at The University of Manchester first published on the Research Object ideas in: Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Phillip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble (2013): Why Linked Data is Not Enough for Scientists. Future Generation Computer Systems 29(2) https://doi.org/10.1016/j.future.2011.08.004. I am the co-lead of the RO-Crate community.

Author response

DOI: 10.70744/MetaROR.321.1.ar

Note: A revised version of this article is available at https://osf.io/preprints/psyarxiv/c46hs_v6.

Editor

Original comment

Reviewers found the manuscript’s motivation, to restructure scientific publishing to serve both human and AI readers, worthwhile and timely. The reviewers suggested that the manuscript would be improved by engaging more with previous work, including initiatives related to different elements of the proposal such as the FAIR principles, the Research Object concept that included the idea of machine-readable appendices, the Force11 manifesto, and machine-actionable publishing efforts at some journals, each of which has arguably contributed to making scientific articles more machine-readable. Additionally, the manuscript’s treatment of how AI systems actually process scientific literature was seen by the reviewers as needing updating, and would benefit from discussion of risks such as AI hallucination in proposed automated auditing functions.

Reviewers also raised some concerns about practical implementation, including a need for evidence and/or a stronger argument to back the proposed tiered adoption roadmap. It would also be worthwhile for the framework to make explicit its apparent assumption of a certain level of data and code sharing, which remains uncommon in many disciplines. One reviewer, an ethicist, raised concerns about accountability if significant portions of a paper are written for AI consumption, as that could make them less directly interpretable by humans.

Reply

Thank you for organizing the review and the summary report. The revision addresses the three lines of criticism: insufficient engagement with prior machine-actionable publishing work; an underdescribed and outdated treatment of how AI systems process scientific literature; and an implementation roadmap that overclaimed feasibility and elided data-sharing and accountability constraints.

Per-reviewer replies follow with section pointers and quoted passages; the principal changes are summarized here:

  • Reframed with a formal Introduction and an extensive background on FAIR, FORCE11/Beyond the PDF, the Research Object architecture, RO-Crate, Frictionless Data, GigaScience, eLife executable articles, EBRAINS Live Papers, RIO nanopublications, the Research Data Alliance, GO FAIR, and the Barcelona Declaration. The contribution is now clarified to focus on AI-assistance and the missing behavioral-science semantic layer.
  • Replaced “appendices” with “research-object packages” throughout; data, code, materials, and schemas are framed as integral research outputs rather than annexes.
  • Generalized “demographic blindness” to “contextual blindness,” with demographic underreporting as the behavioral-science instance, and revised Figure 1 to match.
  • Rewrote the LLM-pipeline discussion to specify how current systems segment, embed, retrieve, and call tools, and identified the actual bottlenecks—PDF parsing, table extraction, figure-panel parsing, citation disambiguation—that structured packages address.
  • Calibrated the tier roadmap against current adoption: 14% raw-data and 8.5% script accessibility in a recent psychology audit, with time, training, privacy, standards, and credit identified as binding constraints.
  • Stated explicitly that the framework does not require universal open release of raw data; specified an auditable data-access model from synthetic data through controlled access to remote execution.
  • Added safeguards against LLM hallucination in construct audits: deterministic checks separated from LLM inferences; model and prompt versions logged; source spans preserved; human adjudication required before any finding feeds back into shared ontologies.
  • Disaggregated “LLM ingestion” into four distinct uses (retrieval/indexing, retrieval-augmented generation, model training, log retention), each with separate disclosure obligations.
  • Made human authorship and accountability unambiguous; required provenance logging for all AI-assisted construction; added an explicit recalibrated-expertise requirement for readers and reviewers.
  • Added Table 1 mapping deployed infrastructures (RRIDs, RO-Crate, Frictionless Data, Croissant, Databrary, eLife executable articles, Code Ocean, EBRAINS Live Papers, nanopublications, Schol-AR) to the proposed tiers and identifying what each leaves unspecified.
  • Revised first-person plural to singular throughout.
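
As an illustration only, the hallucination safeguard listed above (deterministic checks kept apart from LLM inferences, model and prompt versions logged, source spans preserved, human adjudication before any ontology update) could be sketched as a small gate function. The class, field names, and example values below are hypothetical and are not taken from the revised manuscript.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AuditFinding:
    """One construct-audit finding (hypothetical structure)."""
    claim: str                            # e.g. a proposed link between two constructs
    source: str                           # "deterministic" or "llm"
    source_span: str                      # text the finding is based on
    model_version: Optional[str] = None   # logged for LLM inferences
    prompt_version: Optional[str] = None  # logged for LLM inferences
    human_approved: bool = False          # set only after human adjudication


def can_update_ontology(finding: AuditFinding) -> bool:
    """Return True only if the finding may feed back into a shared ontology."""
    # An LLM inference without logged model and prompt versions is rejected.
    if finding.source == "llm" and not (finding.model_version and finding.prompt_version):
        return False
    # A finding that cannot be traced to a source span is rejected.
    if not finding.source_span:
        return False
    # Every finding, deterministic or not, needs human sign-off.
    return finding.human_approved


draft = AuditFinding(
    claim="scale A and scale B measure the same construct",
    source="llm",
    source_span="Methods, paragraph 2",
    model_version="model-v1",
    prompt_version="prompt-v3",
)
approved = replace(draft, human_approved=True)
```

In this sketch `can_update_ontology(draft)` is false until a human sets `human_approved`, and an LLM finding with no logged versions stays blocked even after approval.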

Reviewer 1: David Resnik

Comment 1

This paper argues for a new bifurcated model of scientific papers: a human-readable section and an AI-readable section. The rationale for this is to better enable AI systems to synthesize the scientific literature, to facilitate peer review, and to promote more rigor in science. The idea is original, important, and merits further discussion and debate. I would like to raise some philosophical and ethical concerns with the idea that the authors do not adequately address.

Reply

Thank you for the thoughtful suggestions and questions. The original framing did imply a bifurcation in which some content might be optimized for machines at the expense of human interpretability. The revision removes that unintended implication. The structured component is no longer described as an AI-only section but as a layered, human-inspectable, executable research-object package whose explicit semantics aid both human readers and AI systems. §A dual-audience architecture for the AI era now states:

“The two audiences engage the same artifacts differently: machines parse code, schemas, ontology mappings, and trial-level tables, while any reader can inspect them directly—their semantics are explicit and their behavior executable, rather than asserted in prose that cannot be run.”

Inspectability, human responsibility, and human review are now central features of the architecture rather than safeguards added to it.

Comment 2

Who would take responsibility for the paper? The human? The AI? It seems that large sections of the paper humans could not be responsible for because it is all written by AIs so it will be understandable to AIs. But if humans can’t really take responsibility for what has been done, this creates a very dangerous situation. The AI sections of the paper could have dangerous information, for example, to make bioweapons, that is not understandable to humans but is to AIs, but the humans could do nothing about it. “Human in the loop” is a big theme in AI ethics but what the authors have proposed seems to take humans dangerously out of the loop.

Reply

The bioweapons framing suggests that a research-object package full of executable code and dense schemas could in principle conceal content from the very humans asked to vouch for it. The revision rejects that possibility on two levels.

First, the structured package is not an autonomous AI-authored layer. §A dual-audience architecture for the AI era now states:

“Writing for AI systems is therefore not to route content past human inspection but a way to expose structure that prose tends to obscure. Authors carry the same accountability for every component of the package as for the narrative.”

And §Papers as queryable research environments now states the position cleanly:

“Machines have no authorship standing; responsibility for every component rests with the humans who verify and submit it.”

Provenance logging applies to all machine-assisted construction:

“Any AI-assisted construction—code drafting, ontology mapping, data annotation—is logged in the same provenance layer with model and prompt versions, so that authors and reviewers can distinguish what machines drafted from what humans verified.”

Second—and this is the point the original missed—inspectability is not free. A package of containers, schemas, and ontology mappings is more verifiable than a methods paragraph only if reviewers and readers have the competencies to inspect it. The revision states this directly. §A dual-audience architecture for the AI era now reads:

“as generative AI enters reading, review, and synthesis, expertise must be recalibrated rather than bypassed—researchers need the skills to direct AI systems, discern errors in their outputs, and check machine-generated summaries or analyses against domain standards.”

The framework therefore presupposes the building of those reviewer competencies, not their substitution by machines. Provenance logs give reviewers a place to look; the recalibration requirement makes clear that human-in-the-loop oversight depends on humans whose loop has been updated.

Comment 3

Since authorship and responsibility go hand in hand, this of course raises major authorship questions.

Reply

The original treatment was indirect. §Papers as queryable research environments now states the position cleanly:

“Machines have no authorship standing; responsibility for every component rests with the humans who verify and submit it.”

AI assistance is logged separately rather than absorbed silently into authorship:

“Any AI-assisted construction—code drafting, ontology mapping, data annotation—is logged in the same provenance layer with model and prompt versions, so that authors and reviewers can distinguish what machines drafted from what humans verified.”

Comment 4

It seems it also raises epistemological issues about knowledge that transcends human understanding (can this even be considered knowledge?) and the democratization of science. Making science even more technical than it already is seems to make it even less democratic.

Reply

The worry as posed presupposes that machine-readable structure is opaque to humans, so that science would “transcend” human understanding when it became machine-tractable. With respect, the framework runs in the opposite direction.

A typed schema, an executable container, and an ontology mapping are not less inspectable than a methods paragraph; they are more so. A schema is exhaustive about every variable, its type, its units, and its allowed values. A container is exhaustive about software versions and runtime conditions. An ontology mapping is explicit about which constructs an instrument is intended to measure. A prose methods section, by contrast, is a compressed, interpretive narration in which assumptions are routinely implicit and details routinely lost. The structured layer makes those assumptions explicit and contestable. The form of knowledge that should worry us is the kind that cannot be checked at all because the necessary materials are absent—not the kind that becomes more checkable when more of it is exposed.
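
To make the contrast with prose concrete, a typed schema of the kind described above can be written so that both a reader and a program can check data against it. The following is a minimal sketch under stated assumptions: the variable names, units, and allowed values are hypothetical, and plain Python stands in for whatever schema standard a given package would use.

```python
# Hypothetical schema: every variable states its type, unit, and allowed range or values.
SCHEMA = {
    "rt_ms": {"type": float, "unit": "ms", "min": 0.0},
    "condition": {"type": str, "allowed": {"congruent", "incongruent"}},
}


def validate_row(row, schema):
    """Return a list of human-readable problems found in one trial-level row."""
    errors = []
    for name, spec in schema.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            errors.append(f"{name}: below minimum {spec['min']}")
        if "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{name}: value not allowed")
    return errors


ok_row = {"rt_ms": 512.0, "condition": "congruent"}
bad_row = {"rt_ms": -5.0, "condition": "neutral"}
```

Here `validate_row(ok_row, SCHEMA)` returns an empty list, while `bad_row` is flagged once for the negative reaction time and once for the unlisted condition; the same assumptions in a methods paragraph would have to be inferred by the reader.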

The democratization concern is valid where it applies (to the requirement of computational and statistical literacy for reading at depth), but it does not turn on opacity. It turns on whether the framework lowers or raises the cost of inspecting evidence. By design it lowers it: a reader who today cannot reproduce a result because raw data and code are inaccessible can, under the proposed architecture, run the analysis at minimum and probe its assumptions at most. §Inverting the narrative now states:

“Front-loading does not substitute summary for substance; it is the top layer of a drill-down architecture—claim, evidence, scope, methods, code, data—through which readers can probe to whatever depth their question demands.”

And the conclusion adds:

“AI systems make evidence easier to query, but their outputs remain interpretations requiring human judgment, theoretical context, and accountability.”

The framework therefore narrows, rather than widens, the gap between stated claims and inspectable evidence.

Comment 5

It also seems that this approach might be more applicable to some fields than to others. For example, computer science and other highly technical disciplines, but not the humanities (philosophy? law?) and maybe not even social science. This needs to be addressed.

Reply

A fair correction. The original overclaimed for empirical, data-rich science. The revision separates the full executable form of the architecture from its underlying principle. The Introduction now states:

“The architecture’s full executable form suits empirical, data-rich fields most directly; in interpretive disciplines the relevant research objects shift to corpora, editions, annotation schemes, and the provenance of coding or editorial decisions, but the underlying principle—claims linked to their inspectable evidence, methods, and revision history—holds across fields, implemented through whatever objects each discipline treats as its evidence.”

§Conclusions and outlook generalizes the point:

“The scaffold is discipline-agnostic. Psychology is the stress test because its core objects—stimulus sets, tasks, and jingle–jangle-prone constructs—are unusually tangled; other domains can substitute their own reagents, instruments, specimens, datasets—or, in interpretive fields, corpora, editions, translations, and annotation provenance.”

The strongest claims are confined to empirical, data-rich fields; the broader principle of linking claims to inspectable evidence is preserved as a generalization.

Comment 6

Deskilling of humans is an issue too. If we get the dumbed down version, we get dumber.

Reply

The deskilling worry presupposes that the dominant move is replacing depth with summary. The revision establishes that the front-loaded layer is the entry to depth, not a substitute for it. §Inverting the narrative now reads:

“Front-loading does not substitute summary for substance; it is the top layer of a drill-down architecture—claim, evidence, scope, methods, code, data—through which readers can probe to whatever depth their question demands.”

The relevant comparison is not between expert reading of unstructured prose and dumbed-down AI summaries; it is between current practice—in which most readers cannot inspect what they take on trust—and a future in which inspection is at least possible for those equipped to do it.

Where deskilling pressure does come into play is on the reviewer and reader side: AI-assisted reading and review will not replace expertise but will demand a different kind of it. §A dual-audience architecture for the AI era states:

“as generative AI enters reading, review, and synthesis, expertise must be recalibrated rather than bypassed—researchers need the skills to direct AI systems, discern errors in their outputs, and check machine-generated summaries or analyses against domain standards.”

Comment 7

See Resnik, DB, Hosseini, M & Hauswald, R. Autonomous artificial intelligence, scientific research, and human values. AI Ethics 6, 141 (2026). https://doi.org/10.1007/s43681-025-00908-0. This article touches on human in the loop issues and related issues with respect to AI agents, which raise similar concerns.

Reply

Cited. The reference now anchors the discussion of human oversight, accountability, and recalibrated expertise in §A dual-audience architecture for the AI era, supporting the position that human-in-the-loop oversight requires both retained authorial responsibility and the ongoing development of reviewer competencies for AI-assisted reading, review, and synthesis.

Reviewer 2: Iratxe Puebla

Comment 1

The manuscript proposes a new format for research articles that includes front-loading summaries of the work for human readers and executable appendices that include the data, code and other research artefacts necessary so that machines can execute the analyses reported in the article.

Reply

Thank you for the careful engagement. Before I address the specific comments below, let me begin with the terminology changes: “Appendices” is replaced by “research-object packages” throughout. The Abstract reads:

“I propose restructuring scientific papers for dual audiences: front-loaded narratives for time-pressed human readers, paired with research-object packages containing executable code, semantic annotations, and tidy trial-level data.”

§Papers as queryable research environments opens with: “Front-loading must be paired with a citable research-object package.”

Comment 2

There are a number of things that I like about the proposal such as leveraging new digital technologies for article publishing, and the focus on maximizing the reproducibility and openness of scholarly work. I like the idea of front-loading the article with information to make it easier for readers to grasp content and decide whether it is relevant to them. I could imagine something similar to the eLife summary, that includes structured designations of rigor and novelty to assist with consistent assessment across articles. I liked the mention in the perspective about a clear signal about the limitations of the work – I view this as a key trust signal valuable to readers and missed some further elaboration of what that would look like.

Reply

Thank you—the trust-signal framing is an interesting way to think about what the front-loaded layer should deliver, and the original underspecified it. §Inverting the narrative now describes what the layer must contain:

“The opening paragraphs should answer, in order—What did you discover? Why does it matter? How does it change our understanding?—and specify the evidence type (confirmatory, exploratory, descriptive, simulation-based, or causal), the population and setting in which the claim holds, the moderators or assumptions most likely to overturn it, and direct links to the scripts, containers, data tables, and robustness checks that reproduce or probe the headline numbers.”

The trust signal is therefore operational: front-loading reports not only the result but its evidential status, boundary conditions, and reproducibility path.

Comment 3

I also like the idea of an AI‑generated audit report for peer reviewers. Many journals already apply AI-based checks on papers, so expanding that and making it available for papers that proceed to peer review would add transparency on journal processes – and may prevent instances of reviewers adding the papers to an AI tool (against journal policy) to generate summaries.

Reply

§Restructuring peer review for executable verification reframes audit reports as inputs to human reviewers, not substitutes:

“Human reviewers receive both the manuscript and an AI-generated audit report and can direct their attention to interpretive claims, novelty, and theoretical significance—judgments that require domain expertise and that automated checks cannot make.”

§A roadmap for federated stewardship adds the governance issue raised here about reviewers using external AI tools:

“Platforms hosting AI–paper interactions must protect the privacy of user queries and interaction logs. Second, they must distinguish policies for confidential peer-review materials from those for published content, and avoid sending unpublished manuscripts to external commercial AI systems that may retain proprietary data.”

Sanctioned audit reports thus replace, rather than ride alongside, the ad hoc external use that current journal policies cannot effectively prohibit.

Comment 4

At the same time, I have some questions and concerns as to how the implementation of the proposed format would work in practice. The proposal appears to focus on technological opportunity without accounting for the level of adoption for certain practices needed for implementation. The machine-actionable package described requires a foundation of practices toward data and code sharing and detailed methodological reporting. This is not commonplace across papers and disciplines, and there is no discussion about the challenges that would arise from an implementation that is only applicable to articles where all associated research objects are shared and the full methodology reported.

Reply

The original was too optimistic. §A dual-audience architecture for the AI era now distinguishes technical feasibility from institutional adoption and reports the empirical baseline:

“Tiers 0–1 use infrastructure that is technically mature and common in some fields but unevenly normalized in behavioral science… The lower tiers themselves are technically available but not institutionally normalized: a 2022 audit of empirical psychology articles found immediately accessible raw data in 14% of cases and analysis scripts in 8.5%, with time costs, limited training, privacy exposure, uncertain standards, and absent credit for curation as the main constraints.”

The same section drops the universal-openness premise:

“The framework does not require universal public release of raw data. Instead, each article should expose the most reusable package compatible with ethical, legal, and practical constraints: open code and metadata at minimum, plus an auditable data-access model ranging from synthetic or de-identified demonstration data, through controlled-access repositories, to remote-execution interfaces in which reviewers can run code against protected data without downloading them.”

Implementation is therefore framed as a staged, constraint-sensitive adoption problem, not a presumed technological leap.

Comment 5

Conceptually, I also have a concern about perpetuating a framing where data, code and other open objects are presented as ‘appendices’ or corollary to the ‘article’ rather than as research contributions on their own merit. I would argue that given the current digital platforms available, the argument for appendices or supplementary materials is weak. Objects originating from a research project can be deposited in repositories or other platforms and provided persistent identifiers and associated metadata. These can then be linked to the article. On this basis, the option of having those materials already exists and the current need relates to better systems on the journal side to link to other objects, make those connections visible in the research information ecosystem, and potentially, as discussed in the perspective, bring those into the article environment to enable greater scrutiny and re-analysis. Figure 3B points articles -> repositories in relation to information flow, I would be interested in a flow that leverages open outputs shared in repositories where the direction is repository -> article to enrich the information provided in the article narrative.

Reply

This was the most consequential of the conceptual corrections, and it shaped multiple sections of the revision. “Appendices” is replaced as indicated above. §Papers as queryable research environments treats data, code, materials, and schemas as integral outputs:

“Provenance metadata binds these components to the article and to one another, making each object independently citable and the package as a whole auditable.”

Figure 3 is now bidirectional, as you suggested:

“Articles and their research-object packages contribute effect sizes, moderators, and quality indicators into domain-specific repositories, and repository-curated updates—new estimates, retraction notices—flow back to each article, so that information moves in both directions rather than only from article to repository.”

The incentives section credits these objects directly:

“Institutions should revise hiring, promotion, and tenure criteria to credit the contributions a research-object package makes visible: curated datasets, executable code, validated stimuli and measures, ontological mappings, and the systematic consensus-building that yields shared terminologies and methodological standards.”

The article is now framed as one component of a linked research ecosystem rather than the sole scholarly product to which other objects are attached.

Comment 6

The text refers to articles several times as PDFs; this does not account for the fact that many journals use formats such as XML that are machine readable. I acknowledge that important contents of an article are not machine readable, but it’d be worth noting that there are already formats in place for articles that are machine readable.

Reply

A fair correction. The Introduction now reads:

“Some publishers already provide JATS XML or HTML serializations alongside the rendered PDF, which reduces parsing errors at the article-text level; the more consequential gap is that the research objects on which verification depends—data, code, stimuli, protocols, schemas, and provenance—are often absent or weakly linked.”

§A roadmap for federated stewardship adds an implementation recommendation:

“Publishers should also treat machine-readable serializations of the article itself—JATS XML and Markdown alongside the rendered PDF—as standard deliverables rather than typesetting byproducts; this relatively low-cost change removes the parsing layer on which most AI ingestion errors are concentrated.”

The argument is preserved—typed, linked, executable research objects are the deeper bottleneck—but PDF is not the only relevant article format.

Comment 7

In the discussion of risks, it’d be worth noting the risk of the executable article leading to a proliferation of yet more articles given the low bar to create aggregate datasets & analysis, for example, in the form of irrelevant meta-analyses? There have been examples around this, e.g. from the large-scale reuse of the NHANES database: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152

Reply

An important concern (and reality) and a useful pointer. §Navigating implementation risks and inequities now states:

“A related risk is the proliferation of formulaic secondary literatures. AI-ready public datasets can be mined into single-factor association papers that ignore interactions, choose subsets selectively, skip multiple-testing correction, and—at the extreme—feed paper-mill production lines; a recent analysis of NHANES-derived publications documents this pattern at scale.”

The remedy follows:

“preregistration of confirmatory reuse, principled justification of subset selection with multiple-testing correction, reuse identifiers for high-value datasets, and editorial screening for formulaic designs. Living evidence networks should label exploratory reuse separately from confirmatory evidence and weight syntheses by design quality rather than publication count.”

The infrastructure that makes reuse cheap should also make reuse auditable.

Comment 8

With regard to the aggregation of information across articles, aggregators exist that index content from different journals and other platforms (e.g. Google Scholar). Admittedly this only covers a portion of the information about articles and does not provide executable options, but one challenge relates to the availability/openness of metadata provided for articles. This is something to consider for a system where the potential for large-scale analysis relies on information flows from journals.

Reply

§Constructing living evidence networks distinguishes living networks from general discovery aggregators:

“This indexing function is distinct from general discovery aggregators such as Google Scholar or OpenAlex, which do not generally expose full typed provenance and retraction metadata.”

Federated stewardship is specified:

“Journals provide article-object metadata; repositories host versioned executable research objects; domain societies or designated registry boards maintain construct and effect-size records; curators adjudicate contested mappings and retractions; and aggregators index typed links among articles, data, code, and synthesis nodes.”

The implementation recommendation invokes existing relationship metadata so the connections are not built from scratch:

“They should likewise expose typed bidirectional links between each article and the objects in its package—datasets, code, stimuli, protocols, preregistrations—using existing Crossref and DataCite relationship metadata so that the connections are visible to both readers and machines.”

Large-scale synthesis depends on open, typed, bidirectional metadata linking articles to their research objects and downstream syntheses, not merely on indexing articles.
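
To make "typed, bidirectional" concrete, here is a minimal sketch of how one-way relationship assertions could be closed under their inverses. It assumes a small subset of inverse pairs from the DataCite `relationType` vocabulary; the `TypedLink` class, the helper names, and the placeholder identifiers are hypothetical and are not part of the manuscript or of any registry API.

```python
from dataclasses import dataclass

# A subset of inverse pairs from the DataCite relationType vocabulary.
INVERSE = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
    "Cites": "IsCitedBy",
    "IsCitedBy": "Cites",
    "IsDerivedFrom": "IsSourceOf",
    "IsSourceOf": "IsDerivedFrom",
}


@dataclass(frozen=True)
class TypedLink:
    """One directed, typed relation between two identified objects (hypothetical)."""
    source: str    # identifier of the object asserting the relation
    relation: str  # a key of INVERSE
    target: str    # identifier of the related object


def close_under_inverse(links):
    """Return the given links plus the inverse of each, without duplicates."""
    closed = set(links)
    for link in links:
        closed.add(TypedLink(link.target, INVERSE[link.relation], link.source))
    return closed


def links_from(identifier, links):
    """List (relation, target) pairs asserted from one object, sorted for display."""
    return sorted((l.relation, l.target) for l in links if l.source == identifier)


# Placeholder identifiers, for illustration only.
deposited = [
    TypedLink("doi:10.0000/dataset", "IsSupplementTo", "doi:10.0000/article"),
    TypedLink("doi:10.0000/code", "IsSupplementTo", "doi:10.0000/article"),
]
graph = close_under_inverse(deposited)
print(links_from("doi:10.0000/article", graph))
```

In this sketch the repository deposits assert only the object-to-article direction; the closure step supplies the article-to-object direction so that a reader starting from the article can reach every object in its package.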

Comment 9

I felt that the section on incentives is underdeveloped. The section mentions that CoARA advocates for recognition for different research objects, but this appears at odds with the suggestion to place associated data and code as appendices within an article. There is also no discussion on how the proposed article format aligns with research assessment reform efforts, or how it would facilitate recognition for a greater diversity of research contributions as part of assessment processes.

Reply

You are right. §Institutional incentives and career reform now ties the framework to research-assessment reform directly:

“This aligns with the Coalition for Advancing Research Assessment (CoARA) and the Declaration on Research Assessment (DORA), which call for assessment of diverse outputs beyond journal impact factors, and with the CRediT taxonomy, which already provides standardized roles—data curation, software, resources, validation, visualization—through which contributors receive explicit credit for the work of creating those objects, not only for co-authorship of the article.”

Object-level metrics extend this:

“Beyond explicit credit, citation and reuse metrics for data and software render object-level contributions measurable, so that assessment systems can weight evidence of impact at the object level rather than article authorship alone.”

Reviewer 3

Comment 1

Note for the Author’s manuscript: this review is based on the version V4. The article proposes a dual-audience framework for restructuring scientific papers so that they serve both human readers and AI systems. The author claims that behavioral and social sciences face a crisis of volume overload and epistemic fragmentation, caused by incomparable stimulus databases, jingle-jangle measurement errors, demographic blindness, and inaccessible raw data, and that current AI tools, while useful for summarization, cannot resolve these problems. The proposed solution consists of two main components: a narrative layer for human readers, organized around the key findings and a machine-readable structured appendix containing executable containerized environments, ontologically mapped constructs, persistent stimulus identifiers, and data. The article describes how these elements could support automated jingle-jangle auditing, AI-assisted peer review, and continuously updated “living evidence networks” and systematic reviews/meta-analyses. However, as currently written, the manuscript seems to be more a Perspective article than a research article. The contribution is difficult to isolate from what is already proposed by prior FAIR, open-science, and Barcelona declaration. The four original contributions are restatements of existing proposals (or underdeveloped in the manuscript itself). Methodologically, the engagement with AI systems is not well expanded and does not reflect the current state of LLM-based research pipelines. There are also several terminological and practical problems. These issues require revision before the manuscript can be considered for publication.

Reply

Thank you for the thoughtful review. You are correct that the manuscript is explicitly a Perspective. The Introduction situates the contribution against FAIR, FORCE11, Research Objects, RO-Crate, GigaScience, RIO, EBRAINS, RDA, GO FAIR, and the Barcelona Declaration; the LLM-pipeline discussion is rewritten; and the terminology—including “AI system,” “AI agent,” “AI tools,” “Generative AI,” and the replacement of “appendices” with “research-object packages”—is fixed throughout. The framework’s claim is now narrower and, I believe, more defensible: not a new platform or standard per se, but a specification of the role of AI and the missing semantic layer that existing infrastructure does not by itself supply for behavioral science.

Comment 2

1. The manuscript opens by claiming four original contributions that distinguish it from FAIR, open-science, and reproducibility initiatives. This distinction is unconvincing. The dual-audience paper architecture is described as pairing a front-loaded narrative with a machine-actionable structured appendix to have full reproducible research “rather than treating supplementary materials as an afterthought”. Yet this is the design motivation of declarations and initiatives such as the Barcelona Declaration. The manuscript needs to demonstrate how its proposal differs from, or improves on, these other initiatives and frameworks in any technically or conceptually meaningful way.

Reply

The revised Introduction acknowledges the prior infrastructure work:

“This Perspective extends two decades of work on machine-actionable scholarship. Data, metadata, tools, and workflows must be findable, accessible, interoperable, and reusable (FAIR). The FORCE11 community and the ‘Beyond the PDF’ movement argued that articles should treat data, software, and protocols as integral research objects rather than appendages; the Research Object (RO) architecture formalized how to bundle them with provenance and attribution.”

The contribution is then specified:

“Packaging standards provide a substrate for meaning but do not, by themselves, supply the domain-specific semantic layer. RO-Crate can bundle a stimulus set; it does not specify which persistent identifier scheme applies to affective images, which normative dimensions must be recorded, or whether the construct labeled ‘executive function’ in one container refers to the same phenomenon as in another.”

The Perspective therefore claims a domain specification rather than a packaging standard.

Comment 3

2. The minimum viable structured appendix (Contribution 2) is presented as a novel specification, but its four elements (executable containers, standardized identifiers, ontological mappings, and data) are very similar to existing requirements in some highly reputational journals. Personally, I find the fourth contribution (the living systematic reviews idea) very interesting and important. However, in this sense, I think that the use of both LLMs, ontological map and data, can be used in two ways: the text information (i.e., the manuscript) and ontological map can be used in an AI agent to construct a specific RAG, Knowledge-RAG or the recent proposed LLM wiki on the specific topic, while the second one (data) can be used to update the analysis. I kindly ask the author if this is the direction that the proposed framework wants to propose and, if so, if he can expand better this part. At the same time, this kind of future raise another very important question (that is maybe beyond the author central focus): who maintains the “Living Evidence Networks”?

Reply

The two-layer characterization captures the right distinction. §Constructing living evidence networks now separates them explicitly:

“Such networks have two coupled layers: a knowledge layer, in which article text, claims, citations, methods, and ontological mappings support retrieval-augmented or graph-based queries over what has been claimed, disputed, replicated, or retracted; and a synthesis layer, in which effect estimates, IPD summaries, contextual moderators, preregistration status, data and code versions, and quality indicators feed living meta-analyses.”

This is the direction described in the comment: RAG and graph-based retrieval over the text and ontology layer; updated synthesis over the data and effect-size layer. On governance:

“Maintaining living evidence networks requires federated stewardship. Journals provide article-object metadata; repositories host versioned executable research objects; domain societies or designated registry boards maintain construct and effect-size records; curators adjudicate contested mappings and retractions; and aggregators index typed links among articles, data, code, and synthesis nodes.”

Each network requires a designated steward “empowered to issue versioned releases, adjudicate mappings, and log decisions with revision history.” Living networks are framed as a federated stewardship problem, not a technical consequence of better article structure.
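
As an illustration of the synthesis layer, the following minimal sketch recomputes a pooled estimate whenever a record is added or flagged as retracted. It assumes a fixed-effect, inverse-variance model chosen purely for brevity; the manuscript does not prescribe a pooling model, and a real living meta-analysis would more likely use a random-effects model with moderators. The `EffectRecord` fields and the function name are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class EffectRecord:
    """One study-level effect-size record in the synthesis layer (hypothetical schema)."""
    study_id: str
    estimate: float          # e.g. a standardized mean difference
    variance: float          # sampling variance of the estimate
    retracted: bool = False  # set by the steward when a retraction notice arrives


def pooled_fixed_effect(records):
    """Inverse-variance pooled estimate and standard error over non-retracted records."""
    active = [r for r in records if not r.retracted]
    if not active:
        raise ValueError("no non-retracted records to pool")
    weights = [1.0 / r.variance for r in active]
    total = sum(weights)
    estimate = sum(w * r.estimate for w, r in zip(weights, active)) / total
    return estimate, math.sqrt(1.0 / total)


# Illustrative numbers only, not results from any study.
records = [
    EffectRecord("study-a", estimate=0.2, variance=0.04),
    EffectRecord("study-b", estimate=0.4, variance=0.04),
]
before = pooled_fixed_effect(records)

records[1].retracted = True  # a retraction propagates on the next recomputation
after = pooled_fixed_effect(records)
```

The design point is that the pooled value is never stored as a fact: it is always derived from the current, versioned set of records, so a correction or retraction changes the synthesis without anyone editing the synthesis itself.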

Comment 4

3. The manuscript’s central motivation is that manuscripts must be restructured for “AI systems” but the treatment of those systems is thin and does not reflect AI developments. AI world is very fast, so the architecture and proposal also must be adaptable and flexible in this sense. Specifically, the manuscript does not consider that modern document-ingestion pipelines for LLM-based research agents do not use appendices or manuscript text as described; they typically work through Markdown conversion, chunked embeddings, and/or retrieval-augmented generation over parsed text. Also, AI Agents in future could theoretically develop scripts to fully reproduce the code regarding the methodology part of the article if it is well described. The AI bottleneck for literature analysis is not only the absence of structured appendices (excluding data) but rather PDF-to-text parsing failures, different table formats, images (that, up to date, are the most difficult to analyze for a classical LLM), and citation disambiguation. I think that the Markdown format of the articles can be the possible way for the proposed architecture (and for Journal publishers) to really push on AI research pipelines.

Reply

The Introduction now describes how current systems actually ingest articles:

“Currently, such pipelines convert source files into segmented text, index those segments with vector and keyword retrieval, and call tools for tasks such as code execution under LLM-orchestrated planning.”

The specific bottlenecks are stated in turn:

“Each step introduces potential errors: PDF parsing loses reading order, equations, and metadata; table extraction remains brittle; figure-panel parsing and citation disambiguation likewise rely on inference from rendered pages rather than typed structure.”

On agent-based reconstruction of methods, §Restructuring peer review for executable verification now states:

“Such agent-generated reconstruction is a useful fallback when no original code exists, but it is not (currently) a substitute for the executable workflow an author can package with the article.”

Markdown and JATS XML are added to the implementation recommendations as the article-level deliverables that remove most of the parsing layer where AI ingestion errors concentrate:

“Publishers should also treat machine-readable serializations of the article itself—JATS XML and Markdown alongside the rendered PDF—as standard deliverables rather than typesetting byproducts.”
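
A minimal sketch of why a Markdown serialization simplifies ingestion: with explicit headings, segmentation becomes a deterministic split rather than an inference from rendered pages. The sketch covers only heading-based segmentation and a naive keyword ranking; vector retrieval and tool calls are omitted, and the function names are hypothetical rather than taken from any existing pipeline.

```python
import re
from collections import Counter


def segment_markdown(text):
    """Split Markdown into (heading, body) segments at ATX headings ('#' to '######')."""
    segments, heading, body = [], "", []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line):
            # Close the previous segment if it has a heading or non-blank body.
            if heading or any(s.strip() for s in body):
                segments.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or any(s.strip() for s in body):
        segments.append((heading, "\n".join(body).strip()))
    return segments


def tokens(s):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def keyword_search(segments, query, k=3):
    """Rank segment headings by how often the query terms occur in each segment."""
    query_terms = set(tokens(query))
    scored = []
    for heading, body in segments:
        counts = Counter(tokens(heading + " " + body))
        score = sum(counts[t] for t in query_terms)
        if score:
            scored.append((score, heading))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
    return [heading for _, heading in scored[:k]]


article = (
    "# Methods\n"
    "We ran a Stroop task.\n"
    "# Results\n"
    "The Stroop effect was reliable; Stroop interference grew with load.\n"
)
segments = segment_markdown(article)
hits = keyword_search(segments, "stroop")
```

The same split applied to text extracted from a PDF would first have to guess where sections begin, which is the error source the quoted recommendation aims to remove.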

Comment 5

4. The article proposes that AI perform automatic jingle-jangle audits (Box 1). However, it does not discuss the risk that the AI might hallucinate incorrect ontological relationships between constructs, creating a scientific “false truth” that is even harder to eradicate because it is “validated by the system” . A manual validation step, or maybe a Human in the Loop approach, can help to avoid this problem.

Reply

The original treated AI audit outputs as if they were authoritative; they are not. Box 1 now states:

“The audits screen for construct redundancy; they do not validate construct identity, which remains a theoretical judgment.”

The hallucination risk is addressed directly:

“Curator reports therefore separate deterministic checks from LLM-generated inferences, log model and prompt versions, preserve source spans and code or data references, and require human adjudication before any finding feeds back into shared ontologies or downstream syntheses.”

Human adjudication is the gating step before any machine-suggested construct relationship enters shared infrastructure.
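
The gating logic can be sketched as a small record type and a release filter. This is an illustrative sketch only: the `Finding` fields, the `"deterministic"`/`"llm"` labels, and the function names are hypothetical, chosen to mirror the requirements quoted above (separate provenance, logged model and prompt versions, preserved evidence references, and human adjudication before release).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """One entry in a curator report (hypothetical schema)."""
    claim: str
    origin: str                            # "deterministic" or "llm"
    evidence_ref: str                      # source span, code reference, or data reference
    model_version: Optional[str] = None    # required when origin == "llm"
    prompt_version: Optional[str] = None   # required when origin == "llm"
    human_decision: Optional[str] = None   # "accepted", "rejected", or None if pending


def check_provenance(finding):
    """Reject records that lack the provenance the report is supposed to log."""
    if finding.origin not in ("deterministic", "llm"):
        raise ValueError(f"unknown origin: {finding.origin!r}")
    if not finding.evidence_ref:
        raise ValueError("every finding must preserve an evidence reference")
    if finding.origin == "llm" and not (finding.model_version and finding.prompt_version):
        raise ValueError("LLM-generated findings must log model and prompt versions")


def releasable(findings):
    """Findings allowed to feed shared ontologies: human-accepted ones only."""
    for finding in findings:
        check_provenance(finding)
    return [f for f in findings if f.human_decision == "accepted"]


report = [
    Finding("Scales A and B correlate at r > .90 in dataset D",
            origin="deterministic", evidence_ref="analysis/overlap.py",
            human_decision="accepted"),
    Finding("Scale A and Scale B measure the same construct",
            origin="llm", evidence_ref="article section 2, paragraph 3",
            model_version="model-x", prompt_version="p1"),  # still pending
]
approved = releasable(report)
```

Note that the filter applies to deterministic and LLM-generated findings alike: in this sketch nothing enters shared infrastructure on machine authority alone, and the pending LLM inference simply stays in the report until a curator decides.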

Comment 6

5. The manuscript is single-authored, but the author employs first-person plural: “we propose” , “we argue” , “we introduce” , “our Perspective”, “our framework” , “our goal” . Revise to singular first person (“I propose” , “I argue”) or use the objective/passive voice.

Reply

Fixed throughout. The Abstract now opens with “I propose…”; the Introduction with “I argue…”; remaining first-person plural is replaced by singular, by the framework as the agent (“the framework specifies…”), or by passive constructions where the actor is unimportant.

Comment 7

6. The article identifies privacy concerns as a “failure mode” but treats them as a constraint to be noted rather than a challenge to be addressed. Privacy is arguably the most significant barrier to widespread adoption of individual participant data sharing, particularly in clinical, educational, and cross-national research contexts. For the article, we have not only privacy issues (as described in Section Implementation and governance options), but copyright (and economic) issues. How the article is treated or used by LLM must be disclosed by the publisher and shared with the original author. Not all authors can agree to let the article be ingested by LLMs for future training.

Reply

The original collapsed several distinct concerns into “privacy.” The revision separates them. On data:

“The framework does not require universal public release of raw data. Instead, each article should expose the most reusable package compatible with ethical, legal, and practical constraints…”

On confidential research-object content, §Navigating implementation risks and inequities specifies:

“The dual-audience architecture must therefore support tiered access, secure data enclaves, and remote-execution or synthetic-data solutions so that code and metadata remain reusable even when raw data cannot be widely shared.”

On AI uses of articles, §A roadmap for federated stewardship disaggregates four uses too often collapsed under “LLM ingestion”:

“Publisher AI policies should disaggregate uses often collapsed under ‘LLM ingestion’: retrieval or indexing of the published article; retrieval-augmented generation over the article and its research-object package; model training or fine-tuning; and retention or analysis of reader, author, and reviewer interaction logs.”

Disclosure obligations follow:

“Publishers should disclose at submission and publication which uses are permitted, which are opt-in or opt-out, what is retained, whether third-party vendors receive content, and whether interaction logs feed product development or model training.”

Authors retain rights-reservation options consistent with applicable law. Training, in particular, is no longer treated as a default consequence of publication.

Comment 8

7. The implementation roadmap (Tiers 0-3) is presented without evidence that the proposed tiers are calibrated to actual barriers to adoption. The claim that “most labs can implement Tiers 0-1 now” is asserted rather than documented. Managerial and technological barriers are arguably the main challenges to adoption that we can find in almost all the new proposals. Empirical literature on the determinants of open data adoption, including training barriers, time costs, incentive misalignment (I really suggest highlighting this aspect), and institutional risk aversion, is not cited.

Reply

§A dual-audience architecture for the AI era reports the empirical baseline:

“a 2022 audit of empirical psychology articles found immediately accessible raw data in 14% of cases and analysis scripts in 8.5%, with time costs, limited training, privacy exposure, uncertain standards, and absent credit for curation as the main constraints.”

On feasibility versus normalization:

“Tiers 0–1 use infrastructure that is technically mature and common in some fields but unevenly normalized in behavioral science; Tier 2 is technically demanding but achievable at the lab level using widely available containerization… Tier 3 requires community infrastructure no individual lab can supply alone.”

On incentives and institutional support, §Institutional incentives and career reform adds:

“Recognition alone is insufficient. Institutions must provide funding, computational resources, version-control systems, and technical training that make comprehensive documentation feasible rather than burdensome.”

Incentive misalignment, training, and institutional support are now treated as binding rather than incidental constraints.

Comment 9

8. Finally, in Figure 1, the block on “demographic blindness” is very important for the analysis of primary data collected via questionnaires and for the field of psychology, but it is not always applicable to other types of analysis. Every type of research presents similar distortions depending on the context. For example, in economic analysis we may encounter the same problem if we do not specify the size of companies in terms of number of employees or revenues. The same applies to comparisons between universities using enrolments or academic staff. Therefore, I believe that the main issue is not solely linked to demographic data (which may perhaps represent the main problem in psychology), but to the lack of contextual data. I suggest updating Figure 1 with a section relating to this concept (perhaps something like ‘Insufficiency of contextual data’) and, particularly for work on primary data/questionnaires, focusing attention on demographic data. This may contribute to the generalizability of the proposed framework.

Reply

The reframing from “demographic” to “contextual” is right, and the revision adopts it. §The volume–fragmentation spiral now states:

“A third failure compounds these problems: contextual blindness conceals effect heterogeneity. In psychology and behavioral science specifically, the most consequential omissions are demographic—age, sex, race/ethnicity, and socioeconomic status are routinely underreported.”

It then generalizes:

“The same logic extends beyond demographics—to firm size and industry in economics; school resources and teacher characteristics in education; dose, provider, and fidelity in intervention research; and stimuli and tasks across experimental psychology. The general requirement is documentation of the variables over which a claim is intended to generalize.”

Figure 1 now reads “contextual (e.g., demographic) blindness,” preserving the behavioral-science motivation while broadening the category.

Minor issue 1

Comment 10

The manuscript uses “AI system”, “AI tools”, “AI agents”, “LLM-based pipeline”, and “generative AI” in ways that are not always consistent or clearly distinguished. A brief terminological table or definitional paragraph at the outset would reduce ambiguity.

Reply

Definitional paragraph added in the Introduction:

“Here ‘AI system’ denotes an LLM-based pipeline that combines generative models with tools, external memory, and deterministic analysis modules, rather than a standalone language model. I use ‘Generative AI’ for the broader class of models that produce text, code, or images; ‘AI agent’ for an AI system delegated to perform multi-step tool use within scoped task limits…; and ‘AI tools’ for user-facing utilities such as summarizers or citation matchers that do not involve delegated task execution.”

Minor issue 2

Comment 11

Give to all the sections classical research articles Section Name (i.e., the first Section does not have the “Introduction” Section Name). Clarify if this is a Perspective or a Research/Review article. The Section division does not help in this sense, since the Method Section is not presented and the Framework is not presented after a “Literature Review” or “Background” Section. Please improve the article structure.

Reply

The article is identified as a Perspective in the Introduction (“Against that background, this Perspective makes four contributions”), and the section structure is reorganized into the standard Perspective sequence: Introduction; The volume–fragmentation spiral; A dual-audience architecture for the AI era; Restructuring peer review for executable verification; Constructing living evidence networks; Navigating implementation risks and inequities; A roadmap for federated stewardship; Conclusions and outlook.

Minor issue 3

Comment 12

There are two DOIs links that are not currently working (even if I checked manually and the DOIs are correct). Please fix the link error:
• Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience.
• Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., … & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific data, 8(1), 192.

Reply

Both DOIs are corrected. Tedersoo et al.: https://doi.org/10.1038/s41597-021-00981-0. Appukuttan et al.: https://doi.org/10.1007/s12021-022-09598-z.

Reviewer 4: Stian Soiland-Reyes

Comment 1

This article proposes a framework for publishing academic articles with a dual purpose: to reach both human and AI readers.

Reply

To clarify a point implicit in this summary: the architecture is layered, not bifurcated. §A dual-audience architecture for the AI era now states:

“The two audiences engage the same artifacts differently: machines parse code, schemas, ontology mappings, and trial-level tables, while any reader can inspect them directly—their semantics are explicit and their behavior executable, rather than asserted in prose that cannot be run.”

Machine readability does not entail sacrificing human interpretability—that would defeat the proposal.

Comment 2

The ideas and motivation are in principle well-reasoned, however this work is hampered by a lack of background research. Notably the article does not have a good notion of Background or Existing Work, these are mainly mentioned in passing and not contrasted against the proposed framework. Notably the paper claims the perspective builds on FAIR and open science principles, but these are mainly ignored for the rest of the article.

Reply

A fair criticism. The revised Introduction now traces FAIR, FORCE11/Beyond the PDF, the Research Object architecture, RO-Crate, Frictionless Data, GigaScience, eLife executable articles, EBRAINS Live Papers, RIO nanopublications, RDA, GO FAIR, and the Barcelona Declaration, and locates the manuscript’s contribution in what those efforts deliberately leave underspecified—the domain-specific semantic layer for behavioral science. Table 1 maps existing infrastructures to the proposed tiers, showing where each tier already has precursors in adjacent disciplines and what remains missing for psychology.

Comment 3

For instance, the concept of Research Object (https://doi.org/10.1016/j.future.2011.08.004) introduced the idea of machine-readable appendices from 2009 onwards, but this seems not acknowledged by this manuscript. There have been whole conferences named “Beyond The PDF” by initiatives like Force11. The Force 11 manifesto is recommended reading. Research Data Alliance (RDA) has worked on open research practices and FAIR principles since 2013. GO FAIR initiative is backed by several government initiatives.

Likewise the FAIR principles have argued for machine-actionable metadata and data for two decades. Many research domains such as biodiversity, life sciences and biomedical are well advanced on use of FAIR, with persistent identifiers, repositories, ontologies etc. are established best practice as part of publication processes, although arguably not consistently referenced from corresponding academic articles. Psychology was one of the first fields encouraging use of reproducible code and using pre-registrations (see for instance https://doi.org/10.1177/21582440231205390). Several journals like Gigascience or RIO Journal have machine-actionable measures like embedding computational workflows and nanopublications.

Reply

Thanks for the suggestions; all are now cited and used. The revised Introduction acknowledges Research Objects (Bechhofer et al.), FORCE11/Beyond the PDF, the RDA, GO FAIR, GigaScience/GigaDB, RIO nanopublications, eLife executable articles, EBRAINS Live Papers, and the Barcelona Declaration. The specific paper on psychology and reproducibility, Mullen 2024 (https://doi.org/10.1177/21582440231205390), is cited as ref 27, supporting the explicit acknowledgment that:

“Psychology has not been a bystander: preregistration, Registered Reports, and reproducible code pipelines aim to reform how behavioral research is conducted and reported.”

The biodiversity and life-science exemplars are also acknowledged: the Introduction notes that “many research domains have advanced FAIR implementation through persistent identifiers, repositories, and ontologies” before specifying that the psychological and behavioral-science semantic layer remains comparatively underspecified.

Comment 4

Overall the article presents its framework as a new proposal, but I feel by ignoring all previous work in this area of improving scholarly communication to be machine-readable, the genuinely useful proposals from the framework (such as embedding machine-actionable reproducibility checks and audit reports into the publication pipeline) would be undermined. A major revision of the article would need to put the framework in context of the existing work, and suggest how it can be (or already is) implemented.

Reply

The framework is not presented as a new machine-actionable publishing paradigm; it is presented as a specification of what behavioral science still needs on top of the substrate the prior community has built, and of the role AI plays in using it. The Introduction states:

“Packaging standards provide a substrate for meaning but do not, by themselves, supply the domain-specific semantic layer.”

Table 1 names existing infrastructure tier by tier (RRIDs, RO-Crate, Frictionless Data, Croissant, Databrary, eLife executable articles, Code Ocean, EBRAINS Live Papers, nanopublications, Schol-AR) and, for each entry, identifies what it does not yet do for psychological constructs, stimuli, tasks, and trial-level schemas. The useful proposals (publication-layer reproducibility checks, jingle–jangle audits, evidence-network exports) are now embedded in this prior work rather than presented as standalone contributions.

Comment 5

There is no evaluation provided of the proposed framework, or any suggestion of how its realization could be evaluated. Notably the manuscript itself is submitted as only a PDF and do not follow its own mantra, there is no machine-readable appendix package attached. Following a review of existing methodologies and background, a revised manuscript could attempt to show the capabilities of the existing techniques, e.g. it can include any of Frictionless Data package, RO-Crate, Croissant-ML appendix packages. It is not clear from the article why LLMs, which primarily are fed from natural language text, would be better suited with structured machine-readable appendices. For instance, StructGPT (https://doi.org/10.48550/arXiv.2305.09645) makes this point.

Reply

Three parts to this.

On evaluation. §A roadmap for federated stewardship now specifies an empirical evaluation path:

“Whether the proposed structure pays for itself is an empirical question. Comparing AI extraction accuracy, citation accuracy, table recovery, reproduction success, and hallucination rate across PDF-only, machine-readable text, and full structured-package conditions would establish where the marginal benefit justifies the marginal author cost.”

Feasibility metrics are added alongside performance metrics:

“author preparation time per tier, curator labor per submission, code-execution success rate (distinct from full reproduction success), frequency of privacy-driven exceptions, reviewer burden, and downstream reuse.”
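
As a rough illustration of the comparison quoted above, the marginal benefit of each step up in structure can be set against the marginal author cost. The sketch below is not part of the manuscript: the condition keys, the function name `marginal_gains`, and all numbers are hypothetical placeholders chosen only to make the arithmetic visible, not measurements.

```python
# Conditions ordered from least to most structured (names are assumptions).
CONDITIONS = ["pdf_only", "machine_readable_text", "full_structured_package"]


def marginal_gains(metric_by_condition, cost_by_condition, order=CONDITIONS):
    """For each step up in structure, return (step, gain, added_cost, gain_per_cost)."""
    rows = []
    for prev, curr in zip(order, order[1:]):
        gain = metric_by_condition[curr] - metric_by_condition[prev]
        added_cost = cost_by_condition[curr] - cost_by_condition[prev]
        # Guard against a zero cost difference rather than dividing by zero.
        ratio = gain / added_cost if added_cost else float("inf")
        rows.append((f"{prev} -> {curr}", gain, added_cost, ratio))
    return rows


# Placeholder values only, NOT empirical results: extraction accuracy (0-1)
# and author preparation time in hours for each condition.
accuracy = {"pdf_only": 0.5, "machine_readable_text": 0.625, "full_structured_package": 0.75}
hours = {"pdf_only": 0.0, "machine_readable_text": 2.0, "full_structured_package": 8.0}

rows = marginal_gains(accuracy, hours)
for step, gain, added_cost, ratio in rows:
    print(f"{step}: +{gain:.3f} accuracy for +{added_cost:.1f} h ({ratio:.4f} per hour)")
```

The same function could be applied to any of the listed metrics (citation accuracy, reproduction success, hallucination rate with the sign reversed), one metric at a time.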

On why structured packages benefit LLM-based systems. §Restructuring peer review for executable verification now states the StructGPT-style argument explicitly and cites Jiang et al. (StructGPT, EMNLP 2023, ref 39):

“Structured packages also let an LLM delegate reading to deterministic interfaces and reserve generation for synthesis and explanation; an analogous division has been shown to outperform direct text serialization of structured content in multi-step reasoning.”

Structured inputs are not better because LLMs read them better as text. They are better because they let the model offload retrieval, lookup, and arithmetic to deterministic tools and reserve generation for the steps where it adds value.
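
A minimal sketch of that division of labor, assuming a trial-level table exposed by a structured appendix: lookup and arithmetic run in ordinary code, and only the resulting number would be handed to a model for explanation. The table contents, field names, and function names are hypothetical and do not come from the manuscript or from StructGPT.

```python
from statistics import mean

# Hypothetical trial-level rows, as a structured appendix might expose them.
TRIALS = [
    {"participant": "p01", "condition": "congruent", "rt_ms": 512},
    {"participant": "p01", "condition": "incongruent", "rt_ms": 604},
    {"participant": "p02", "condition": "congruent", "rt_ms": 498},
    {"participant": "p02", "condition": "incongruent", "rt_ms": 630},
]


def mean_rt(trials, condition):
    """Deterministic lookup plus arithmetic: mean response time for one condition."""
    values = [t["rt_ms"] for t in trials if t["condition"] == condition]
    if not values:
        raise ValueError(f"no trials for condition {condition!r}")
    return mean(values)


def condition_effect(trials, treatment, control):
    """Difference in mean response time (treatment minus control)."""
    return mean_rt(trials, treatment) - mean_rt(trials, control)


# The model would call this tool instead of reading a serialized table
# and estimating the difference in generated text.
effect = condition_effect(TRIALS, "incongruent", "congruent")
print(effect)
```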

On the apparent contradiction of submitting a PDF-only argument for machine-readable scholarship. Thanks for the note. First, this is a Perspective, not an empirical paper: there is no data, code, or analytic output to package as an executable research object. The article-internal demonstration the comment requests would have to take the form of an article-text serialization (Markdown source, vector figures, machine-readable reference metadata, an RO-Crate or Frictionless manifest enumerating those files) rather than a data-and-code container. Second, the present submission is to MetaROR’s open meta-research peer review, not yet to a final journal; the eventual journal submission is a separate stage. The structured deliverables will be supplied alongside the journal version, where they correspond to what the manuscript itself prescribes for the article-text layer.
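
For concreteness, a manifest enumerating the article-text files could look roughly like the sketch below, which loosely follows RO-Crate 1.1 conventions (a metadata descriptor, a root dataset, and one entity per file). The file names, the `build_manifest` helper, and the crate name are hypothetical; a complete crate would need further properties (for example a description, publication date, and license) and should be checked against the RO-Crate specification.

```python
import json


def build_manifest(files, name):
    """Build a minimal RO-Crate-style JSON-LD manifest for (path, media_type) pairs."""
    graph = [
        {
            # Metadata descriptor pointing at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset listing every packaged file.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": path} for path, _ in files],
        },
    ]
    graph += [
        {"@id": path, "@type": "File", "encodingFormat": media_type}
        for path, media_type in files
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical article-text layer: Markdown source, a vector figure, reference metadata.
FILES = [
    ("manuscript.md", "text/markdown"),
    ("figures/figure1.svg", "image/svg+xml"),
    ("references.bib", "application/x-bibtex"),
]

manifest = build_manifest(FILES, name="Article-text package (illustrative)")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```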

Comment 6

The writing needs significant improvements; for instance, “jingle-jangle” is mentioned twice before it is explained on page 3. While MetaROR does not require any particular article format, the sections are not structured enough for an academic article, and the text reads more like a blog post. As there is a lack of implementation, it can perhaps be improved to become an article in the type of an Opinion piece, but it would still need to relate its proposal with existing work.

Reply

Three changes. First, jingle–jangle is now defined at first use in the Abstract:

“jingle–jangle measurement fallacies (same label, distinct constructs; different labels, same construct).”

Second, the article is identified as a Perspective in the Introduction, and the relation to existing work is now substantial (see, e.g., replies to Comments 2–4). Third, the section structure is reorganized into a more standard Perspective sequence: Introduction; The volume–fragmentation spiral; A dual-audience architecture for the AI era; Restructuring peer review for executable verification; Constructing living evidence networks; Navigating implementation risks and inequities; A roadmap for federated stewardship; Conclusions and outlook.

The prose throughout has been tightened to remove conversational asides and to align section transitions with the genre.
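
The jingle–jangle definition quoted in this reply also translates directly into a simple audit over ontology-mapped measures. The sketch below assumes each measure is recorded as a (label, construct identifier) pair; the labels, identifiers, and the function name `jingle_jangle_audit` are hypothetical, not the manuscript's implementation.

```python
from collections import defaultdict


def jingle_jangle_audit(measures):
    """Flag jingle (one label, several constructs) and jangle (one construct, several labels)."""
    constructs_by_label = defaultdict(set)
    labels_by_construct = defaultdict(set)
    for label, construct_id in measures:
        constructs_by_label[label].add(construct_id)
        labels_by_construct[construct_id].add(label)
    jingle = {lab: sorted(ids) for lab, ids in constructs_by_label.items() if len(ids) > 1}
    jangle = {cid: sorted(labs) for cid, labs in labels_by_construct.items() if len(labs) > 1}
    return jingle, jangle


# Hypothetical mappings from scale labels to ontology construct identifiers.
MEASURES = [
    ("trait_x", "construct:A"),
    ("trait_x", "construct:B"),  # same label, distinct constructs -> jingle
    ("trait_y", "construct:C"),
    ("trait_z", "construct:C"),  # different labels, same construct -> jangle
]

jingle, jangle = jingle_jangle_audit(MEASURES)
print("jingle:", jingle)
print("jangle:", jangle)
```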
