Published at MetaROR
May 6, 2026
Restructuring scientific papers for human and AI readers
Originally published on January 21, 2026 at:
Editors
Kathryn Zeiler
Alex Holcombe
Editorial assessment
by Alex Holcombe
Reviewers found the manuscript’s motivation, to restructure scientific publishing to serve both human and AI readers, worthwhile and timely. The reviewers suggested that the manuscript would be improved by engaging more with previous work, including initiatives related to different elements of the proposal such as the FAIR principles, the Research Object concept that included the idea of machine-readable appendices, the Force11 manifesto, and machine-actionable publishing efforts at some journals, each of which has arguably contributed to making scientific articles more machine-readable. Additionally, the reviewers saw the manuscript’s treatment of how AI systems actually process scientific literature as needing an update, and felt it would benefit from discussion of risks such as AI hallucination in the proposed automated auditing functions.
Reviewers also raised some concerns about practical implementation, including a need for evidence and/or a stronger argument to back the proposed tiered adoption roadmap. It would also be worthwhile for the framework to make explicit its apparent assumption of a certain level of data and code sharing, which remains uncommon in many disciplines. One reviewer, an ethicist, raised concerns about accountability if significant portions of a paper are written for AI consumption, as that could make them less directly interpretable by humans.
Recommendations for enhanced transparency
- Add author ORCID iD.
- Add a competing interest statement. Authors should report all competing interests, including not only financial interests, but any role, relationship, or commitment of an author that presents an actual or perceived threat to the integrity or independence of the research presented in the article. If no competing interests exist, authors should explicitly state this.
- Add a funding source statement. Authors should report all funding in support of the research presented in the article. Grant reference numbers should be included. If no funding sources exist, explicitly state this in the article.
For more information on these recommendations, please refer to our author guidelines.
Peer review 1
This paper argues for a new bifurcated model of scientific papers: a human-readable section and an AI-readable section. The rationale is to better enable AI systems to synthesize the scientific literature, to facilitate peer review, and to promote more rigor in science. The idea is original, important, and merits further discussion and debate. I would like to raise some philosophical and ethical concerns with the idea that the authors do not adequately address.
1. Who would take responsibility for the paper? The human? The AI? It seems that humans could not be responsible for large sections of the paper, because those sections would be written by AIs so as to be understandable to AIs. But if humans can’t really take responsibility for what has been done, this creates a very dangerous situation. The AI sections of the paper could contain dangerous information, for example on how to make bioweapons, that is understandable to AIs but not to humans, and the humans could do nothing about it. “Human in the loop” is a big theme in AI ethics, but what the authors have proposed seems to take humans dangerously out of the loop.
2. Since authorship and responsibility go hand in hand, this of course raises major authorship questions.
3. It seems it also raises epistemological issues about knowledge that transcends human understanding (can this even be considered knowledge?) and about the democratization of science. Making science even more technical than it already is seems to make it even less democratic.
4. It also seems that this approach might be more applicable to some fields than others: for example, to computer science and other highly technical disciplines, but not to the humanities (philosophy? law?) and maybe not even to social science. This needs to be addressed.
5. Deskilling of humans is an issue too. If we get the dumbed-down version, we get dumber.
See Resnik, DB, Hosseini, M & Hauswald, R. Autonomous artificial intelligence, scientific research, and human values. AI Ethics 6, 141 (2026). https://doi.org/10.1007/s43681-025-00908-0. This article touches on human in the loop issues and related issues with respect to AI agents, which raise similar concerns.
Peer review 2
The manuscript proposes a new format for research articles that includes front-loading summaries of the work for human readers and executable appendices that include the data, code and other research artefacts necessary so that machines can execute the analyses reported in the article.
There are a number of things that I like about the proposal such as leveraging new digital technologies for article publishing, and the focus on maximizing the reproducibility and openness of scholarly work. I like the idea of front-loading the article with information to make it easier for readers to grasp content and decide whether it is relevant to them. I could imagine something similar to the eLife summary, that includes structured designations of rigor and novelty to assist with consistent assessment across articles. I liked the mention in the perspective about a clear signal about the limitations of the work – I view this as a key trust signal valuable to readers and missed some further elaboration of what that would look like.
I also like the idea of an AI‑generated audit report for peer reviewers. Many journals already apply AI-based checks on papers, so expanding that and making it available for papers that proceed to peer review would add transparency on journal processes, and may prevent instances of reviewers adding the papers to an AI tool (against journal policy) to generate summaries.
At the same time, I have some questions and concerns as to how the implementation of the proposed format would work in practice. The proposal appears to focus on technological opportunity without accounting for the level of adoption of certain practices needed for implementation. The machine-actionable package described requires a foundation of practices toward data and code sharing and detailed methodological reporting. This is not commonplace across papers and disciplines, and there is no discussion about the challenges that would arise from an implementation that is only applicable to articles where all associated research objects are shared and the full methodology reported.
Conceptually, I also have a concern about perpetuating a framing where data, code and other open objects are presented as ‘appendices’ or corollary to the ‘article’ rather than as research contributions on their own merit. I would argue that given the current digital platforms available, the argument for appendices or supplementary materials is weak. Objects originating from a research project can be deposited in repositories or other platforms and given persistent identifiers and associated metadata. These can then be linked to the article. On this basis, the option of having those materials already exists, and the current need relates to better systems on the journal side to link to other objects, make those connections visible in the research information ecosystem, and potentially, as discussed in the perspective, bring those into the article environment to enable greater scrutiny and re-analysis. Figure 3B points articles -> repositories in relation to information flow; I would be interested in a flow that leverages open outputs shared in repositories, where the direction is repository -> article, to enrich the information provided in the article narrative.
The text refers to articles several times as PDFs; this does not account for the fact that many journals use formats such as XML that are machine readable. I acknowledge that important contents of an article are not machine readable, but it’d be worth noting that there are already formats in place for articles that are machine readable.
In the discussion of risks, it’d be worth noting the risk that the executable article leads to a proliferation of yet more articles, given the low bar to creating aggregate datasets and analyses, for example in the form of irrelevant meta-analyses. There have been examples of this, e.g. from the large-scale reuse of the NHANES database: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152
With regard to the aggregation of information across articles, aggregators exist that index content from different journals and other platforms (e.g. Google Scholar). Admittedly this only covers a portion of the information about articles and does not provide executable options, but one challenge relates to the availability/openness of metadata provided for articles. This is something to consider for a system where the potential for large-scale analysis relies on information flows from journals.
I felt that the section on incentives is underdeveloped. The section mentions that CoARA advocates for recognition for different research objects, but this appears at odds with the suggestion to place associated data and code as appendices within an article. There is also no discussion on how the proposed article format aligns with research assessment reform efforts, or how it would facilitate recognition for a greater diversity of research contributions as part of assessment processes.
Peer review 3
Anonymous reviewer
Note on the manuscript version: this review is based on version V4.
The article proposes a dual-audience framework for restructuring scientific papers so that they serve both human readers and AI systems. The author claims that behavioral and social sciences face a crisis of volume overload and epistemic fragmentation, caused by incomparable stimulus databases, jingle-jangle measurement errors, demographic blindness, and inaccessible raw data, and that current AI tools, while useful for summarization, cannot resolve these problems. The proposed solution consists of two main components: a narrative layer for human readers, organized around the key findings, and a machine-readable structured appendix containing executable containerized environments, ontologically mapped constructs, persistent stimulus identifiers, and data. The article describes how these elements could support automated jingle-jangle auditing, AI-assisted peer review, and continuously updated “living evidence networks” and systematic reviews/meta-analyses.
However, as currently written, the manuscript seems to be more a Perspective article than a research article. The contribution is difficult to isolate from what has already been proposed by FAIR, open-science initiatives, and the Barcelona Declaration. The four original contributions are restatements of existing proposals (or are underdeveloped in the manuscript itself). Methodologically, the engagement with AI systems is not well developed and does not reflect the current state of LLM-based research pipelines. There are also several terminological and practical problems. These issues require revision before the manuscript can be considered for publication.
Major issues
1. The manuscript opens by claiming four original contributions that distinguish it from FAIR, open-science, and reproducibility initiatives. This distinction is unconvincing. The dual-audience paper architecture is described as pairing a front-loaded narrative with a machine-actionable structured appendix to achieve fully reproducible research “rather than treating supplementary materials as an afterthought”. Yet this is the design motivation of declarations and initiatives such as the Barcelona Declaration. The manuscript needs to demonstrate how its proposal differs from, or improves on, these other initiatives and frameworks in any technically or conceptually meaningful way.
2. The minimum viable structured appendix (Contribution 2) is presented as a novel specification, but its four elements (executable containers, standardized identifiers, ontological mappings, and data) are very similar to existing requirements in some highly reputable journals. Personally, I find the fourth contribution (the living systematic reviews idea) very interesting and important. In this respect, I think that the LLMs, the ontological map, and the data can be used in two ways: the text (i.e., the manuscript) and the ontological map can be used by an AI agent to construct a topic-specific RAG, Knowledge-RAG, or the recently proposed LLM wiki, while the data can be used to update the analysis. I kindly ask the author whether this is the direction the proposed framework intends to take and, if so, to expand on this part. At the same time, this kind of future raises another very important question (which may be beyond the author’s central focus): who maintains the “Living Evidence Networks”?
3. The manuscript’s central motivation is that manuscripts must be restructured for “AI systems”, but the treatment of those systems is thin and does not reflect current AI developments. The AI field moves very fast, so the architecture and proposal must also be adaptable and flexible. Specifically, the manuscript does not consider that modern document-ingestion pipelines for LLM-based research agents do not use appendices or manuscript text as described; they typically work through Markdown conversion, chunked embeddings, and/or retrieval-augmented generation over parsed text. Also, AI agents could in future develop scripts that fully reproduce the code for the methodology part of an article, provided it is well described. The AI bottleneck for literature analysis is not only the absence of structured appendices (excluding data) but also PDF-to-text parsing failures, differing table formats, images (which, to date, are the most difficult for a classical LLM to analyze), and citation disambiguation. I think that publishing articles in Markdown could be a way for the proposed architecture (and for journal publishers) to really support AI research pipelines.
4. The article proposes that AI perform automatic jingle-jangle audits (Box 1). However, it does not discuss the risk that the AI might hallucinate incorrect ontological relationships between constructs, creating a scientific “false truth” that is even harder to eradicate because it is “validated by the system”. A manual validation step, or maybe a Human in the Loop approach, can help to avoid this problem.
5. The manuscript is single-authored, but the author employs first-person plural: “we propose”, “we argue”, “we introduce”, “our Perspective”, “our framework”, “our goal”. Revise to singular first person (“I propose”, “I argue”) or use the objective/passive voice.
6. The article identifies privacy concerns as a “failure mode” but treats them as a constraint to be noted rather than a challenge to be addressed. Privacy is arguably the most significant barrier to widespread adoption of individual participant data sharing, particularly in clinical, educational, and cross-national research contexts. For the article itself, there are not only privacy issues (as described in Section Implementation and governance options) but also copyright (and economic) issues. How the article is treated or used by LLMs must be disclosed by the publisher and shared with the original author. Not all authors will agree to let their article be ingested by LLMs for future training.
7. The implementation roadmap (Tiers 0-3) is presented without evidence that the proposed tiers are calibrated to actual barriers to adoption. The claim that “most labs can implement Tiers 0-1 now” is asserted rather than documented. Managerial and technological barriers are arguably the main challenges to adoption that we can find in almost all the new proposals. Empirical literature on the determinants of open data adoption, including training barriers, time costs, incentive misalignment (I really suggest highlighting this aspect), and institutional risk aversion, is not cited.
8. Finally, in Figure 1, the block on “demographic blindness” is very important for the analysis of primary data collected via questionnaires and for the field of psychology, but it is not always applicable to other types of analysis. Every type of research presents similar distortions depending on the context. For example, in economic analysis we may encounter the same problem if we do not specify the size of companies in terms of number of employees or revenues. The same applies to comparisons between universities using enrolments or academic staff. Therefore, I believe that the main issue is not solely linked to demographic data (which may perhaps represent the main problem in psychology), but to the lack of contextual data. I suggest updating Figure 1 with a section relating to this concept (perhaps something like ‘Insufficiency of contextual data’) and, particularly for work on primary data/questionnaires, focusing attention on demographic data. This may contribute to the generalizability of the proposed framework.
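As an illustration of the ingestion route described in major issue 3, the following is a minimal sketch of the chunking step that sits between Markdown conversion and embedding. It is not taken from the manuscript or from any specific pipeline: the names `Chunk` and `chunk_markdown` and the word-based window are assumptions made for this example, and the embedding and retrieval steps are omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str  # nearest Markdown heading above the chunk
    text: str     # chunk body, at most max_words words


def chunk_markdown(markdown: str, max_words: int = 50) -> list[Chunk]:
    """Split Markdown into heading-scoped chunks of at most max_words words."""
    chunks: list[Chunk] = []
    heading = "(no heading)"
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered words under the current heading in fixed-size windows.
        for start in range(0, len(buffer), max_words):
            chunks.append(Chunk(heading, " ".join(buffer[start:start + max_words])))
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section before switching heading
            heading = match.group(1).strip()
        else:
            buffer.extend(line.split())
    flush()
    return chunks
```

Keeping the heading with each chunk is one simple way to preserve section context for later retrieval; real pipelines often add overlap between windows and handle tables and figures separately.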
Minor issues
The manuscript uses “AI system”, “AI tools”, “AI agents”, “LLM-based pipeline”, and “generative AI” in ways that are not always consistent or clearly distinguished. A brief terminological table or definitional paragraph at the outset would reduce ambiguity.
Give all sections the section names of a classical research article (e.g., the first section lacks the “Introduction” heading). Clarify whether this is a Perspective or a Research/Review article. The section division does not help in this respect, since there is no Method section and the Framework is not preceded by a “Literature Review” or “Background” section. Please improve the article structure.
Two DOI links are not currently working (although I checked manually and the DOIs are correct). Please fix the link errors:
- Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience.
- Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., … & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific data, 8(1), 192.
Peer review 4
This article proposes a framework for publishing academic articles with the dual purpose of reaching both human and AI readers.
The ideas and motivation are in principle well-reasoned, however this work is hampered by a lack of background research. Notably the article does not have a good notion of Background or Existing Work, these are mainly mentioned in passing and not contrasted against the proposed framework. Notably the paper claims the perspective builds on FAIR and open science principles, but these are mainly ignored for the rest of the article.
For instance, the concept of the Research Object (https://doi.org/10.1016/j.future.2011.08.004) introduced the idea of machine-readable appendices from 2009 onwards, but this is not acknowledged in the manuscript. There have been whole conferences named “Beyond The PDF” by initiatives like Force11. The Force11 manifesto is recommended reading. The Research Data Alliance (RDA) has worked on open research practices and FAIR principles since 2013. The GO FAIR initiative is backed by several government initiatives.
Likewise the FAIR principles have argued for machine-actionable metadata and data for two decades. Many research domains such as biodiversity, life sciences and biomedicine are well advanced in their use of FAIR: persistent identifiers, repositories, ontologies etc. are established best practice as part of publication processes, although arguably not consistently referenced from the corresponding academic articles. Psychology was one of the first fields to encourage reproducible code and pre-registration (see for instance https://doi.org/10.1177/21582440231205390). Several journals like GigaScience or RIO Journal have machine-actionable measures such as embedding computational workflows and nanopublications.
Overall the article presents its framework as a new proposal, but I feel that by ignoring all previous work on making scholarly communication machine-readable, it undermines the genuinely useful proposals in the framework (such as embedding machine-actionable reproducibility checks and audit reports into the publication pipeline). A major revision of the article would need to put the framework in the context of existing work, and suggest how it can be (or already is) implemented.
There is no evaluation of the proposed framework, nor any suggestion of how its realization could be evaluated. Notably, the manuscript itself is submitted only as a PDF and does not follow its own mantra: there is no machine-readable appendix package attached. Following a review of existing methodologies and background, a revised manuscript could attempt to show the capabilities of existing techniques, e.g. by including a Frictionless Data package, RO-Crate, or Croissant-ML appendix package. It is not clear from the article why LLMs, which are primarily fed natural-language text, would be better served by structured machine-readable appendices. For instance, StructGPT (https://doi.org/10.48550/arXiv.2305.09645) makes this point.
The writing needs significant improvement; for instance, “jingle-jangle” is mentioned twice before it is explained on page 3. While MetaROR does not require any particular article format, the sections are not structured enough for an academic article, and the text reads more like a blog post. As there is a lack of implementation, it could perhaps be reworked into an Opinion piece, but it would still need to relate its proposal to existing work.
Author response
Note: A revised version of this article is available at https://osf.io/preprints/psyarxiv/c46hs_v6.
Editor
Original comment
Reviewers found the manuscript’s motivation, to restructure scientific publishing to serve both human and AI readers, worthwhile and timely. The reviewers suggested that the manuscript would be improved by engaging more with previous work, including initiatives related to different elements of the proposal such as the FAIR principles, the Research Object concept that included the idea of machine-readable appendices, the Force11 manifesto, and machine-actionable publishing efforts at some journals, each of which has arguably contributed to making scientific articles more machine-readable. Additionally, the reviewers saw the manuscript’s treatment of how AI systems actually process scientific literature as needing an update, and felt it would benefit from discussion of risks such as AI hallucination in the proposed automated auditing functions.
Reviewers also raised some concerns about practical implementation, including a need for evidence and/or a stronger argument to back the proposed tiered adoption roadmap. It would also be worthwhile for the framework to make explicit its apparent assumption of a certain level of data and code sharing, which remains uncommon in many disciplines. One reviewer, an ethicist, raised concerns about accountability if significant portions of a paper are written for AI consumption, as that could make them less directly interpretable by humans.
Reply
Thank you for organizing the review and the summary report. The revision addresses the three lines of criticism: insufficient engagement with prior machine-actionable publishing work; an underdescribed and outdated treatment of how AI systems process scientific literature; and an implementation roadmap that overclaimed feasibility and elided data-sharing and accountability constraints.
Per-reviewer replies follow with section pointers and quoted passages; the principal changes are summarized here:
- Reframed with a formal Introduction and an extensive background on FAIR, FORCE11/Beyond the PDF, the Research Object architecture, RO-Crate, Frictionless Data, GigaScience, eLife executable articles, EBRAINS Live Papers, RIO nanopublications, the Research Data Alliance, GO FAIR, and the Barcelona Declaration. The contribution is now clarified to focus on AI assistance and the missing behavioral-science semantic layer.
- Replaced “appendices” with “research-object packages” throughout; data, code, materials, and schemas are framed as integral research outputs rather than annexes.
- Generalized “demographic blindness” to “contextual blindness,” with demographic underreporting as the behavioral-science instance, and revised Figure 1 to match.
- Rewrote the LLM-pipeline discussion to specify how current systems segment, embed, retrieve, and call tools, and identified the actual bottlenecks—PDF parsing, table extraction, figure-panel parsing, citation disambiguation—that structured packages address.
- Calibrated the tier roadmap against current adoption: 14% raw-data and 8.5% script accessibility in a recent psychology audit, with time, training, privacy, standards, and credit identified as binding constraints.
- Stated explicitly that the framework does not require universal open release of raw data; specified an auditable data-access model from synthetic data through controlled access to remote execution.
- Added safeguards against LLM hallucination in construct audits: deterministic checks separated from LLM inferences; model and prompt versions logged; source spans preserved; human adjudication required before any finding feeds back into shared ontologies.
- Disaggregated “LLM ingestion” into four distinct uses (retrieval/indexing, retrieval-augmented generation, model training, log retention), each with separate disclosure obligations.
- Made human authorship and accountability unambiguous; required provenance logging for all AI-assisted construction; added an explicit recalibrated-expertise requirement for readers and reviewers.
- Added Table 1 mapping deployed infrastructures (RRIDs, RO-Crate, Frictionless Data, Croissant, Databrary, eLife executable articles, Code Ocean, EBRAINS Live Papers, nanopublications, Schol-AR) to the proposed tiers and identifying what each leaves unspecified.
- Revised first-person plural to singular throughout.
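The hallucination safeguards listed above (deterministic checks kept apart from LLM inferences, logged model and prompt versions, preserved source spans, and human adjudication before release) can be pictured as a record format plus a release gate. The sketch below is only an illustration under assumed names: `AuditFinding`, `validate`, and `releasable` are hypothetical and are not taken from the revised manuscript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditFinding:
    claim: str                            # e.g. "scale A and scale B measure one construct"
    source_span: str                      # quoted text the finding rests on
    origin: str                           # "deterministic" or "llm"
    model_version: Optional[str] = None   # required when origin == "llm"
    prompt_version: Optional[str] = None  # required when origin == "llm"
    human_approved: bool = False          # set only by a human adjudicator


def validate(finding: AuditFinding) -> None:
    """Reject records that lack the provenance fields listed above."""
    if finding.origin not in ("deterministic", "llm"):
        raise ValueError(f"unknown origin: {finding.origin}")
    if not finding.source_span:
        raise ValueError("source span must be preserved")
    if finding.origin == "llm" and not (finding.model_version and finding.prompt_version):
        raise ValueError("LLM findings need model and prompt versions")


def releasable(findings: list[AuditFinding]) -> list[AuditFinding]:
    """Return only findings that may feed back into a shared ontology."""
    for finding in findings:
        validate(finding)
    # Human adjudication gates every finding, whatever its origin.
    return [f for f in findings if f.human_approved]
```

In this sketch an unapproved finding is never dropped from the log; it is simply withheld from release, so reviewers can still see what the system inferred.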
Reviewer 1: David Resnik
Comment 1
This paper argues for a new bifurcated model of scientific papers: a human-readable section and an AI-readable section. The rationale is to better enable AI systems to synthesize the scientific literature, to facilitate peer review, and to promote more rigor in science. The idea is original, important, and merits further discussion and debate. I would like to raise some philosophical and ethical concerns with the idea that the authors do not adequately address.
Reply
Thank you for the thoughtful suggestions and questions. The original framing did imply a bifurcation in which some content might be optimized for machines at the expense of human interpretability. The revision removes that unintended implication. The structured component is no longer described as an AI-only section but as a layered, human-inspectable, executable research-object package whose explicit semantics aid both human readers and AI systems. §A dual-audience architecture for the AI era now states:
“The two audiences engage the same artifacts differently: machines parse code, schemas, ontology mappings, and trial-level tables, while any reader can inspect them directly—their semantics are explicit and their behavior executable, rather than asserted in prose that cannot be run.”
Inspectability, human responsibility, and human review are now central features of the architecture rather than safeguards added to it.
Comment 2
- Who would take responsibility for the paper? The human? The AI? It seems that humans could not be responsible for large sections of the paper, because those sections would be written by AIs so as to be understandable to AIs. But if humans can’t really take responsibility for what has been done, this creates a very dangerous situation. The AI sections of the paper could contain dangerous information, for example on how to make bioweapons, that is understandable to AIs but not to humans, and the humans could do nothing about it. “Human in the loop” is a big theme in AI ethics, but what the authors have proposed seems to take humans dangerously out of the loop.
Reply
The bioweapons framing suggests that a research-object package full of executable code and dense schemas could in principle conceal content from the very humans asked to vouch for it. The revision rejects that possibility on two levels.
First, the structured package is not an autonomous AI-authored layer. §A dual-audience architecture for the AI era now states:
“Writing for AI systems is therefore not to route content past human inspection but a way to expose structure that prose tends to obscure. Authors carry the same accountability for every component of the package as for the narrative.”
And §Papers as queryable research environments now states the position cleanly:
“Machines have no authorship standing; responsibility for every component rests with the humans who verify and submit it.”
Provenance logging applies to all machine-assisted construction:
“Any AI-assisted construction—code drafting, ontology mapping, data annotation—is logged in the same provenance layer with model and prompt versions, so that authors and reviewers can distinguish what machines drafted from what humans verified.”
Second—and this is the point the original missed—inspectability is not free. A package of containers, schemas, and ontology mappings is more verifiable than a methods paragraph only if reviewers and readers have the competencies to inspect it. The revision states this directly. §A dual-audience architecture for the AI era now reads:
“as generative AI enters reading, review, and synthesis, expertise must be recalibrated rather than bypassed—researchers need the skills to direct AI systems, discern errors in their outputs, and check machine-generated summaries or analyses against domain standards.”
The framework therefore presupposes the building of those reviewer competencies, not their substitution by machines. Provenance logs give reviewers a place to look; the recalibration requirement makes clear that human-in-the-loop oversight depends on humans whose loop has been updated.
Comment 3
- Since authorship and responsibility go hand in hand, this of course raises major authorship questions.
Reply
The original treatment was indirect. §Papers as queryable research environments now states the position cleanly:
“Machines have no authorship standing; responsibility for every component rests with the humans who verify and submit it.”
AI assistance is logged separately rather than absorbed silently into authorship:
“Any AI-assisted construction—code drafting, ontology mapping, data annotation—is logged in the same provenance layer with model and prompt versions, so that authors and reviewers can distinguish what machines drafted from what humans verified.”
Comment 4
- It seems it also raises epistemological issues about knowledge that transcends human understanding (can this even be considered knowledge?) and about the democratization of science. Making science even more technical than it already is seems to make it even less democratic.
Reply
The worry as posed presupposes that machine-readable structure is opaque to humans, so that science would “transcend” human understanding when it became machine-tractable. With respect, the framework runs in the opposite direction.
A typed schema, an executable container, and an ontology mapping are not less inspectable than a methods paragraph; they are more so. A schema is exhaustive about every variable, its type, its units, and its allowed values. A container is exhaustive about software versions and runtime conditions. An ontology mapping is explicit about which constructs an instrument is intended to measure. A prose methods section, by contrast, is a compressed, interpretive narration in which assumptions are routinely implicit and details routinely lost. The structured layer makes those assumptions explicit and contestable. The form of knowledge that should worry us is the kind that cannot be checked at all because the necessary materials are absent—not the kind that becomes more checkable when more of it is exposed.
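To make the point about schemas concrete, here is a minimal sketch of a typed schema that states each variable's type, unit, and allowed values, together with a check that reports violations in plain language. The variable names and the `check_row` helper are hypothetical, chosen for illustration; they are not part of the manuscript's specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variable:
    name: str
    dtype: type
    unit: Optional[str] = None
    allowed: Optional[frozenset] = None  # None means no enumerated value set


# Hypothetical schema for a reaction-time dataset.
SCHEMA = [
    Variable("rt", float, unit="ms"),
    Variable("condition", str, allowed=frozenset({"congruent", "incongruent"})),
]


def check_row(row: dict) -> list[str]:
    """Return human-readable problems found in one data row."""
    problems = []
    for var in SCHEMA:
        if var.name not in row:
            problems.append(f"missing variable: {var.name}")
            continue
        value = row[var.name]
        if not isinstance(value, var.dtype):
            problems.append(f"{var.name}: expected {var.dtype.__name__}")
        elif var.allowed is not None and value not in var.allowed:
            problems.append(f"{var.name}: value {value!r} not allowed")
    return problems
```

A reader who cannot follow the analysis code can still read the schema and the list of reported problems, which is the sense in which the structured layer is open to inspection.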
The democratization concern has force where it applies (to the requirement of computational and statistical literacy for reading at depth), but it does not turn on opacity. It turns on whether the framework lowers or raises the cost of inspecting evidence. By design it lowers it: a reader who today cannot reproduce a result because raw data and code are inaccessible can, under the proposed architecture, at minimum rerun the analysis and at most probe its assumptions. §Inverting the narrative now states:
“Front-loading does not substitute summary for substance; it is the top layer of a drill-down architecture—claim, evidence, scope, methods, code, data—through which readers can probe to whatever depth their question demands.”
And the conclusion adds:
“AI systems make evidence easier to query, but their outputs remain interpretations requiring human judgment, theoretical context, and accountability.”
The framework therefore narrows, rather than widens, the gap between stated claims and inspectable evidence.
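As an illustration of the point that a typed schema is exhaustive about each variable, its type, its units, and its allowed values, the following is a minimal sketch only. The variable names, ranges, and the `validate_row` helper are hypothetical and are not taken from the manuscript.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VariableSpec:
    """One column of a trial-level data table (all names are hypothetical)."""
    name: str
    dtype: type
    units: Optional[str] = None
    allowed_range: Optional[Tuple[float, float]] = None
    allowed_values: Optional[frozenset] = None


# Hypothetical schema for a simple reaction-time task.
SCHEMA = (
    VariableSpec("participant_id", str),
    VariableSpec("reaction_time", float, units="ms", allowed_range=(0.0, 10000.0)),
    VariableSpec("condition", str,
                 allowed_values=frozenset({"congruent", "incongruent"})),
)


def validate_row(row: dict, schema=SCHEMA) -> list:
    """Return human-readable violations; an empty list means the row conforms."""
    problems = []
    for spec in schema:
        if spec.name not in row:
            problems.append(f"{spec.name}: missing")
            continue
        value = row[spec.name]
        if not isinstance(value, spec.dtype):
            problems.append(
                f"{spec.name}: expected {spec.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if spec.allowed_range is not None:
            low, high = spec.allowed_range
            if not (low <= value <= high):
                problems.append(f"{spec.name}: {value} outside [{low}, {high}]")
        if spec.allowed_values is not None and value not in spec.allowed_values:
            problems.append(f"{spec.name}: {value!r} is not an allowed value")
    # Columns the schema does not declare are also reported.
    declared = {spec.name for spec in schema}
    for name in sorted(set(row) - declared):
        problems.append(f"{name}: not declared in schema")
    return problems


good = {"participant_id": "p01", "reaction_time": 512.0, "condition": "congruent"}
bad = {"participant_id": "p01", "reaction_time": -5.0, "condition": "neutral"}
print(validate_row(good))
print(validate_row(bad))
```

Every assumption a prose methods section might leave implicit (units, permitted range, permitted labels) appears here as a named, contestable field that a reader can inspect without running anything.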
Comment 5
- It also seems that this approach might be more applicable to some fields than others. For example, computer science and other highly technical disciplines, but not the humanities (philosophy? law?) and maybe not even social science. This needs to be addressed.
Reply
A fair correction. The original overclaimed for empirical, data-rich science. The revision separates the full executable form of the architecture from its underlying principle. The Introduction now states:
“The architecture’s full executable form suits empirical, data-rich fields most directly; in interpretive disciplines the relevant research objects shift to corpora, editions, annotation schemes, and the provenance of coding or editorial decisions, but the underlying principle—claims linked to their inspectable evidence, methods, and revision history—holds across fields, implemented through whatever objects each discipline treats as its evidence.”
§Conclusions and outlook generalizes the point:
“The scaffold is discipline-agnostic. Psychology is the stress test because its core objects—stimulus sets, tasks, and jingle–jangle-prone constructs—are unusually tangled; other domains can substitute their own reagents, instruments, specimens, datasets—or, in interpretive fields, corpora, editions, translations, and annotation provenance.”
The strongest claims are confined to empirical, data-rich fields; the broader principle of linking claims to inspectable evidence is preserved as a generalization.
Comment 6
- Deskilling of humans is an issue too. If we get the dumbed-down version, we get dumber.
Reply
The deskilling worry presupposes that the dominant move is replacing depth with summary. The revision establishes that the front-loaded layer is the entry to depth, not a substitute for it. §Inverting the narrative now reads:
“Front-loading does not substitute summary for substance; it is the top layer of a drill-down architecture—claim, evidence, scope, methods, code, data—through which readers can probe to whatever depth their question demands.”
The relevant comparison is not between expert reading of unstructured prose and dumbed-down AI summaries; it is between current practice—in which most readers cannot inspect what they take on trust—and a future in which inspection is at least possible for those equipped to do it.
Where deskilling pressure does come into play is on the reviewer and reader side: AI-assisted reading and review will not replace expertise but will demand a different kind of it. §A dual-audience architecture for the AI era states:
“as generative AI enters reading, review, and synthesis, expertise must be recalibrated rather than bypassed—researchers need the skills to direct AI systems, discern errors in their outputs, and check machine-generated summaries or analyses against domain standards.”
Comment 7
See Resnik, DB, Hosseini, M & Hauswald, R. Autonomous artificial intelligence, scientific research, and human values. AI Ethics 6, 141 (2026). https://doi.org/10.1007/s43681-025-00908-0. This article touches on human in the loop issues and related issues with respect to AI agents, which raise similar concerns.
Reply
Cited. The reference now anchors the discussion of human oversight, accountability, and recalibrated expertise in §A dual-audience architecture for the AI era, supporting the position that human-in-the-loop oversight requires both retained authorial responsibility and the ongoing development of reviewer competencies for AI-assisted reading, review, and synthesis.
Reviewer 2: Iratxe Puebla
Comment 1
The manuscript proposes a new format for research articles that includes front-loading summaries of the work for human readers and executable appendices that include the data, code and other research artefacts necessary so that machines can execute the analyses reported in the article.
Reply
Thank you for the careful engagement. Before I address the specific comments below, let me begin with the terminology change: “Appendices” is replaced by “research-object packages” throughout. The Abstract reads:
“I propose restructuring scientific papers for dual audiences: front-loaded narratives for time-pressed human readers, paired with research-object packages containing executable code, semantic annotations, and tidy trial-level data.”
§Papers as queryable research environments opens with: “Front-loading must be paired with a citable research-object package.”
Comment 2
There are a number of things that I like about the proposal such as leveraging new digital technologies for article publishing, and the focus on maximizing the reproducibility and openness of scholarly work. I like the idea of front-loading the article with information to make it easier for readers to grasp content and decide whether it is relevant to them. I could imagine something similar to the eLife summary, that includes structured designations of rigor and novelty to assist with consistent assessment across articles. I liked the mention in the perspective about a clear signal about the limitations of the work – I view this as a key trust signal valuable to readers and missed some further elaboration of what that would look like.
Reply
Thank you—the trust-signal framing is an interesting way to think about what the front-loaded layer should deliver, and the original underspecified it. §Inverting the narrative now describes what the layer must contain:
“The opening paragraphs should answer, in order—What did you discover? Why does it matter? How does it change our understanding?—and specify the evidence type (confirmatory, exploratory, descriptive, simulation-based, or causal), the population and setting in which the claim holds, the moderators or assumptions most likely to overturn it, and direct links to the scripts, containers, data tables, and robustness checks that reproduce or probe the headline numbers.”
The trust signal is therefore operational: front-loading reports not only the result but its evidential status, boundary conditions, and reproducibility path.
Comment 3
I also like the idea of an AI‑generated audit report for peer reviewers. Many journals already apply AI-based checks on papers, so expanding that and making it available for papers that proceed to peer review would add transparency on journal processes, and may prevent instances of reviewers adding the papers to an AI tool (against journal policy) to generate summaries.
Reply
§Restructuring peer review for executable verification reframes audit reports as inputs to human reviewers, not substitutes:
“Human reviewers receive both the manuscript and an AI-generated audit report and can direct their attention to interpretive claims, novelty, and theoretical significance—judgments that require domain expertise and that automated checks cannot make.”
§A roadmap for federated stewardship adds the governance issue raised here about reviewers using external AI tools:
“Platforms hosting AI–paper interactions must protect the privacy of user queries and interaction logs. Second, they must distinguish policies for confidential peer-review materials from those for published content, and avoid sending unpublished manuscripts to external commercial AI systems that may retain proprietary data.”
Sanctioned audit reports thus replace, rather than ride alongside, the ad hoc external use that current journal policies cannot effectively prohibit.
Comment 4
At the same time, I have some questions and concerns as to how the implementation of the proposed format would work in practice. The proposal appears to focus on technological opportunity without accounting for the level of adoption of certain practices needed for implementation. The machine-actionable package described requires a foundation of practices toward data and code sharing and detailed methodological reporting. This is not commonplace across papers and disciplines, and there is no discussion about the challenges that would arise from an implementation that is only applicable to articles where all associated research objects are shared and the full methodology reported.
Reply
The original was too optimistic. §A dual-audience architecture for the AI era now distinguishes technical feasibility from institutional adoption and reports the empirical baseline:
“Tiers 0–1 use infrastructure that is technically mature and common in some fields but unevenly normalized in behavioral science… The lower tiers themselves are technically available but not institutionally normalized: a 2022 audit of empirical psychology articles found immediately accessible raw data in 14% of cases and analysis scripts in 8.5%, with time costs, limited training, privacy exposure, uncertain standards, and absent credit for curation as the main constraints.”
The same section drops the universal-openness premise:
“The framework does not require universal public release of raw data. Instead, each article should expose the most reusable package compatible with ethical, legal, and practical constraints: open code and metadata at minimum, plus an auditable data-access model ranging from synthetic or de-identified demonstration data, through controlled-access repositories, to remote-execution interfaces in which reviewers can run code against protected data without downloading them.”
Implementation is therefore framed as a staged, constraint-sensitive adoption problem, not a presumed technological leap.
Comment 5
Conceptually, I also have a concern about perpetuating a framing where data, code and other open objects are presented as ‘appendices’ or corollary to the ‘article’ rather than as research contributions on their own merit. I would argue that given the current digital platforms available, the argument for appendices or supplementary materials is weak. Objects originating from a research project can be deposited in repositories or other platforms and provided persistent identifiers and associated metadata. These can then be linked to the article. On this basis, the option of having those materials already exists and the current need relates to better systems on the journal side to link to other objects, make those connections visible in the research information ecosystem, and potentially, as discussed in the perspective, bring those into the article environment to enable greater scrutiny and re-analysis. Figure 3B points articles -> repositories in relation to information flow, I would be interested in a flow that leverages open outputs shared in repositories where the direction is repository -> article to enrich the information provided in the article narrative.
Reply
This was the most consequential of the conceptual corrections, and it shaped multiple sections of the revision. “Appendices” is replaced as indicated above. §Papers as queryable research environments treats data, code, materials, and schemas as integral outputs:
“Provenance metadata binds these components to the article and to one another, making each object independently citable and the package as a whole auditable.”
Figure 3 is now bidirectional, as you suggested:
“Articles and their research-object packages contribute effect sizes, moderators, and quality indicators into domain-specific repositories, and repository-curated updates—new estimates, retraction notices—flow back to each article, so that information moves in both directions rather than only from article to repository.”
The incentives section credits these objects directly:
“Institutions should revise hiring, promotion, and tenure criteria to credit the contributions a research-object package makes visible: curated datasets, executable code, validated stimuli and measures, ontological mappings, and the systematic consensus-building that yields shared terminologies and methodological standards.”
The article is now framed as one component of a linked research ecosystem rather than the sole scholarly product to which other objects are attached.
Comment 6
The text refers to articles several times as PDFs; this does not account for the fact that many journals use formats such as XML that are machine readable. I acknowledge that important contents of an article are not machine readable, but it’d be worth noting that there are already formats in place for articles that are machine readable.
Reply
A fair correction. The Introduction now reads:
“Some publishers already provide JATS XML or HTML serializations alongside the rendered PDF, which reduces parsing errors at the article-text level; the more consequential gap is that the research objects on which verification depends—data, code, stimuli, protocols, schemas, and provenance—are often absent or weakly linked.”
§A roadmap for federated stewardship adds an implementation recommendation:
“Publishers should also treat machine-readable serializations of the article itself—JATS XML and Markdown alongside the rendered PDF—as standard deliverables rather than typesetting byproducts; this relatively low-cost change removes the parsing layer on which most AI ingestion errors are concentrated.”
The argument is preserved—typed, linked, executable research objects are the deeper bottleneck—but PDF is not the only relevant article format.
Comment 7
In the discussion of risks, it’d be worth noting the risk of the executable article leading to a proliferation of yet more articles, given the low bar to create aggregate datasets and analyses, for example in the form of irrelevant meta-analyses. There have been examples of this, e.g. from the large-scale reuse of the NHANES database: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152
Reply
An important concern (and reality) and a useful pointer. §Navigating implementation risks and inequities now states:
“A related risk is the proliferation of formulaic secondary literatures. AI-ready public datasets can be mined into single-factor association papers that ignore interactions, choose subsets selectively, skip multiple-testing correction, and—at the extreme—feed paper-mill production lines; a recent analysis of NHANES-derived publications documents this pattern at scale.”
The remedy follows:
“preregistration of confirmatory reuse, principled justification of subset selection with multiple-testing correction, reuse identifiers for high-value datasets, and editorial screening for formulaic designs. Living evidence networks should label exploratory reuse separately from confirmatory evidence and weight syntheses by design quality rather than publication count.”
The infrastructure that makes reuse cheap should also make reuse auditable.
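To make the multiple-testing point concrete, here is a minimal sketch of one standard correction, the Holm-Bonferroni step-down procedure. The choice of this particular procedure is an assumption for illustration (the manuscript does not prescribe a method), and the p-values are made up.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    under the Holm step-down procedure at family-wise error rate alpha."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Made-up p-values from four single-factor association tests on one dataset.
p = [0.001, 0.04, 0.03, 0.20]
uncorrected = [value <= 0.05 for value in p]
corrected = holm_bonferroni(p)
print(uncorrected)  # [True, True, True, False]
print(corrected)    # [True, False, False, False]
```

Three of the four associations look significant without correction, and one survives it. Screening that checks whether a reuse paper reports the full family of tests it ran is what lets an editor or a living evidence network tell these two situations apart.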
Comment 8
With regard to the aggregation of information across articles, aggregators exist that index content from different journals and other platforms (e.g. Google Scholar). Admittedly this only covers a portion of the information about articles and does not provide executable options, but one challenge relates to the availability/openness of metadata provided for articles. This is something to consider for a system where the potential for large-scale analysis relies on information flows from journals.
Reply
§Constructing living evidence networks distinguishes living networks from general discovery aggregators:
“This indexing function is distinct from general discovery aggregators such as Google Scholar or OpenAlex, which do not generally expose full typed provenance and retraction metadata.”
Federated stewardship is specified:
“Journals provide article-object metadata; repositories host versioned executable research objects; domain societies or designated registry boards maintain construct and effect-size records; curators adjudicate contested mappings and retractions; and aggregators index typed links among articles, data, code, and synthesis nodes.”
The implementation recommendation invokes existing relationship metadata so the connections are not built from scratch:
“They should likewise expose typed bidirectional links between each article and the objects in its package—datasets, code, stimuli, protocols, preregistrations—using existing Crossref and DataCite relationship metadata so that the connections are visible to both readers and machines.”
Large-scale synthesis depends on open, typed, bidirectional metadata linking articles to their research objects and downstream syntheses, not merely on indexing articles.
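A minimal sketch of what "typed, bidirectional" means in practice is given below. The relation names are inverse pairs from the DataCite `relationType` vocabulary; the identifiers and the `with_inverse_links` helper are hypothetical, and the sketch is not a client for any Crossref or DataCite API.

```python
# Inverse pairs from the DataCite relationType vocabulary.
INVERSE = {
    "IsSupplementTo": "IsSupplementedBy",
    "IsSupplementedBy": "IsSupplementTo",
    "Cites": "IsCitedBy",
    "IsCitedBy": "Cites",
    "Documents": "IsDocumentedBy",
    "IsDocumentedBy": "Documents",
    "IsDerivedFrom": "IsSourceOf",
    "IsSourceOf": "IsDerivedFrom",
}


def with_inverse_links(links):
    """Given (subject, relation, object) triples, return the set of triples
    closed under inversion, so every link can be followed from either end."""
    closed = set(links)
    for subject, relation, obj in links:
        closed.add((obj, INVERSE[relation], subject))
    return closed


# Hypothetical identifiers; none of these resolve to real objects.
ARTICLE = "10.1234/example.article"
DATASET = "10.1234/example.dataset"
CODE = "10.1234/example.code"

declared = [
    (ARTICLE, "IsSupplementedBy", DATASET),
    (CODE, "IsSupplementTo", ARTICLE),
]
for triple in sorted(with_inverse_links(declared)):
    print(triple)
```

Because each relation is typed, an aggregator can distinguish a dataset that supplements an article from one that merely cites it, and because each link has an inverse, a reader who starts from the dataset can find the article as easily as the reverse.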
Comment 9
I felt that the section on incentives is underdeveloped. The section mentions that CoARA advocates for recognition for different research objects, but this appears at odds with the suggestion to place associated data and code as appendices within an article. There is also no discussion on how the proposed article format aligns with research assessment reform efforts, or how it would facilitate recognition for a greater diversity of research contributions as part of assessment processes.
Reply
You are right. §Institutional incentives and career reform now ties the framework to research-assessment reform directly:
“This aligns with the Coalition for Advancing Research Assessment (CoARA) and the Declaration on Research Assessment (DORA), which call for assessment of diverse outputs beyond journal impact factors, and with the CRediT taxonomy, which already provides standardized roles—data curation, software, resources, validation, visualization—through which contributors receive explicit credit for the work of creating those objects, not only for co-authorship of the article.”
Object-level metrics extend this:
“Beyond explicit credit, citation and reuse metrics for data and software render object-level contributions measurable, so that assessment systems can weight evidence of impact at the object level rather than article authorship alone.”
Reviewer 3
Comment 1
Note for the Author’s manuscript: this review is based on the version V4. The article proposes a dual-audience framework for restructuring scientific papers so that they serve both human readers and AI systems. The author claims that behavioral and social sciences face a crisis of volume overload and epistemic fragmentation, caused by incomparable stimulus databases, jingle-jangle measurement errors, demographic blindness, and inaccessible raw data, and that current AI tools, while useful for summarization, cannot resolve these problems. The proposed solution consists of two main components: a narrative layer for human readers, organized around the key findings and a machine-readable structured appendix containing executable containerized environments, ontologically mapped constructs, persistent stimulus identifiers, and data. The article describes how these elements could support automated jingle-jangle auditing, AI-assisted peer review, and continuously updated “living evidence networks” and systematic reviews/meta-analyses. However, as currently written, the manuscript seems to be more a Perspective article than a research article. The contribution is difficult to isolate from what is already proposed by prior FAIR, open-science, and Barcelona declaration. The four original contributions are restatements of existing proposals (or underdeveloped in the manuscript itself). Methodologically, the engagement with AI systems is not well expanded and does not reflect the current state of LLM-based research pipelines. There are also several terminological and practical problems. These issues require revision before the manuscript can be considered for publication.
Reply
Thank you for the thoughtful review. You are correct that the manuscript is explicitly a Perspective. The Introduction situates the contribution against FAIR, FORCE11, Research Objects, RO-Crate, GigaScience, RIO, EBRAINS, RDA, GO FAIR, and the Barcelona Declaration; the LLM-pipeline discussion is rewritten; and the terminology—including “AI system,” “AI agent,” “AI tools,” “Generative AI,” and the replacement of “appendices” with “research-object packages”—is fixed throughout. The framework’s claim is now narrower and, I believe, more defensible: not a new platform or standard per se, but a specification of the role of AI and the missing semantic layer that existing infrastructure does not by itself supply for behavioral science.
Comment 2
1. The manuscript opens by claiming four original contributions that distinguish it from FAIR, open-science, and reproducibility initiatives. This distinction is unconvincing. The dual-audience paper architecture is described as pairing a front-loaded narrative with a machine-actionable structured appendix to have full reproducible research “rather than treating supplementary materials as an afterthought”. Yet this is the design motivation of declarations and initiatives such as the Barcelona Declaration. The manuscript needs to demonstrate how its proposal differs from, or improves on, these other initiatives and frameworks in any technically or conceptually meaningful way.
Reply
The revised Introduction acknowledges the prior infrastructure work:
“This Perspective extends two decades of work on machine-actionable scholarship. Data, metadata, tools, and workflows must be findable, accessible, interoperable, and reusable (FAIR). The FORCE11 community and the ‘Beyond the PDF’ movement argued that articles should treat data, software, and protocols as integral research objects rather than appendages; the Research Object (RO) architecture formalized how to bundle them with provenance and attribution.”
The contribution is then specified:
“Packaging standards provide a substrate for meaning but do not, by themselves, supply the domain-specific semantic layer. RO-Crate can bundle a stimulus set; it does not specify which persistent identifier scheme applies to affective images, which normative dimensions must be recorded, or whether the construct labeled ‘executive function’ in one container refers to the same phenomenon as in another.”
The Perspective therefore claims a domain specification rather than a packaging standard.
Comment 3
- The minimum viable structured appendix (Contribution 2) is presented as a novel specification, but its four elements (executable containers, standardized identifiers, ontological mappings, and data) are very similar to existing requirements in some highly reputable journals. Personally, I find the fourth contribution (the living systematic reviews idea) very interesting and important. However, I think that LLMs, the ontological map, and the data can be used in two ways: the text information (i.e., the manuscript) and the ontological map can be used by an AI agent to construct a specific RAG, Knowledge-RAG, or the recently proposed LLM wiki on the specific topic, while the data can be used to update the analysis. I kindly ask the author whether this is the direction the framework intends to take and, if so, whether he can expand on this part. At the same time, this kind of future raises another very important question (that is maybe beyond the author’s central focus): who maintains the “Living Evidence Networks”?
Reply
The two-layer characterization captures the right distinction. §Constructing living evidence networks now separates them explicitly:
“Such networks have two coupled layers: a knowledge layer, in which article text, claims, citations, methods, and ontological mappings support retrieval-augmented or graph-based queries over what has been claimed, disputed, replicated, or retracted; and a synthesis layer, in which effect estimates, IPD summaries, contextual moderators, preregistration status, data and code versions, and quality indicators feed living meta-analyses.”
This is the direction described in the comment: RAG and graph-based retrieval over the text and ontology layer; updated synthesis over the data and effect-size layer. On governance:
“Maintaining living evidence networks requires federated stewardship. Journals provide article-object metadata; repositories host versioned executable research objects; domain societies or designated registry boards maintain construct and effect-size records; curators adjudicate contested mappings and retractions; and aggregators index typed links among articles, data, code, and synthesis nodes.”
Each network requires a designated steward “empowered to issue versioned releases, adjudicate mappings, and log decisions with revision history.” Living networks are framed as a federated stewardship problem, not a technical consequence of better article structure.
Comment 4
- The manuscript’s central motivation is that manuscripts must be restructured for “AI systems” but the treatment of those systems is thin and does not reflect AI developments. AI world is very fast, so the architecture and proposal also must be adaptable and flexible in this sense. Specifically, the manuscript does not consider that modern document-ingestion pipelines for LLM-based research agents do not use appendices or manuscript text as described; they typically work through Markdown conversion, chunked embeddings, and/or retrieval-augmented generation over parsed text. Also, AI Agents in future could theoretically develop scripts to fully reproduce the code regarding the methodology part of the article if it is well described. The AI bottleneck for literature analysis is not only the absence of structured appendices (excluding data) but rather PDF-to-text parsing failures, different table formats, images (that, up to date, are the most difficult to analyze for a classical LLM), and citation disambiguation. I think that the Markdown format of the articles can be the possible way for the proposed architecture (and for Journal publishers) to really push on AI research pipelines.
Reply
The Introduction now describes how current systems actually ingest articles:
“Currently, such pipelines convert source files into segmented text, index those segments with vector and keyword retrieval, and call tools for tasks such as code execution under LLM-orchestrated planning.”
The specific bottlenecks are stated in turn:
“Each step introduces potential errors: PDF parsing loses reading order, equations, and metadata; table extraction remains brittle; figure-panel parsing and citation disambiguation likewise rely on inference from rendered pages rather than typed structure.”
On agent-based reconstruction of methods, §Restructuring peer review for executable verification now states:
“Such agent-generated reconstruction is a useful fallback when no original code exists, but it is not (currently) a substitute for the executable workflow an author can package with the article.”
Markdown and JATS XML are added to the implementation recommendations as the article-level deliverables that remove most of the parsing layer where AI ingestion errors concentrate:
“Publishers should also treat machine-readable serializations of the article itself—JATS XML and Markdown alongside the rendered PDF—as standard deliverables rather than typesetting byproducts.”
Comment 5
4. The article proposes that AI perform automatic jingle-jangle audits (Box 1). However, it does not discuss the risk that the AI might hallucinate incorrect ontological relationships between constructs, creating a scientific “false truth” that is even harder to eradicate because it is “validated by the system”. A manual validation step, or maybe a Human in the Loop approach, can help to avoid this problem.
Reply
The original treated AI audit outputs as if they were authoritative; they are not. Box 1 now states:
“The audits screen for construct redundancy; they do not validate construct identity, which remains a theoretical judgment.”
The hallucination risk is addressed directly:
“Curator reports therefore separate deterministic checks from LLM-generated inferences, log model and prompt versions, preserve source spans and code or data references, and require human adjudication before any finding feeds back into shared ontologies or downstream syntheses.”
Human adjudication is the gating step before any machine-suggested construct relationship enters shared infrastructure.
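The gating logic can be sketched as follows. This is an illustrative data model only: the field names, the `Source` labels, and the two helper functions are hypothetical and do not describe an existing curator tool.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    DETERMINISTIC = "deterministic"  # e.g. a rule-based or statistical check
    LLM = "llm"                      # an inference generated by a language model


@dataclass
class Finding:
    claim: str
    source: Source
    source_span: str           # pointer to the text, code, or data behind the claim
    model_version: str = ""    # must be logged for LLM findings
    prompt_version: str = ""   # must be logged for LLM findings
    adjudicated_by: str = ""   # human curator who approved; empty means pending


def check_logged(finding: Finding) -> None:
    """Raise ValueError if the finding lacks the provenance it must carry."""
    if not finding.source_span:
        raise ValueError("every finding must preserve its source span")
    if finding.source is Source.LLM and not (
        finding.model_version and finding.prompt_version
    ):
        raise ValueError("LLM findings must log model and prompt versions")


def approved_for_ontology(findings):
    """Return only findings a human has adjudicated; the rest stay pending."""
    approved = []
    for finding in findings:
        check_logged(finding)
        if finding.adjudicated_by:
            approved.append(finding)
    return approved


report = [
    Finding("scale A and scale B share most of their items",
            Source.DETERMINISTIC, "items.csv", adjudicated_by="curator-1"),
    Finding("construct X is equivalent to construct Y",
            Source.LLM, "methods, paragraph 3",
            model_version="model-v1", prompt_version="prompt-v2"),
]
print([f.claim for f in approved_for_ontology(report)])
```

The LLM-generated equivalence claim is retained in the report with its provenance, but it does not reach the shared ontology until a named curator signs off, which is the separation the revised Box 1 asks for.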
Comment 6
5. The manuscript is single-authored, but the author employs first-person plural: “we propose”, “we argue”, “we introduce”, “our Perspective”, “our framework”, “our goal”. Revise to singular first person (“I propose”, “I argue”) or use the objective/passive voice.
Reply
Fixed throughout. The Abstract now opens with “I propose…”; the Introduction with “I argue…”; remaining first-person plural is replaced by singular, by the framework as the agent (“the framework specifies…”), or by passive constructions where the actor is unimportant.
Comment 7
6. The article identifies privacy concerns as a “failure mode” but treats them as a constraint to be noted rather than a challenge to be addressed. Privacy is arguably the most significant barrier to widespread adoption of individual participant data sharing, particularly in clinical, educational, and cross-national research contexts. For the article, we have not only privacy issues (as described in Section Implementation and governance options), but copyright (and economic) issues. How the article is treated or used by LLM must be disclosed by the publisher and shared with the original author. Not all authors can agree to let the article be ingested by LLMs for future training.
Reply
The original collapsed several distinct concerns into “privacy.” The revision separates them. On data:
“The framework does not require universal public release of raw data. Instead, each article should expose the most reusable package compatible with ethical, legal, and practical constraints…”
On confidential research-object content, §Navigating implementation risks and inequities specifies:
“The dual-audience architecture must therefore support tiered access, secure data enclaves, and remote-execution or synthetic-data solutions so that code and metadata remain reusable even when raw data cannot be widely shared.”
On AI uses of articles, §A roadmap for federated stewardship disaggregates four uses too often collapsed under “LLM ingestion”:
“Publisher AI policies should disaggregate uses often collapsed under ‘LLM ingestion’: retrieval or indexing of the published article; retrieval-augmented generation over the article and its research-object package; model training or fine-tuning; and retention or analysis of reader, author, and reviewer interaction logs.”
Disclosure obligations follow:
“Publishers should disclose at submission and publication which uses are permitted, which are opt-in or opt-out, what is retained, whether third-party vendors receive content, and whether interaction logs feed product development or model training.”
Authors retain rights-reservation options consistent with applicable law. Training, in particular, is no longer treated as a default consequence of publication.
Comment 8
7. The implementation roadmap (Tiers 0-3) is presented without evidence that the proposed tiers are calibrated to actual barriers to adoption. The claim that “most labs can implement Tiers 0-1 now” is asserted rather than documented. Managerial and technological barriers are arguably the main challenges to adoption that we can find in almost all the new proposals. Empirical literature on the determinants of open data adoption, including training barriers, time costs, incentive misalignment (I really suggest highlighting this aspect), and institutional risk aversion, is not cited.
Reply
§A dual-audience architecture for the AI era reports the empirical baseline:
“a 2022 audit of empirical psychology articles found immediately accessible raw data in 14% of cases and analysis scripts in 8.5%, with time costs, limited training, privacy exposure, uncertain standards, and absent credit for curation as the main constraints.”
On feasibility versus normalization:
“Tiers 0–1 use infrastructure that is technically mature and common in some fields but unevenly normalized in behavioral science; Tier 2 is technically demanding but achievable at the lab level using widely available containerization… Tier 3 requires community infrastructure no individual lab can supply alone.”
On incentives and institutional support, §Institutional incentives and career reform adds:
“Recognition alone is insufficient. Institutions must provide funding, computational resources, version-control systems, and technical training that make comprehensive documentation feasible rather than burdensome.”
Incentive misalignment, training, and institutional support are now treated as binding rather than incidental constraints.
Comment 9
8. Finally, in Figure 1, the block on “demographic blindness” is very important for the analysis of primary data collected via questionnaires and for the field of psychology, but it is not always applicable to other types of analysis. Every type of research presents similar distortions depending on the context. For example, in economic analysis we may encounter the same problem if we do not specify the size of companies in terms of number of employees or revenues. The same applies to comparisons between universities using enrolments or academic staff. Therefore, I believe that the main issue is not solely linked to demographic data (which may perhaps represent the main problem in psychology), but to the lack of contextual data. I suggest updating Figure 1 with a section relating to this concept (perhaps something like ‘Insufficiency of contextual data’) and, particularly for work on primary data/questionnaires, focusing attention on demographic data. This may contribute to the generalizability of the proposed framework.
Reply
The reframing from “demographic” to “contextual” is right, and the revision adopts it. §The volume–fragmentation spiral now states:
“A third failure compounds these problems: contextual blindness conceals effect heterogeneity. In psychology and behavioral science specifically, the most consequential omissions are demographic—age, sex, race/ethnicity, and socioeconomic status are routinely underreported.”
It then generalizes:
“The same logic extends beyond demographics—to firm size and industry in economics; school resources and teacher characteristics in education; dose, provider, and fidelity in intervention research; and stimuli and tasks across experimental psychology. The general requirement is documentation of the variables over which a claim is intended to generalize.”
Figure 1 now reads “contextual (e.g., demographic) blindness,” preserving the behavioral-science motivation while broadening the category.
Minor issue 1
Comment 10
The manuscript uses “AI system”, “AI tools”, “AI agents”, “LLM-based pipeline”, and “generative AI” in ways that are not always consistent or clearly distinguished. A brief terminological table or definitional paragraph at the outset would reduce ambiguity.
Reply
A definitional paragraph has been added to the Introduction:
“Here ‘AI system’ denotes an LLM-based pipeline that combines generative models with tools, external memory, and deterministic analysis modules, rather than a standalone language model. I use ‘Generative AI’ for the broader class of models that produce text, code, or images; ‘AI agent’ for an AI system delegated to perform multi-step tool use within scoped task limits…; and ‘AI tools’ for user-facing utilities such as summarizers or citation matchers that do not involve delegated task execution.”
Minor issue 2
Comment 11
Give to all the sections classical research articles Section Name (i.e., the first Section does not have the “Introduction” Section Name). Clarify if this is a Perspective or a Research/Review article. The Section division does not help in this sense, since the Method Section is not presented and the Framework is not presented after a “Literature Review” or “Background” Section. Please improve the article structure.
Reply
The article is identified as a Perspective in the Introduction (“Against that background, this Perspective makes four contributions”), and the section structure is reorganized into the standard Perspective sequence: Introduction; The volume–fragmentation spiral; A dual-audience architecture for the AI era; Restructuring peer review for executable verification; Constructing living evidence networks; Navigating implementation risks and inequities; A roadmap for federated stewardship; Conclusions and outlook.
Minor issue 3
Comment 12
There are two DOIs links that are not currently working (even if I checked manually and the DOIs are correct). Please fix the link error:
- Appukuttan, S., Bologna, L. L., Schürmann, F., Migliore, M. & Davison, A. P. EBRAINS Live Papers – Interactive resource sheets for computational studies in neuroscience.
- Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., … & Sepp, T. (2021). Data sharing practices and data availability upon request differ across scientific disciplines. Scientific data, 8(1), 192.
Reply
Both DOIs are corrected. Tedersoo et al.: https://doi.org/10.1038/s41597-021-00981-0. Appukuttan et al.: https://doi.org/10.1007/s12021-022-09598-z.
Reviewer 4: Stian Soiland-Reyes
Comment 1
This article proposes a framework for publishing academic articles with a duality purpose to reach both human and AI readers.
Reply
To clarify a point implicit in this summary: the architecture is layered, not bifurcated. §A dual-audience architecture for the AI era now states:
“The two audiences engage the same artifacts differently: machines parse code, schemas, ontology mappings, and trial-level tables, while any reader can inspect them directly—their semantics are explicit and their behavior executable, rather than asserted in prose that cannot be run.”
Machine readability does not entail sacrificing human interpretability—that would defeat the proposal.
Comment 2
The ideas and motivation are in principle well-reasoned, however this work is hampered by a lack of background research. Notably the article does not have a good notion of Background or Existing Work, these are mainly mentioned in passing and not contrasted against the proposed framework. Notably the paper claims the perspective builds on FAIR and open science principles, but these are mainly ignored for the rest of the article.
Reply
A fair criticism. The revised Introduction now traces FAIR, FORCE11/Beyond the PDF, the Research Object architecture, RO-Crate, Frictionless Data, GigaScience, eLife executable articles, EBRAINS Live Papers, RIO nanopublications, RDA, GO FAIR, and the Barcelona Declaration, and locates the manuscript’s contribution in what those efforts deliberately leave underspecified—the domain-specific semantic layer for behavioral science. Table 1 maps existing infrastructures to the proposed tiers, showing where each tier already has precursors in adjacent disciplines and what remains missing for psychology.
Comment 3
For instance, the concept of Research Object (https://doi.org/10.1016/j.future.2011.08.004) introduced the idea of machine-readable appendices from 2009 onwards, but this seems not acknowledged by this manuscript. There have been whole conferences named “Beyond The PDF” by initiatives like Force11. The Force 11 manifesto is recommended reading. Research Data Alliance (RDA) has worked on open research practices and FAIR principles since 2013. GO FAIR initiative is backed by several government initiatives.
Likewise the FAIR principles have argued for machine-actionable metadata and data for two decades. Many research domains such as biodiversity, life sciences and biomedical are well advanced on use of FAIR, with persistent identifiers, repositories, ontologies etc. are established best practice as part of publication processes, although arguably not consistently referenced from corresponding academic articles. Psychology was one of the first fields encouraging use of reproducible code and using pre-registrations (see for instance https://doi.org/10.1177/21582440231205390). Several journals like Gigascience or RIO Journal have machine-actionable measures like embedding computational workflows and nanopublications.
Reply
Thank you for the suggestions; all are now cited and used. The revised Introduction acknowledges Research Objects (Bechhofer et al.), FORCE11/Beyond the PDF, the RDA, GO FAIR, GigaScience/GigaDB, RIO nanopublications, eLife executable articles, EBRAINS Live Papers, and the Barcelona Declaration. The specific paper on psychology and reproducibility, Mullen 2024 (https://doi.org/10.1177/21582440231205390), is cited as ref 27, supporting the explicit acknowledgment that:
“Psychology has not been a bystander: preregistration, Registered Reports, and reproducible code pipelines aim to reform how behavioral research is conducted and reported.”
The biodiversity and life-science exemplars are also acknowledged: the Introduction notes that “many research domains have advanced FAIR implementation through persistent identifiers, repositories, and ontologies” before specifying that the psychological and behavioral-science semantic layer remains comparatively underspecified.
Comment 4
Overall the article presents its framework as a new proposal, but I feel by ignoring all previous work in this area of improving scholarly communication to be machine-readable, the genuinely useful proposals from the framework (such as embedding machine-actionable reproducibility checks and audit reports into the publication pipeline) would be undermined. A major revision of the article would need to put the framework in context of the existing work, and suggest how it can be (or already is) implemented.
Reply
The framework is not presented as a new machine-actionable publishing paradigm; it is presented as a specification of what behavioral science still needs on top of the substrate that prior community efforts have built, and of the role AI plays in that setting. The Introduction states:
“Packaging standards provide a substrate for meaning but do not, by themselves, supply the domain-specific semantic layer.”
Table 1—RRIDs, RO-Crate, Frictionless Data, Croissant, Databrary, eLife executable articles, Code Ocean, EBRAINS Live Papers, nanopublications, Schol-AR—names existing infrastructure tier by tier and, for each entry, identifies what it does not yet do for psychological constructs, stimuli, tasks, and trial-level schemas. The useful proposals—publication-layer reproducibility checks, jingle–jangle audits, evidence-network exports—are now embedded in this prior work rather than presented as standalone.
Comment 5
There is no evaluation provided of the proposed framework, or any suggestion of how its realization could be evaluated. Notably the manuscript itself is submitted as only a PDF and do not follow its own mantra, there is no machine-readable appendix package attached. Following a review of existing methodologies and background, a revised manuscript could attempt to show the capabilities of the existing techniques, e.g. it can include any of Frictionless Data package, RO-Crate, Croissant-ML appendix packages. It is not clear from the article why LLMs, which primarily are fed from natural language text, would be better suited with structured machine-readable appendices. For instance, StructGPT (https://doi.org/10.48550/arXiv.2305.09645) makes this point.
Reply
Three parts to this.
On evaluation. §A roadmap for federated stewardship now specifies an empirical evaluation path:
“Whether the proposed structure pays for itself is an empirical question. Comparing AI extraction accuracy, citation accuracy, table recovery, reproduction success, and hallucination rate across PDF-only, machine-readable text, and full structured-package conditions would establish where the marginal benefit justifies the marginal author cost.”
Feasibility metrics are added alongside performance metrics:
“author preparation time per tier, curator labor per submission, code-execution success rate (distinct from full reproduction success), frequency of privacy-driven exceptions, reviewer burden, and downstream reuse.”
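The comparison described above can be sketched as a small bookkeeping routine that sets performance gains against author cost across the three conditions. This is a rough sketch under stated assumptions: the record layout and function name are hypothetical, only two of the listed metrics are included, and the numbers are placeholders chosen for readability, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionResult:
    name: str
    extraction_accuracy: float  # fraction of items extracted correctly
    hallucination_rate: float   # fraction of outputs with unsupported claims
    author_hours: float         # author preparation time for this condition


def marginal_tradeoffs(results):
    """Compare each condition with the previous, cheaper one."""
    rows = []
    for prev, cur in zip(results, results[1:]):
        gain = cur.extraction_accuracy - prev.extraction_accuracy
        cost = cur.author_hours - prev.author_hours
        rows.append({
            "from": prev.name,
            "to": cur.name,
            "accuracy_gain": round(gain, 3),
            "extra_author_hours": cost,
            # Benefit per additional author hour; undefined if no extra cost.
            "gain_per_hour": round(gain / cost, 4) if cost > 0 else None,
        })
    return rows


# Placeholder values for illustration only; these are NOT empirical findings.
results = [
    ConditionResult("PDF-only", 0.50, 0.20, 0.0),
    ConditionResult("machine-readable text", 0.60, 0.15, 2.0),
    ConditionResult("full structured package", 0.80, 0.05, 10.0),
]
for row in marginal_tradeoffs(results):
    print(row)
```

Reporting the gain per additional author hour, rather than accuracy alone, is one way to express the point at which the marginal benefit no longer justifies the marginal author cost.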
On why structured packages benefit LLM-based systems. §Restructuring peer review for executable verification now states the StructGPT-style argument explicitly and cites Jiang et al. (StructGPT, EMNLP 2023, ref 39):
“Structured packages also let an LLM delegate reading to deterministic interfaces and reserve generation for synthesis and explanation; an analogous division has been shown to outperform direct text serialization of structured content in multi-step reasoning.”
Structured inputs are not better because LLMs read them better as text. They are better because they let the model offload retrieval, lookup, and arithmetic to deterministic tools and reserve generation for the steps where it adds value.
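A minimal sketch of that division of labour, assuming a trial-level table is available as structured records: lookup and arithmetic are done by a deterministic function, and only the verified numbers are handed to the generation step. The table contents, column names, and function names are hypothetical, and the model call itself is deliberately omitted.

```python
from statistics import mean

# Hypothetical trial-level table; values are placeholders, not real data.
TRIALS = [
    {"participant": "p01", "condition": "congruent", "rt_ms": 500},
    {"participant": "p02", "condition": "congruent", "rt_ms": 520},
    {"participant": "p01", "condition": "incongruent", "rt_ms": 580},
    {"participant": "p02", "condition": "incongruent", "rt_ms": 600},
]


def mean_rt(trials, condition):
    """Deterministic tool: filtering and arithmetic happen here, not in the model."""
    values = [t["rt_ms"] for t in trials if t["condition"] == condition]
    if not values:
        raise ValueError(f"no trials for condition {condition!r}")
    return mean(values)


def build_prompt(trials):
    """Assemble computed numbers into text for a later generation step."""
    congruent = mean_rt(trials, "congruent")
    incongruent = mean_rt(trials, "incongruent")
    return (
        f"Mean RT congruent: {congruent:.1f} ms; "
        f"incongruent: {incongruent:.1f} ms; "
        f"difference: {incongruent - congruent:.1f} ms. "
        "Explain this result in the context of the stated hypothesis."
    )


print(build_prompt(TRIALS))
```

The model never has to read or add up the table as serialized text; it receives values that can be recomputed and checked independently of the generation step.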
On the apparent contradiction of submitting a PDF-only argument for machine-readable scholarship. Thanks for the note. First, this is a Perspective, not an empirical paper: there are no data, code, or analytic outputs to package as an executable research object. The article-internal demonstration the comment requests would therefore have to take the form of an article-text serialization (Markdown source, vector figures, machine-readable reference metadata, an RO-Crate or Frictionless manifest enumerating those files) rather than a data-and-code container. Second, the present submission is to MetaROR’s open meta-research peer review, not yet to a final journal; the eventual journal submission is a separate stage. The structured deliverables will be supplied alongside the journal version, where they correspond to what the manuscript itself prescribes for the article-text layer.
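For concreteness, a manifest enumerating the article-text files could look roughly like the following, which writes out an RO-Crate 1.1 style metadata document. This is a sketch, not the planned deliverable: the file names and media types are hypothetical, and a complete crate would also carry properties such as a description, publication date, and license on the root dataset.

```python
import json

# Hypothetical file names and media types for the article-text layer.
ARTICLE_FILES = [
    ("manuscript.md", "Markdown source of the article", "text/markdown"),
    ("figures/figure1.svg", "Figure 1 (vector graphic)", "image/svg+xml"),
    ("references.json", "Machine-readable reference metadata", "application/json"),
]


def build_ro_crate(files, title):
    """Return a dict following the basic RO-Crate 1.1 metadata layout."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": title,
        # The root dataset lists every packaged file by identifier.
        "hasPart": [{"@id": path} for path, _, _ in files],
    }
    parts = [
        {"@id": path, "@type": "File", "name": name, "encodingFormat": fmt}
        for path, name, fmt in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *parts],
    }


crate = build_ro_crate(
    ARTICLE_FILES,
    "Restructuring scientific papers for human and AI readers",
)
print(json.dumps(crate, indent=2))
```

In a real package this JSON would be saved as `ro-crate-metadata.json` at the root of the directory holding the listed files.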
Comment 6
The writing and needs significant improvements, for instance “jingle-jangle” is mentioned twice before it is explained on page 3. While, MetaROR does not require any particular article format, the sections are not structured enough for an academic article, and the text read more like a blog post. As there is a lack of implementation, it can perhaps be improved to become an article in the type of an Opinion piece, but it would still need to relate its proposal with existing work.
Reply
Three changes. First, jingle–jangle is now defined at first use in the Abstract:
“jingle–jangle measurement fallacies (same label, distinct constructs; different labels, same construct).”
Second, the article is identified as a Perspective in the Introduction, and the relation to existing work is now substantial (see, e.g., replies to Comments 2–4). Third, the section structure is reorganized into a more standard Perspective sequence: Introduction; The volume–fragmentation spiral; A dual-audience architecture for the AI era; Restructuring peer review for executable verification; Constructing living evidence networks; Navigating implementation risks and inequities; A roadmap for federated stewardship; Conclusions and outlook.
The prose throughout has been tightened to remove conversational asides and to align section transitions with the genre.





