I do not think that anyone knows how to demonstrate "clarity of the
copyright status in relation to training". This makes the exception
process for AI-generated code both impossible to use and useless as a
way to inform future changes to QEMU's code provenance policies.
On the other hand, AI tools can be used as a natural language refactoring
engine for simple tasks such as modifying all callers of a given function
or even less simple ones such as adding Python type annotations.
These tasks have a very low risk of introducing training material in
the code base, and can provide noticeable time savings because they are
easily tested and reviewed; for lack of a better term, I will call
these "tasks with limited or non-existing creative content".
Allow requesting an exception on the grounds of lack of creative content,
while keeping it clear that maintainers can deny it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
docs/devel/code-provenance.rst | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
index a5838f63649..bfc659d2b4e 100644
--- a/docs/devel/code-provenance.rst
+++ b/docs/devel/code-provenance.rst
@@ -327,9 +327,17 @@ The QEMU project requires contributors to refrain from using AI content
generators without going through an exception request process.
AI-generated code will only be included in the project after the
exception request has been evaluated by the QEMU project. To be
-granted an exception, a contributor will need to demonstrate clarity of
-the license and copyright status for the tool's output in relation to its
-training model and code, to the satisfaction of the project maintainers.
+granted an exception, a contributor will need to demonstrate one of the
+following, to the satisfaction of the project maintainers:
+
+* clarity of the license and copyright status for the tool's output in
+ relation to its training model and code;
+
+* limited or non-existing creative content of the contribution.
+
+It is highly encouraged to provide background information such as the
+prompts that were used, and to not mix AI- and human-written code in the
+same commit, as much as possible.
Maintainers are not allowed to grant an exception on their own patch
submissions.
--
2.51.0
On Mon, 22 Sept 2025 at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> I do not think that anyone knows how to demonstrate "clarity of the
> copyright status in relation to training".

Yes; to me this is the whole driving force behind the policy.

> On the other hand, AI tools can be used as a natural language refactoring
> engine for simple tasks such as modifying all callers of a given function
> or even less simple ones such as adding Python type annotations.
> These tasks have a very low risk of introducing training material in
> the code base, and can provide noticeable time savings because they are
> easily tested and reviewed; for lack of a better term, I will call
> these "tasks with limited or non-existing creative content".

Does anybody know how to demonstrate "limited or non-existing
creative content", which I assume is a stand-in here for
"not copyrightable"?

-- PMM
On Mon, Sep 22, 2025 at 12:46:51PM +0100, Peter Maydell wrote:
> On Mon, 22 Sept 2025 at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > I do not think that anyone knows how to demonstrate "clarity of the
> > copyright status in relation to training".
>
> Yes; to me this is the whole driving force behind the policy.
>
> > On the other hand, AI tools can be used as a natural language refactoring
> > engine for simple tasks such as modifying all callers of a given function
> > or even less simple ones such as adding Python type annotations.
> > These tasks have a very low risk of introducing training material in
> > the code base, and can provide noticeable time savings because they are
> > easily tested and reviewed; for lack of a better term, I will call
> > these "tasks with limited or non-existing creative content".
>
> Does anybody know how to demonstrate "limited or non-existing
> creative content", which I assume is a stand-in here for
> "not copyrightable"?

That was something we aimed to intentionally avoid specifying in the
policy. It is very hard to define it in a way that will be clearly
understood by all contributors.

Furthermore, by defining it explicitly QEMU also weakens its legal
position should any issues arise, because it has pre-emptively
documented its acceptance of certain scenarios. This has the effect
of directing risk away from contributors and back onto the project.
We want to be very clear that the burden / requirement for determining
legal / license compliance of contributions rests on the contributor,
not the project, whether AI is involved or not.

In terms of historical practice, when contributors have come to us
with legal questions about whether they can contribute something, or
about the legality of a certain change, as a general rule we will
avoid giving any clear legal guidance from the project's POV.
Especially with any corporate contributor, the rule is to refer that
person back to their own organization's legal department. This makes
it clear where the responsibility is and avoids the QEMU project
pre-emptively setting out its legal interpretation.

TL;DR: I don't think we should attempt to define where the boundary
is between copyrightable and non-copyrightable code changes.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, 22 Sept 2025 at 14:05, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Mon, Sep 22, 2025 at 12:46:51PM +0100, Peter Maydell wrote:
> > On Mon, 22 Sept 2025 at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > I do not think that anyone knows how to demonstrate "clarity of the
> > > copyright status in relation to training".
> >
> > Yes; to me this is the whole driving force behind the policy.
> >
> > > On the other hand, AI tools can be used as a natural language refactoring
> > > engine for simple tasks such as modifying all callers of a given function
> > > or even less simple ones such as adding Python type annotations.
> > > These tasks have a very low risk of introducing training material in
> > > the code base, and can provide noticeable time savings because they are
> > > easily tested and reviewed; for lack of a better term, I will call
> > > these "tasks with limited or non-existing creative content".
>
> That was something we aimed to intentionally avoid specifying in the
> policy. It is very hard to define it in a way that will be clearly
> understood by all contributors.
>
> TL;DR: I don't think we should attempt to define where the boundary
> is between copyrightable and non-copyrightable code changes.

Well, this is why I think a policy that just says "no" is
more easily understandable and followable. As soon as we
start defining and granting exceptions then we're effectively
in the position of making judgements and defining the boundary.

-- PMM
On Mon, Sep 22, 2025 at 02:26:00PM +0100, Peter Maydell wrote:
> On Mon, 22 Sept 2025 at 14:05, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Mon, Sep 22, 2025 at 12:46:51PM +0100, Peter Maydell wrote:
> > > On Mon, 22 Sept 2025 at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > >
> > > > I do not think that anyone knows how to demonstrate "clarity of the
> > > > copyright status in relation to training".
> > >
> > > Yes; to me this is the whole driving force behind the policy.
> > >
> > > > On the other hand, AI tools can be used as a natural language refactoring
> > > > engine for simple tasks such as modifying all callers of a given function
> > > > or even less simple ones such as adding Python type annotations.
> > > > These tasks have a very low risk of introducing training material in
> > > > the code base, and can provide noticeable time savings because they are
> > > > easily tested and reviewed; for lack of a better term, I will call
> > > > these "tasks with limited or non-existing creative content".
> > >
> > > Does anybody know how to demonstrate "limited or non-existing
> > > creative content", which I assume is a stand-in here for
> > > "not copyrightable"?
> >
> > That was something we aimed to intentionally avoid specifying in the
> > policy. It is very hard to define it in a way that will be clearly
> > understood by all contributors.
> >
> > TL;DR: I don't think we should attempt to define where the boundary
> > is between copyrightable and non-copyrightable code changes.
>
> Well, this is why I think a policy that just says "no" is
> more easily understandable and followable. As soon as we
> start defining and granting exceptions then we're effectively
> in the position of making judgements and defining the boundary.

Whether we have our AI policy or not, contributors are still required
to abide by the terms of the DCO, which requires them to understand
the legal situation of any contribution.

Our policy is effectively saying that most use of AI is such that we
don't think it is possible for contributions to claim DCO compliance.

If we think there are situations where it might be credible for a
contributor to claim DCO compliance, we can try to find a way to
describe that situation, without having to explicitly state our
legal interpretation of the "copyrightable vs non-copyrightable"
boundary.

At KVM Forum, what was notably raised was the topic of code
refactoring and whether it is practical to allow some such usage. We
have historically allowed machine refactoring done by Coccinelle, for
example. Someone could ask an AI agent to write a Coccinelle script
for a given task, and then tell the AI to run that script across the
code base. I think that might be a situation where it would be
reasonable to accept the AI-driven refactoring, as the substance of
the commit is clearly defined by the Coccinelle script.

Could that be summarized by saying that we'll allow refactoring if
driven via an intermediate script? That is still quite a strict
definition that could frustrate much usage, but it at least feels
like something that should have greatly reduced risk compared to
direct refactoring by an opaque agent.

As an example though, we have the scripts/clean-includes.pl script
that Markus wrote for manipulating code into our preferred style for
headers. Whether the header change is done manually by a human,
automated with Markus' perl script, or automated by an AI agent, the
end result should be identical, as there is only one possible end
point and you can describe what that end point should look like.

That said, there is still a question mark over complexity. Getting to
the end point may be a trivial & mundane exercise in some cases,
while requiring considerable intellectual thought in other cases. The
latter is perhaps especially true if wanting a simple, easily
bisected series of small steps rather than a big bang conversion.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
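To make the "intermediate script" idea concrete, here is a minimal
sketch of what such a script could look like (the helper names and the
transformation are made up for illustration). The point is that the
script, not the opaque agent, is the reviewable artifact that defines
the substance of the commit:

    #!/usr/bin/env python3
    # Hypothetical intermediate script: rename every use of
    # qemu_old_helper to qemu_new_helper in the files given on the
    # command line. Reviewing this script reviews the whole change.
    import pathlib
    import re
    import sys

    PATTERN = re.compile(r"\bqemu_old_helper\b")

    for arg in sys.argv[1:]:
        path = pathlib.Path(arg)
        text = path.read_text()
        new_text = PATTERN.sub("qemu_new_helper", text)
        if new_text != text:
            path.write_text(new_text)
            print(f"rewrote {path}")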
On 9/22/25 16:03, Daniel P. Berrangé wrote:
> Whether we have our AI policy or not, contributors are still required
> to abide by the terms of the DCO, which requires them to understand
> the legal situation of any contribution.
>
> Our policy is effectively saying that most use of AI is such that we
> don't think it is possible for contributions to claim DCO compliance.
>
> If we think there are situations where it might be credible for a
> contributor to claim DCO compliance, we can try to find a way to
> describe that situation, without having to explicitly state our
> legal interpretation of the "copyrightable vs non-copyrightable"
> boundary.

Right. I am sure that a lawyer would find some overlap between my
definition of "where the creativity lies" and the law's definition of
"copyrightability", but that's not where I am coming from, and I am
not even pretending to be dispensing legal advice.

The point is more that the tool shouldn't have any bearing on DCO
compliance if the same contributor can reasonably make the same
change with different tools or with just an editor. And we have
dozens of mechanical changes contributed every year, written either
by hand or with a wide variety of tools.

I have no QEMU example at hand, but let's look at a commit like
https://github.com/bonzini/meson/commit/09765594d. Something like
this could plausibly be created with AI. What I care about is:

* to what degree can I automate what I could do by hand. An AI tool
  moves the break-even point more towards automation. I would not
  bring up Coccinelle for a 10 line change; in fact I looked by hand
  at every occurrence of ".cfg" and relied on mypy to check if I
  missed something. Maybe an experienced AI user would have reached
  for AI as the first step?[1]

* keeping people honest. Between the two cases of "they don't tell
  and I don't realize it is AI-generated" and "they split the commit
  clearly into AI-generated and human-generated parts", an exception
  makes the latter more likely to happen.

> That said, there is still a question mark over complexity. Getting
> to the end point may be a trivial & mundane exercise in some cases,
> while requiring considerable intellectual thought in other cases.
> The latter is perhaps especially true if wanting a simple, easily
> bisected series of small steps rather than a big bang conversion.

We encourage people anyway to isolate the mundane parts, therefore
they could use AI for them if they see fit. Independent of whether
the contributor has worked on QEMU before, the more complex parts are
also signed off on (and we'd much more likely spot signs of AI usage
when reviewing them), and that makes me more willing to trust their
good faith.

Paolo

[1] I tried "I want to track the PackageConfiguration object per
machine in mesonbuild/cargo/interpreter.py. Make PackageState.cfg a
PerMachine object. Initialize PackageState.cfg when the PackageState
is created. The old pkg.cfg becomes pkg.cfg[MachineChoice.HOST]" and
it did pretty much the same changes in a bit more than 2 minutes.
Including the time to write the prompt, it's almost certainly more
than it took me to do it by hand, but this time I was doing something
else in the meanwhile. :)
On Mon, Sep 22, 2025 at 05:10:24PM +0200, Paolo Bonzini wrote:
>
> I have no QEMU example at hand, but let's look at a commit like
> https://github.com/bonzini/meson/commit/09765594d. Something like
> this could plausibly be created with AI. What I care about is:

I'd agree it is something AI could likely come up with, given the
right prompt, but in terms of defining policy it conceptually
feels more like new functionality, mixed in with refactoring.

> * to what degree can I automate what I could do by hand. An AI tool
>   moves the break-even point more towards automation. I would not
>   bring up Coccinelle for a 10 line change; in fact I looked by hand
>   at every occurrence of ".cfg" and relied on mypy to check if I
>   missed something. Maybe an experienced AI user would have reached
>   for AI as the first step?[1]

What matters is not whether Coccinelle was practical to use or not,
and also not whether it was possible to express the concept in its
particular language. Rather, I'm thinking about it as a conceptual
guide for whether a change might be expressible as a plain
transformation or not. I don't think the meson change satisfies that,
because you wouldn't express the new class-level properties, or the
new get_or_create_cfg code, as an algorithmic refactoring. Those are
a case of creative coding.

> * keeping people honest. Between the two cases of "they don't tell
>   and I don't realize it is AI-generated" and "they split the commit
>   clearly into AI-generated and human-generated parts", an exception
>   makes the latter more likely to happen.
>
> [1] I tried "I want to track the PackageConfiguration object per
> machine in mesonbuild/cargo/interpreter.py. Make PackageState.cfg a
> PerMachine object. Initialize PackageState.cfg when the PackageState
> is created. The old pkg.cfg becomes pkg.cfg[MachineChoice.HOST]" and
> it did pretty much the same changes in a bit more than 2 minutes.
> Including the time to write the prompt, it's almost certainly more
> than it took me to do it by hand, but this time I was doing something
> else in the meanwhile. :)

When we talk about "limited / non-creative refactoring", my
interpretation would be that it conceptually applies to changes which
could be described as an algorithmic transformation. This prompt and
the resulting code feel like more than that. The prompt is expressing
a creative change, and while the result includes some algorithmic
refactoring, it includes other stuff too.

Describing a policy that allows your meson example, in a way that
will be interpreted in a reasonably consistent way by contributors,
looks like a challenge to me.

On the flip side, you might have written the new property / getter
method manually and asked the agent to finish the conversion, and
that would have been acceptable. This is a can of worms to express in
a policy.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Sep 22, 2025 at 6:37 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Mon, Sep 22, 2025 at 05:10:24PM +0200, Paolo Bonzini wrote:
> > I have no QEMU example at hand, but let's look at a commit like
> > https://github.com/bonzini/meson/commit/09765594d. Something like
> > this could plausibly be created with AI. What I care about is:
>
> I'd agree it is something AI could likely come up with, given the
> right prompt, but in terms of defining policy it conceptually
> feels more like new functionality, mixed in with refactoring.
> [...]
> you wouldn't express the new class-level properties, or the new
> get_or_create_cfg code, as an algorithmic refactoring. Those
> are a case of creative coding.

Yes, I agree. Those are creative, and obviously not part of what the
LLM can produce with a pure "refactoring prompt". In that commit,
clearly, I hadn't made a strong attempt at splitting out new
functionality and refactoring; I might even do that now. :)

> When we talk about "limited / non-creative refactoring", my
> interpretation would be that it conceptually applies to changes
> which could be described as an algorithmic transformation. This
> prompt and the resulting code feel like more than that. The prompt
> is expressing a creative change, and while the result includes some
> algorithmic refactoring, it includes other stuff too.
>
> Describing a policy that allows your meson example, in a way that
> will be interpreted in a reasonably consistent way by contributors,
> looks like a challenge to me.

I agree with your reasoning that the commit goes beyond the "no
creative change" line, or at least parts of it do. Inadvertently,
this is also an example of how the policy helps AI users follow our
existing contribution standards.

> On the flip side, you might have written the new property / getter
> method manually and asked the agent to finish the conversion, and
> that would have been acceptable. This is a can of worms to express
> in a policy.

Yes, a better approach would have been to change the initializer and
ask AI to do the mechanical parts. Something like, in a commit
message:

  Note: after changing the initializer, the bulk of the changes were
  done with the following prompt: "finish this conversion - i want to
  track the PackageConfiguration object per machine, with pkg.cfg
  becoming pkg.cfg[MachineChoice.HOST]".

Still, putting the two together follows the exception text
encouraging "to not mix AI- and human-written code in the same
commit, *as much as possible*".

Again, this is just an example, and in practice the amount of
non-creative refactoring would be much larger than the rest.

Paolo
On Mon, Sep 22, 2025 at 1:47 PM Peter Maydell <peter.maydell@linaro.org> wrote:
> > On the other hand, AI tools can be used as a natural language refactoring
> > engine for simple tasks such as modifying all callers of a given function
> > or even less simple ones such as adding Python type annotations.
> > These tasks have a very low risk of introducing training material in
> > the code base, and can provide noticeable time savings because they are
> > easily tested and reviewed; for lack of a better term, I will call
> > these "tasks with limited or non-existing creative content".
>
> Does anybody know how to demonstrate "limited or non-existing
> creative content", which I assume is a stand-in here for
> "not copyrightable"?

The way *I* would demonstrate it is "there is exactly (or pretty
much) one way to do this change". Any way to do that change (sed,
Coccinelle, AI or by hand) would result in the same modification to
the code, with no real freedom to pick an algorithm, a data
structure, or even a way to organize the code.

I wouldn't say however that this is equivalent to non-copyrightable.
It's more that the creativity lies in "deciding to do it" rather than
in "coming up with the code to do it". This is also why I mention
having prompts in the commit message; the prompt tells you whether
the AI is making design decisions or just executing a mechanical
transformation.

There's still a substantial amount of grey, and I'm okay with
treating anything grey as a "no". If something like "convert this
script from bash to Python" comes up, I'd not try to claim it as
"limited creative content". It may be a boring task with limited
variability in output; but it's still creative and has substantially
more copyright infringement risk.

Paolo
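As a self-contained sketch of the "exactly one way to do this change"
test (made-up code, loosely modelled on the pkg.cfg example discussed
above): whichever mechanism performs a mechanical transformation, the
result is byte-for-byte the same.

    import re

    # Made-up "before" snippet.
    before = "state = pkg.cfg\nother = pkg.cfg\n"

    # Route A: a one-line mechanical substitution.
    route_a = re.sub(r"\bpkg\.cfg\b", "pkg.cfg[MachineChoice.HOST]", before)

    # Route B: the same edit written out by hand.
    route_b = ("state = pkg.cfg[MachineChoice.HOST]\n"
               "other = pkg.cfg[MachineChoice.HOST]\n")

    # No freedom to pick an algorithm or a layout: the routes converge.
    assert route_a == route_b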