[PATCH v3 3/3] docs: define policy forbidding use of AI code generators

Markus Armbruster posted 3 patches 6 months, 2 weeks ago
Maintainers: "Alex Bennée" <alex.bennee@linaro.org>, "Daniel P. Berrangé" <berrange@redhat.com>, Thomas Huth <thuth@redhat.com>, Markus Armbruster <armbru@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>
There is a newer version of this series
[PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
From: Daniel P. Berrangé <berrange@redhat.com>

There has been an explosion of interest in so-called AI code
generators. Thus far, though, this has not been matched by a broadly
accepted legal interpretation of the licensing implications for code
generator outputs. While the vendors may claim there is no problem and
a free choice of license is possible, they have an inherent conflict
of interest in promoting this interpretation. More broadly there is,
as yet, no consensus on the licensing implications of code
generators trained on inputs under a wide variety of licenses.

The DCO requires contributors to assert they have the right to
contribute under the designated project license. Given the lack of
consensus on the licensing of AI code generator output, it is not
considered credible to assert compliance with the DCO clause (b) or (c)
where a patch includes such generated code.

This patch thus defines a policy that the QEMU project will currently
not accept contributions where use of AI code generators is either
known or suspected.

These are early days of AI-assisted software development. The legal
questions will be resolved eventually. The tools will mature, and we
can expect some to become safely usable in free software projects.
The policy we set now must be for today, and be open to revision. It's
best to start strict and safe, then relax.

Meanwhile, requests for exceptions can also be considered on a
case-by-case basis.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@gmail.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 docs/devel/code-provenance.rst | 50 +++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
index c27d8fe649..261263cfba 100644
--- a/docs/devel/code-provenance.rst
+++ b/docs/devel/code-provenance.rst
@@ -270,4 +270,52 @@ boilerplate code template which is then filled in to produce the final patch.
 The output of such a tool would still be considered the "preferred format",
 since it is intended to be a foundation for further human authored changes.
 Such tools are acceptable to use, provided they follow a deterministic process
-and there is clearly defined copyright and licensing for their output.
+and there is clearly defined copyright and licensing for their output. Note
+in particular the caveats applying to AI code generators below.
+
+Use of AI code generators
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TL;DR:
+
+  **Current QEMU project policy is to DECLINE any contributions which are
+  believed to include or derive from AI generated code. This includes ChatGPT,
+  CoPilot, Llama and similar tools**
+
+The increasing prevalence of AI code generators, most notably but not limited
+to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
+(LLMs) results in a number of difficult legal questions and risks for software
+projects, including QEMU.
+
+The QEMU community requires that contributors certify their patch submissions
+are made in accordance with the rules of the dco_ (DCO).
+
+To satisfy the DCO, the patch contributor has to fully understand the
+copyright and license status of code they are contributing to QEMU. With AI
+code generators, the copyright and license status of the output is ill-defined
+with no generally accepted, settled legal foundation.
+
+Where the training material is known, it is common for it to include large
+volumes of material under restrictive licensing/copyright terms. Even where
+the training material is all known to be under open source licenses, it is
+likely to be under a variety of terms, not all of which will be compatible
+with QEMU's licensing requirements.
+
+How contributors could comply with DCO terms (b) or (c) for the output of AI
+code generators commonly available today is unclear.  The QEMU project is not
+willing or able to accept the legal risks of non-compliance.
+
+The QEMU project thus requires that contributors refrain from using AI code
+generators on patches intended to be submitted to the project, and will
+decline any contribution if use of AI is either known or suspected.
+
+Examples of tools impacted by this policy includes both GitHub's CoPilot,
+OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
+well known.
+
+This policy may evolve as AI tools mature and the legal situation is
+clarified. In the meantime, requests for exceptions to this policy will be
+evaluated by the QEMU project on a case by case basis. To be granted an
+exception, a contributor will need to demonstrate clarity of the license and
+copyright status for the tool's output in relation to its training model and
+code, to the satisfaction of the project maintainers.
-- 
2.48.1


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Stefan Hajnoczi 6 months, 2 weeks ago
On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> From: Daniel P. Berrangé <berrange@redhat.com>
>
> There has been an explosion of interest in so called AI code
> generators. Thus far though, this is has not been matched by a broadly
> accepted legal interpretation of the licensing implications for code
> generator outputs. While the vendors may claim there is no problem and
> a free choice of license is possible, they have an inherent conflict
> of interest in promoting this interpretation. More broadly there is,
> as yet, no broad consensus on the licensing implications of code
> generators trained on inputs under a wide variety of licenses
>
> The DCO requires contributors to assert they have the right to
> contribute under the designated project license. Given the lack of
> consensus on the licensing of AI code generator output, it is not
> considered credible to assert compliance with the DCO clause (b) or (c)
> where a patch includes such generated code.
>
> This patch thus defines a policy that the QEMU project will currently
> not accept contributions where use of AI code generators is either
> known, or suspected.
>
> These are early days of AI-assisted software development. The legal
> questions will be resolved eventually. The tools will mature, and we
> can expect some to become safely usable in free software projects.
> The policy we set now must be for today, and be open to revision. It's
> best to start strict and safe, then relax.
>
> Meanwhile requests for exceptions can also be considered on a case by
> case basis.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> Acked-by: Stefan Hajnoczi <stefanha@gmail.com>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>  docs/devel/code-provenance.rst | 50 +++++++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> index c27d8fe649..261263cfba 100644
> --- a/docs/devel/code-provenance.rst
> +++ b/docs/devel/code-provenance.rst
> @@ -270,4 +270,52 @@ boilerplate code template which is then filled in to produce the final patch.
>  The output of such a tool would still be considered the "preferred format",
>  since it is intended to be a foundation for further human authored changes.
>  Such tools are acceptable to use, provided they follow a deterministic process
> -and there is clearly defined copyright and licensing for their output.
> +and there is clearly defined copyright and licensing for their output. Note
> +in particular the caveats applying to AI code generators below.
> +
> +Use of AI code generators
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +TL;DR:
> +
> +  **Current QEMU project policy is to DECLINE any contributions which are
> +  believed to include or derive from AI generated code. This includes ChatGPT,
> +  CoPilot, Llama and similar tools**

GitHub spells it "Copilot".

Claude is very popular for coding at the moment and probably worth mentioning.

> +
> +The increasing prevalence of AI code generators, most notably but not limited

More detail is needed on what an "AI code generator" is. Coding
assistant tools range from autocompletion to linters to automatic code
generators. In addition there are other AI-related tools like ChatGPT
or Gemini as a chatbot that people can use like Stack Overflow or an
API documentation summarizer.

I think the intent is to say: do not put code that comes from _any_ AI
tool into QEMU.

It would be okay to use AI to research APIs, algorithms, brainstorm
ideas, debug the code, analyze the code, etc but the actual code
changes must not be generated by AI.

> +to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
> +(LLMs) results in a number of difficult legal questions and risks for software
> +projects, including QEMU.
> +
> +The QEMU community requires that contributors certify their patch submissions
> +are made in accordance with the rules of the dco_ (DCO).
> +
> +To satisfy the DCO, the patch contributor has to fully understand the
> +copyright and license status of code they are contributing to QEMU. With AI
> +code generators, the copyright and license status of the output is ill-defined
> +with no generally accepted, settled legal foundation.
> +
> +Where the training material is known, it is common for it to include large
> +volumes of material under restrictive licensing/copyright terms. Even where
> +the training material is all known to be under open source licenses, it is
> +likely to be under a variety of terms, not all of which will be compatible
> +with QEMU's licensing requirements.
> +
> +How contributors could comply with DCO terms (b) or (c) for the output of AI
> +code generators commonly available today is unclear.  The QEMU project is not
> +willing or able to accept the legal risks of non-compliance.
> +
> +The QEMU project thus requires that contributors refrain from using AI code
> +generators on patches intended to be submitted to the project, and will
> +decline any contribution if use of AI is either known or suspected.
> +
> +Examples of tools impacted by this policy includes both GitHub's CoPilot,

Copilot

> +OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
> +well known.
> +
> +This policy may evolve as AI tools mature and the legal situation is
> +clarifed. In the meanwhile, requests for exceptions to this policy will be
> +evaluated by the QEMU project on a case by case basis. To be granted an
> +exception, a contributor will need to demonstrate clarity of the license and
> +copyright status for the tool's output in relation to its training model and
> +code, to the satisfaction of the project maintainers.
> --
> 2.48.1
>
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Daniel P. Berrangé 6 months, 2 weeks ago
On Tue, Jun 03, 2025 at 02:25:42PM -0400, Stefan Hajnoczi wrote:
> On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
> >
> > From: Daniel P. Berrangé <berrange@redhat.com>
> >
> > There has been an explosion of interest in so called AI code
> > generators. Thus far though, this is has not been matched by a broadly
> > accepted legal interpretation of the licensing implications for code
> > generator outputs. While the vendors may claim there is no problem and
> > a free choice of license is possible, they have an inherent conflict
> > of interest in promoting this interpretation. More broadly there is,
> > as yet, no broad consensus on the licensing implications of code
> > generators trained on inputs under a wide variety of licenses
> >
> > The DCO requires contributors to assert they have the right to
> > contribute under the designated project license. Given the lack of
> > consensus on the licensing of AI code generator output, it is not
> > considered credible to assert compliance with the DCO clause (b) or (c)
> > where a patch includes such generated code.
> >
> > This patch thus defines a policy that the QEMU project will currently
> > not accept contributions where use of AI code generators is either
> > known, or suspected.
> >
> > These are early days of AI-assisted software development. The legal
> > questions will be resolved eventually. The tools will mature, and we
> > can expect some to become safely usable in free software projects.
> > The policy we set now must be for today, and be open to revision. It's
> > best to start strict and safe, then relax.
> >
> > Meanwhile requests for exceptions can also be considered on a case by
> > case basis.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > Acked-by: Stefan Hajnoczi <stefanha@gmail.com>
> > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> > Signed-off-by: Markus Armbruster <armbru@redhat.com>
> > ---
> >  docs/devel/code-provenance.rst | 50 +++++++++++++++++++++++++++++++++-
> >  1 file changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > index c27d8fe649..261263cfba 100644
> > --- a/docs/devel/code-provenance.rst
> > +++ b/docs/devel/code-provenance.rst
> > @@ -270,4 +270,52 @@ boilerplate code template which is then filled in to produce the final patch.
> >  The output of such a tool would still be considered the "preferred format",
> >  since it is intended to be a foundation for further human authored changes.
> >  Such tools are acceptable to use, provided they follow a deterministic process
> > -and there is clearly defined copyright and licensing for their output.
> > +and there is clearly defined copyright and licensing for their output. Note
> > +in particular the caveats applying to AI code generators below.
> > +
> > +Use of AI code generators
> > +~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +TL;DR:
> > +
> > +  **Current QEMU project policy is to DECLINE any contributions which are
> > +  believed to include or derive from AI generated code. This includes ChatGPT,
> > +  CoPilot, Llama and similar tools**
> 
> GitHub spells it "Copilot".
> 
> Claude is very popular for coding at the moment and probably worth mentioning.
> 
> > +
> > +The increasing prevalence of AI code generators, most notably but not limited
> 
> More detail is needed on what an "AI code generator" is. Coding
> assistant tools range from autocompletion to linters to automatic code
> generators. In addition there are other AI-related tools like ChatGPT
> or Gemini as a chatbot that can people use like Stackoverflow or an
> API documentation summarizer.
> 
> I think the intent is to say: do not put code that comes from _any_ AI
> tool into QEMU.

Right, the intent is that any copyrightable portion of a commit must
not have come directly from an AI/LLM tool, or from an agent which
indirectly/internally uses an AI/LLM tool.

"code generator" is possibly a little overly specific, as this is really
about any type of tool which emits content that will make its way into
qemu.git, whether code or non-code content (docs, images, etc).

> It would be okay to use AI to research APIs, algorithms, brainstorm
> ideas, debug the code, analyze the code, etc but the actual code
> changes must not be generated by AI.

Mostly yes - there's a fuzzy boundary in the debug/analyze use cases,
if the tool is also suggesting code changes to fix issues.

If the scope of the suggested changes meets the threshold for being
(likely) copyrightable code, that would fall under the policy.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
Stefan Hajnoczi <stefanha@gmail.com> writes:

> On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> From: Daniel P. Berrangé <berrange@redhat.com>
>>
>> There has been an explosion of interest in so called AI code
>> generators. Thus far though, this is has not been matched by a broadly
>> accepted legal interpretation of the licensing implications for code
>> generator outputs. While the vendors may claim there is no problem and
>> a free choice of license is possible, they have an inherent conflict
>> of interest in promoting this interpretation. More broadly there is,
>> as yet, no broad consensus on the licensing implications of code
>> generators trained on inputs under a wide variety of licenses
>>
>> The DCO requires contributors to assert they have the right to
>> contribute under the designated project license. Given the lack of
>> consensus on the licensing of AI code generator output, it is not
>> considered credible to assert compliance with the DCO clause (b) or (c)
>> where a patch includes such generated code.
>>
>> This patch thus defines a policy that the QEMU project will currently
>> not accept contributions where use of AI code generators is either
>> known, or suspected.
>>
>> These are early days of AI-assisted software development. The legal
>> questions will be resolved eventually. The tools will mature, and we
>> can expect some to become safely usable in free software projects.
>> The policy we set now must be for today, and be open to revision. It's
>> best to start strict and safe, then relax.
>>
>> Meanwhile requests for exceptions can also be considered on a case by
>> case basis.
>>
>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>> Acked-by: Stefan Hajnoczi <stefanha@gmail.com>
>> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>  docs/devel/code-provenance.rst | 50 +++++++++++++++++++++++++++++++++-
>>  1 file changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
>> index c27d8fe649..261263cfba 100644
>> --- a/docs/devel/code-provenance.rst
>> +++ b/docs/devel/code-provenance.rst
>> @@ -270,4 +270,52 @@ boilerplate code template which is then filled in to produce the final patch.
>>  The output of such a tool would still be considered the "preferred format",
>>  since it is intended to be a foundation for further human authored changes.
>>  Such tools are acceptable to use, provided they follow a deterministic process
>> -and there is clearly defined copyright and licensing for their output.
>> +and there is clearly defined copyright and licensing for their output. Note
>> +in particular the caveats applying to AI code generators below.
>> +
>> +Use of AI code generators
>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +TL;DR:
>> +
>> +  **Current QEMU project policy is to DECLINE any contributions which are
>> +  believed to include or derive from AI generated code. This includes ChatGPT,
>> +  CoPilot, Llama and similar tools**
>
> GitHub spells it "Copilot".

I'll fix it.

> Claude is very popular for coding at the moment and probably worth mentioning.

Will do.

>> +
>> +The increasing prevalence of AI code generators, most notably but not limited
>
> More detail is needed on what an "AI code generator" is. Coding
> assistant tools range from autocompletion to linters to automatic code
> generators. In addition there are other AI-related tools like ChatGPT
> or Gemini as a chatbot that can people use like Stackoverflow or an
> API documentation summarizer.
>
> I think the intent is to say: do not put code that comes from _any_ AI
> tool into QEMU.
>
> It would be okay to use AI to research APIs, algorithms, brainstorm
> ideas, debug the code, analyze the code, etc but the actual code
> changes must not be generated by AI.

The existing text is about "AI code generators".  However, the "most
notably LLMs" that follows it could lead readers to believe it's about
more than just code generation, because LLMs are in fact used for more.
I figure this is your concern.

We could instead start wide, then narrow the focus to code generation.
Here's my try:

  The increasing prevalence of AI-assisted software development results
  in a number of difficult legal questions and risks for software
  projects, including QEMU.  Of particular concern is code generated by
  `Large Language Models
  <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).

If we want to mention uses of AI we consider okay, I'd do so further
down, to not distract from the main point here.  Perhaps:

  The QEMU project thus requires that contributors refrain from using AI code
  generators on patches intended to be submitted to the project, and will
  decline any contribution if use of AI is either known or suspected.

  This policy does not apply to other uses of AI, such as researching APIs or
  algorithms, static analysis, or debugging.

  Examples of tools impacted by this policy includes both GitHub's CoPilot,
  OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
  well known.

The paragraph in the middle is new, the other two are unchanged.

Thoughts?

>> +to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
>> +(LLMs) results in a number of difficult legal questions and risks for software
>> +projects, including QEMU.

Thanks!

[...]
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Daniel P. Berrangé 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
> Stefan Hajnoczi <stefanha@gmail.com> writes:
> 
> > On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
> >>
> >> From: Daniel P. Berrangé <berrange@redhat.com>
 >> +
> >> +The increasing prevalence of AI code generators, most notably but not limited
> >
> > More detail is needed on what an "AI code generator" is. Coding
> > assistant tools range from autocompletion to linters to automatic code
> > generators. In addition there are other AI-related tools like ChatGPT
> > or Gemini as a chatbot that can people use like Stackoverflow or an
> > API documentation summarizer.
> >
> > I think the intent is to say: do not put code that comes from _any_ AI
> > tool into QEMU.
> >
> > It would be okay to use AI to research APIs, algorithms, brainstorm
> > ideas, debug the code, analyze the code, etc but the actual code
> > changes must not be generated by AI.

The scope of the policy is around contributions we receive as
patches with SoB. Researching / brainstorming / analysis etc
are not contribution activities, so not covered by the policy
IMHO.

> 
> The existing text is about "AI code generators".  However, the "most
> notably LLMs" that follows it could lead readers to believe it's about
> more than just code generation, because LLMs are in fact used for more.
> I figure this is your concern.
> 
> We could instead start wide, then narrow the focus to code generation.
> Here's my try:
> 
>   The increasing prevalence of AI-assisted software development results
>   in a number of difficult legal questions and risks for software
>   projects, including QEMU.  Of particular concern is code generated by
>   `Large Language Models
>   <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).

Documentation we maintain has the same concerns as code.
So I'd suggest to substitute 'code' with 'code / content'.

> If we want to mention uses of AI we consider okay, I'd do so further
> down, to not distract from the main point here.  Perhaps:
> 
>   The QEMU project thus requires that contributors refrain from using AI code
>   generators on patches intended to be submitted to the project, and will
>   decline any contribution if use of AI is either known or suspected.
> 
>   This policy does not apply to other uses of AI, such as researching APIs or
>   algorithms, static analysis, or debugging.
> 
>   Examples of tools impacted by this policy includes both GitHub's CoPilot,
>   OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
>   well known.
> 
> The paragraph in the middle is new, the other two are unchanged.
> 
> Thoughts?

IMHO it's redundant, as the policy is expressly about contribution of
code/content, and those activities are not contribution-related, so
outside the scope already.

> 
> >> +to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
> >> +(LLMs) results in a number of difficult legal questions and risks for software
> >> +projects, including QEMU.
> 
> Thanks!
> 
> [...]
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
>> Stefan Hajnoczi <stefanha@gmail.com> writes:
>> 
>> > On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>> >>
>> >> From: Daniel P. Berrangé <berrange@redhat.com>
>  >> +
>> >> +The increasing prevalence of AI code generators, most notably but not limited
>> >
>> > More detail is needed on what an "AI code generator" is. Coding
>> > assistant tools range from autocompletion to linters to automatic code
>> > generators. In addition there are other AI-related tools like ChatGPT
>> > or Gemini as a chatbot that can people use like Stackoverflow or an
>> > API documentation summarizer.
>> >
>> > I think the intent is to say: do not put code that comes from _any_ AI
>> > tool into QEMU.
>> >
>> > It would be okay to use AI to research APIs, algorithms, brainstorm
>> > ideas, debug the code, analyze the code, etc but the actual code
>> > changes must not be generated by AI.
>
> The scope of the policy is around contributions we receive as
> patches with SoB. Researching / brainstorming / analysis etc
> are not contribution activities, so not covered by the policy
> IMHO.

Yes.  More below.

>> The existing text is about "AI code generators".  However, the "most
>> notably LLMs" that follows it could lead readers to believe it's about
>> more than just code generation, because LLMs are in fact used for more.
>> I figure this is your concern.
>> 
>> We could instead start wide, then narrow the focus to code generation.
>> Here's my try:
>> 
>>   The increasing prevalence of AI-assisted software development results
>>   in a number of difficult legal questions and risks for software
>>   projects, including QEMU.  Of particular concern is code generated by
>>   `Large Language Models
>>   <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
>
> Documentation we maintain has the same concerns as code.
> So I'd suggest to substitute 'code' with 'code / content'.

Makes sense, thanks!

>> If we want to mention uses of AI we consider okay, I'd do so further
>> down, to not distract from the main point here.  Perhaps:
>> 
>>   The QEMU project thus requires that contributors refrain from using AI code
>>   generators on patches intended to be submitted to the project, and will
>>   decline any contribution if use of AI is either known or suspected.
>> 
>>   This policy does not apply to other uses of AI, such as researching APIs or
>>   algorithms, static analysis, or debugging.
>> 
>>   Examples of tools impacted by this policy includes both GitHub's CoPilot,
>>   OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
>>   well known.
>> 
>> The paragraph in the middle is new, the other two are unchanged.
>> 
>> Thoughts?
>
> IMHO its redundant, as the policy is expressly around contribution of
> code/content, and those activities as not contribution related, so
> outside the scope already.

The very first paragraph in this file already set the scope: "provenance
of patch submissions [...] to the project", so you have a point here.
But does repeating the scope here hurt or help?

>> >> +to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
>> >> +(LLMs) results in a number of difficult legal questions and risks for software
>> >> +projects, including QEMU.
>> 
>> Thanks!
>> 
>> [...]
>> 
>
> With regards,
> Daniel
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Daniel P. Berrangé 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 10:58:38AM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
> >> Stefan Hajnoczi <stefanha@gmail.com> writes:
> >> 
> >> > On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
> >> >>
> >> >> From: Daniel P. Berrangé <berrange@redhat.com>
> >  >> +
> >> >> +The increasing prevalence of AI code generators, most notably but not limited
> >> >
> >> > More detail is needed on what an "AI code generator" is. Coding
> >> > assistant tools range from autocompletion to linters to automatic code
> >> > generators. In addition there are other AI-related tools like ChatGPT
> >> > or Gemini as a chatbot that can people use like Stackoverflow or an
> >> > API documentation summarizer.
> >> >
> >> > I think the intent is to say: do not put code that comes from _any_ AI
> >> > tool into QEMU.
> >> >
> >> > It would be okay to use AI to research APIs, algorithms, brainstorm
> >> > ideas, debug the code, analyze the code, etc but the actual code
> >> > changes must not be generated by AI.
> >
> > The scope of the policy is around contributions we receive as
> > patches with SoB. Researching / brainstorming / analysis etc
> > are not contribution activities, so not covered by the policy
> > IMHO.
> 
> Yes.  More below.
> 
> >> The existing text is about "AI code generators".  However, the "most
> >> notably LLMs" that follows it could lead readers to believe it's about
> >> more than just code generation, because LLMs are in fact used for more.
> >> I figure this is your concern.
> >> 
> >> We could instead start wide, then narrow the focus to code generation.
> >> Here's my try:
> >> 
> >>   The increasing prevalence of AI-assisted software development results
> >>   in a number of difficult legal questions and risks for software
> >>   projects, including QEMU.  Of particular concern is code generated by
> >>   `Large Language Models
> >>   <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
> >
> > Documentation we maintain has the same concerns as code.
> > So I'd suggest to substitute 'code' with 'code / content'.
> 
> Makes sense, thanks!
> 
> >> If we want to mention uses of AI we consider okay, I'd do so further
> >> down, to not distract from the main point here.  Perhaps:
> >> 
> >>   The QEMU project thus requires that contributors refrain from using AI code
> >>   generators on patches intended to be submitted to the project, and will
> >>   decline any contribution if use of AI is either known or suspected.
> >> 
> >>   This policy does not apply to other uses of AI, such as researching APIs or
> >>   algorithms, static analysis, or debugging.
> >> 
> >>   Examples of tools impacted by this policy includes both GitHub's CoPilot,
> >>   OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
> >>   well known.
> >> 
> >> The paragraph in the middle is new, the other two are unchanged.
> >> 
> >> Thoughts?
> >
> > IMHO it's redundant, as the policy is expressly around contribution of
> > code/content, and those activities are not contribution related, so
> > outside the scope already.
> 
> The very first paragraph in this file already set the scope: "provenance
> of patch submissions [...] to the project", so you have a point here.
> But does repeating the scope here hurt or help?

I guess it probably doesn't hurt to have it. Perhaps tweak to

 This policy does not apply to other uses of AI, such as researching APIs or
 algorithms, static analysis, or debugging, provided their output is not
 to be included in contributions.

and for the last paragraph remove 'both' and add a trailer

   Examples of tools impacted by this policy include GitHub's CoPilot,
   OpenAI's ChatGPT, and Meta's Code Llama (amongst many others which are less
   well known), and code/content generation agents which are built on top of
   such tools.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Jun 04, 2025 at 10:58:38AM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
>> >> Stefan Hajnoczi <stefanha@gmail.com> writes:
>> >> 
>> >> > On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>> >> >>
>> >> >> From: Daniel P. Berrangé <berrange@redhat.com>
>> >  >> +
>> >> >> +The increasing prevalence of AI code generators, most notably but not limited
>> >> >
>> >> > More detail is needed on what an "AI code generator" is. Coding
>> >> > assistant tools range from autocompletion to linters to automatic code
>> >> > generators. In addition there are other AI-related tools like ChatGPT
> >> >> > or Gemini as a chatbot that people can use like Stackoverflow or an
>> >> > API documentation summarizer.
>> >> >
>> >> > I think the intent is to say: do not put code that comes from _any_ AI
>> >> > tool into QEMU.
>> >> >
>> >> > It would be okay to use AI to research APIs, algorithms, brainstorm
>> >> > ideas, debug the code, analyze the code, etc but the actual code
>> >> > changes must not be generated by AI.
>> >
>> > The scope of the policy is around contributions we receive as
>> > patches with SoB. Researching / brainstorming / analysis etc
>> > are not contribution activities, so not covered by the policy
>> > IMHO.
>> 
>> Yes.  More below.
>> 
>> >> The existing text is about "AI code generators".  However, the "most
>> >> notably LLMs" that follows it could lead readers to believe it's about
>> >> more than just code generation, because LLMs are in fact used for more.
>> >> I figure this is your concern.
>> >> 
>> >> We could instead start wide, then narrow the focus to code generation.
>> >> Here's my try:
>> >> 
>> >>   The increasing prevalence of AI-assisted software development results
>> >>   in a number of difficult legal questions and risks for software
>> >>   projects, including QEMU.  Of particular concern is code generated by
>> >>   `Large Language Models
>> >>   <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
>> >
>> > Documentation we maintain has the same concerns as code.
>> > So I'd suggest substituting 'code' with 'code / content'.
>> 
>> Makes sense, thanks!
>> 
>> >> If we want to mention uses of AI we consider okay, I'd do so further
>> >> down, to not distract from the main point here.  Perhaps:
>> >> 
>> >>   The QEMU project thus requires that contributors refrain from using AI code
>> >>   generators on patches intended to be submitted to the project, and will
>> >>   decline any contribution if use of AI is either known or suspected.
>> >> 
>> >>   This policy does not apply to other uses of AI, such as researching APIs or
>> >>   algorithms, static analysis, or debugging.
>> >> 
>> >>   Examples of tools impacted by this policy includes both GitHub's CoPilot,
>> >>   OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
>> >>   well known.
>> >> 
>> >> The paragraph in the middle is new, the other two are unchanged.
>> >> 
>> >> Thoughts?
>> >
>> > IMHO it's redundant, as the policy is expressly around contribution of
>> > code/content, and those activities are not contribution related, so
>> > outside the scope already.
>> 
>> The very first paragraph in this file already set the scope: "provenance
>> of patch submissions [...] to the project", so you have a point here.
>> But does repeating the scope here hurt or help?
>
> I guess it probably doesn't hurt to have it. Perhaps tweak to
>
>  This policy does not apply to other uses of AI, such as researching APIs or
>  algorithms, static analysis, or debugging, provided their output is not
>  to be included in contributions.
>
> and for the last paragraph remove 'both' and add a trailer
>
>    Examples of tools impacted by this policy include GitHub's CoPilot,
>    OpenAI's ChatGPT, and Meta's Code Llama (amongst many others which are less
>    well known), and code/content generation agents which are built on top of
>    such tools.

Sold!
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Philippe Mathieu-Daudé 6 months, 2 weeks ago
On 4/6/25 09:15, Daniel P. Berrangé wrote:
> On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
>> Stefan Hajnoczi <stefanha@gmail.com> writes:
>>
>>> On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>>>>
>>>> From: Daniel P. Berrangé <berrange@redhat.com>
>   >> +
>>>> +The increasing prevalence of AI code generators, most notably but not limited
>>>
>>> More detail is needed on what an "AI code generator" is. Coding
>>> assistant tools range from autocompletion to linters to automatic code
>>> generators. In addition there are other AI-related tools like ChatGPT
>>> or Gemini as a chatbot that people can use like Stackoverflow or an
>>> API documentation summarizer.
>>>
>>> I think the intent is to say: do not put code that comes from _any_ AI
>>> tool into QEMU.
>>>
>>> It would be okay to use AI to research APIs, algorithms, brainstorm
>>> ideas, debug the code, analyze the code, etc but the actual code
>>> changes must not be generated by AI.
> 
> The scope of the policy is around contributions we receive as
> patches with SoB. Researching / brainstorming / analysis etc
> are not contribution activities, so not covered by the policy
> IMHO.
> 
>>
>> The existing text is about "AI code generators".  However, the "most
>> notably LLMs" that follows it could lead readers to believe it's about
>> more than just code generation, because LLMs are in fact used for more.
>> I figure this is your concern.
>>
>> We could instead start wide, then narrow the focus to code generation.
>> Here's my try:
>>
>>    The increasing prevalence of AI-assisted software development results
>>    in a number of difficult legal questions and risks for software
>>    projects, including QEMU.  Of particular concern is code generated by
>>    `Large Language Models
>>    <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
> 
> Documentation we maintain has the same concerns as code.
> So I'd suggest substituting 'code' with 'code / content'.

Why couldn't we accept documentation patches improved using LLM?

As a non-native English speaker who is often stuck trying to describe
function APIs, I'm very tempted to use an LLM to review my sentences
and make them more understandable.

>> If we want to mention uses of AI we consider okay, I'd do so further
>> down, to not distract from the main point here.  Perhaps:
>>
>>    The QEMU project thus requires that contributors refrain from using AI code
>>    generators on patches intended to be submitted to the project, and will
>>    decline any contribution if use of AI is either known or suspected.
>>
>>    This policy does not apply to other uses of AI, such as researching APIs or
>>    algorithms, static analysis, or debugging.
>>
>>    Examples of tools impacted by this policy includes both GitHub's CoPilot,
>>    OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
>>    well known.
>>
>> The paragraph in the middle is new, the other two are unchanged.
>>
>> Thoughts?
> 
> IMHO it's redundant, as the policy is expressly around contribution of
> code/content, and those activities are not contribution related, so
> outside the scope already.
> 
>>
>>>> +to, `Large Language Models <https://en.wikipedia.org/wiki/Large_language_model>`__
>>>> +(LLMs) results in a number of difficult legal questions and risks for software
>>>> +projects, including QEMU.
>>
>> Thanks!
>>
>> [...]
>>
> 
> With regards,
> Daniel


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> On 4/6/25 09:15, Daniel P. Berrangé wrote:
>> On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
>>> Stefan Hajnoczi <stefanha@gmail.com> writes:
>>>
>>>> On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>>>>>
>>>>> From: Daniel P. Berrangé <berrange@redhat.com>
>>>>> +The increasing prevalence of AI code generators, most notably but not limited
>>>>
>>>> More detail is needed on what an "AI code generator" is. Coding
>>>> assistant tools range from autocompletion to linters to automatic code
>>>> generators. In addition there are other AI-related tools like ChatGPT
>>>> or Gemini as a chatbot that people can use like Stackoverflow or an
>>>> API documentation summarizer.
>>>>
>>>> I think the intent is to say: do not put code that comes from _any_ AI
>>>> tool into QEMU.
>>>>
>>>> It would be okay to use AI to research APIs, algorithms, brainstorm
>>>> ideas, debug the code, analyze the code, etc but the actual code
>>>> changes must not be generated by AI.
>> 
>> The scope of the policy is around contributions we receive as
>> patches with SoB. Researching / brainstorming / analysis etc
>> are not contribution activities, so not covered by the policy
>> IMHO.
>> 
>>>
>>> The existing text is about "AI code generators".  However, the "most
>>> notably LLMs" that follows it could lead readers to believe it's about
>>> more than just code generation, because LLMs are in fact used for more.
>>> I figure this is your concern.
>>>
>>> We could instead start wide, then narrow the focus to code generation.
>>> Here's my try:
>>>
>>>    The increasing prevalence of AI-assisted software development results
>>>    in a number of difficult legal questions and risks for software
>>>    projects, including QEMU.  Of particular concern is code generated by
>>>    `Large Language Models
>>>    <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
>> 
>> Documentation we maintain has the same concerns as code.
>> So I'd suggest substituting 'code' with 'code / content'.
>
> Why couldn't we accept documentation patches improved using LLM?
>
> As a non-native English speaker who is often stuck trying to describe
> function APIs, I'm very tempted to use an LLM to review my sentences
> and make them more understandable.

I understand the temptation!  Unfortunately, the "legal questions and
risks" Daniel described apply to *any* kind of copyrightable material,
not just to code.

Quote:

    To satisfy the DCO, the patch contributor has to fully understand the
    copyright and license status of code they are contributing to QEMU. With AI
    code generators, the copyright and license status of the output is ill-defined
    with no generally accepted, settled legal foundation.

    Where the training material is known, it is common for it to include large
    volumes of material under restrictive licensing/copyright terms. Even where
    the training material is all known to be under open source licenses, it is
    likely to be under a variety of terms, not all of which will be compatible
    with QEMU's licensing requirements.

    How contributors could comply with DCO terms (b) or (c) for the output of AI
    code generators commonly available today is unclear.  The QEMU project is not
    willing or able to accept the legal risks of non-compliance.

[...]
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Daniel P. Berrangé 6 months, 2 weeks ago
On Wed, Jun 04, 2025 at 09:54:33AM +0200, Philippe Mathieu-Daudé wrote:
> On 4/6/25 09:15, Daniel P. Berrangé wrote:
> > On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
> > > Stefan Hajnoczi <stefanha@gmail.com> writes:
> > > 
> > > > On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
> > > > > 
> > > > > From: Daniel P. Berrangé <berrange@redhat.com>
> >   >> +
> > > > > +The increasing prevalence of AI code generators, most notably but not limited
> > > > 
> > > > More detail is needed on what an "AI code generator" is. Coding
> > > > assistant tools range from autocompletion to linters to automatic code
> > > > generators. In addition there are other AI-related tools like ChatGPT
> > > > or Gemini as a chatbot that people can use like Stackoverflow or an
> > > > API documentation summarizer.
> > > > 
> > > > I think the intent is to say: do not put code that comes from _any_ AI
> > > > tool into QEMU.
> > > > 
> > > > It would be okay to use AI to research APIs, algorithms, brainstorm
> > > > ideas, debug the code, analyze the code, etc but the actual code
> > > > changes must not be generated by AI.
> > 
> > The scope of the policy is around contributions we receive as
> > patches with SoB. Researching / brainstorming / analysis etc
> > are not contribution activities, so not covered by the policy
> > IMHO.
> > 
> > > 
> > > The existing text is about "AI code generators".  However, the "most
> > > notably LLMs" that follows it could lead readers to believe it's about
> > > more than just code generation, because LLMs are in fact used for more.
> > > I figure this is your concern.
> > > 
> > > We could instead start wide, then narrow the focus to code generation.
> > > Here's my try:
> > > 
> > >    The increasing prevalence of AI-assisted software development results
> > >    in a number of difficult legal questions and risks for software
> > >    projects, including QEMU.  Of particular concern is code generated by
> > >    `Large Language Models
> > >    <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
> > 
> > Documentation we maintain has the same concerns as code.
> > So I'd suggest substituting 'code' with 'code / content'.
> 
> Why couldn't we accept documentation patches improved using LLM?

I would flip it around and ask why documentation would not be held
to the same standard as code when it comes to licensing and legal
compliance?

This is all copyright content that we merge & distribute under the
same QEMU licensing terms, and we have the same legal obligations
whether it is "source code" or "documentation" or other content
that is not traditional "source code" (images for example).


> As a non-native English speaker who is often stuck trying to describe
> function APIs, I'm very tempted to use an LLM to review my sentences
> and make them more understandable.

I can understand that desire, and it is an admittedly tricky situation
and tradeoff for which I don't have a great answer.

As a starting point we (as reviewers/maintainers) must be broadly
very tolerant & accepting of content that is not perfect English,
because we know many (probably even the majority of) contributors
won't have English as their first language.

As a reviewer I don't mind imperfect language in submissions. Even
if the language is not perfect, it is at least a direct expression of
the author's understanding, and thus we can have a level of trust
in the docs based on our community experience with the contributor.

If docs have been altered in any significant manner by an LLM,
even if they are linguistically improved, then IMHO knowing about
that LLM use would reduce my personal trust in the technical
accuracy of the contribution.

This is straying into the debate around the accuracy of LLMs though,
which is interesting, but tangential to the purpose of this policy,
which aims to focus on the code provenance / legal side.



So, back on track, an important point is that this policy (& the
legal concerns/risks it attempts to address) is implicitly
around contributions that can be considered copyrightable.

Some so-called "trivial" work can be so simplistic as to not meet
the threshold for copyright protection, and it is thus easy for the
DCO requirements to be satisfied.


As a person, when you write the API documentation from scratch,
your output would generally be considered a copyrightable
contribution by the author.

When a reviewer then suggests changes to your docs, most of the
time those changes are so trivial, that the reviewer wouldn't be
claiming copyright over the resulting work.

If the reviewer completely rewrites entire sentences in the
docs, though, they would be able to claim copyright over part
of the resulting work.


The tipping point between copyrightable/non-copyrightable is
hard to define in a policy. It is inherently fuzzy, and somewhat
of a "you'll know it when you see it" or "let's debate it in court"
situation...


So back to LLMs.


If you ask the LLM (or an agent using an LLM) to entirely write
the API docs from scratch, I think that should be expected to
fall under this proposed contribution policy in general.


If you write the API docs yourself and ask the LLM to review and
suggest improvements, that MAY or MAY NOT fall under this policy.

If the LLM-suggested tweaks were minor enough not to meet the
threshold to be copyrightable, it would be fine; this is little
different to a human reviewer suggesting tweaks.


If the LLM suggested large-scale rewriting, it would be harder
to draw the line, but that would tend towards falling under this
contribution policy.

So it depends on the scope of what the LLM suggested as a change
to your docs.

IOW, LLM-as-sparkling-auto-correct is probably OK, but
LLM-as-book-editor / LLM-as-ghost-writer is probably NOT OK

This is a scenario where the QEMU contributor has to use their
personal judgement as to whether their use of an LLM in a docs context
is compliant with this policy, or not. I don't think we should try
to describe this in the policy given how fuzzy the situation is.

NB, this copyrightable/non-copyrightable situation applies to source
code too, not just docs.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Philippe Mathieu-Daudé 6 months, 2 weeks ago
On 4/6/25 10:40, Daniel P. Berrangé wrote:
> On Wed, Jun 04, 2025 at 09:54:33AM +0200, Philippe Mathieu-Daudé wrote:
>> On 4/6/25 09:15, Daniel P. Berrangé wrote:
>>> On Wed, Jun 04, 2025 at 08:17:27AM +0200, Markus Armbruster wrote:
>>>> Stefan Hajnoczi <stefanha@gmail.com> writes:
>>>>
>>>>> On Tue, Jun 3, 2025 at 10:25 AM Markus Armbruster <armbru@redhat.com> wrote:
>>>>>>
>>>>>> From: Daniel P. Berrangé <berrange@redhat.com>
>>>    >> +
>>>>>> +The increasing prevalence of AI code generators, most notably but not limited
>>>>>
>>>>> More detail is needed on what an "AI code generator" is. Coding
>>>>> assistant tools range from autocompletion to linters to automatic code
>>>>> generators. In addition there are other AI-related tools like ChatGPT
>>>>> or Gemini as a chatbot that people can use like Stackoverflow or an
>>>>> API documentation summarizer.
>>>>>
>>>>> I think the intent is to say: do not put code that comes from _any_ AI
>>>>> tool into QEMU.
>>>>>
>>>>> It would be okay to use AI to research APIs, algorithms, brainstorm
>>>>> ideas, debug the code, analyze the code, etc but the actual code
>>>>> changes must not be generated by AI.
>>>
>>> The scope of the policy is around contributions we receive as
>>> patches with SoB. Researching / brainstorming / analysis etc
>>> are not contribution activities, so not covered by the policy
>>> IMHO.
>>>
>>>>
>>>> The existing text is about "AI code generators".  However, the "most
>>>> notably LLMs" that follows it could lead readers to believe it's about
>>>> more than just code generation, because LLMs are in fact used for more.
>>>> I figure this is your concern.
>>>>
>>>> We could instead start wide, then narrow the focus to code generation.
>>>> Here's my try:
>>>>
>>>>     The increasing prevalence of AI-assisted software development results
>>>>     in a number of difficult legal questions and risks for software
>>>>     projects, including QEMU.  Of particular concern is code generated by
>>>>     `Large Language Models
>>>>     <https://en.wikipedia.org/wiki/Large_language_model>`__ (LLMs).
>>>
>>> Documentation we maintain has the same concerns as code.
>>> So I'd suggest substituting 'code' with 'code / content'.
>>
>> Why couldn't we accept documentation patches improved using LLM?
> 
> I would flip it around and ask why documentation would not be held
> to the same standard as code when it comes to licensing and legal
> compliance?
> 
> This is all copyright content that we merge & distribute under the
> same QEMU licensing terms, and we have the same legal obligations
> whether it is "source code" or "documentation" or other content
> that is not traditional "source code" (images for example).
> 
> 
>> As a non-native English speaker who is often stuck trying to describe
>> function APIs, I'm very tempted to use an LLM to review my sentences
>> and make them more understandable.
> 
> I can understand that desire, and it is an admittedly tricky situation
> and tradeoff for which I don't have a great answer.
> 
> As a starting point we (as reviewers/maintainers) must be broadly
> very tolerant & accepting of content that is not perfect English,
> because we know many (probably even the majority of) contributors
> won't have English as their first language.
> 
> As a reviewer I don't mind imperfect language in submissions. Even
> if the language is not perfect, it is at least a direct expression of
> the author's understanding, and thus we can have a level of trust
> in the docs based on our community experience with the contributor.
> 
> If docs have been altered in any significant manner by an LLM,
> even if they are linguistically improved, then IMHO knowing about
> that LLM use would reduce my personal trust in the technical
> accuracy of the contribution.
> 
> This is straying into the debate around the accuracy of LLMs though,
> which is interesting, but tangential to the purpose of this policy,
> which aims to focus on the code provenance / legal side.
> 
> 
> 
> So, back on track, an important point is that this policy (& the
> legal concerns/risks it attempts to address) is implicitly
> around contributions that can be considered copyrightable.
> 
> Some so-called "trivial" work can be so simplistic as to not meet
> the threshold for copyright protection, and it is thus easy for the
> DCO requirements to be satisfied.
> 
> 
> As a person, when you write the API documentation from scratch,
> your output would generally be considered a copyrightable
> contribution by the author.
> 
> When a reviewer then suggests changes to your docs, most of the
> time those changes are so trivial, that the reviewer wouldn't be
> claiming copyright over the resulting work.
> 
> If the reviewer completely rewrites entire sentences in the
> docs, though, they would be able to claim copyright over part
> of the resulting work.
> 
> 
> The tipping point between copyrightable/non-copyrightable is
> hard to define in a policy. It is inherently fuzzy, and somewhat
> of a "you'll know it when you see it" or "let's debate it in court"
> situation...
> 
> 
> So back to LLMs.
> 
> 
> If you ask the LLM (or an agent using an LLM) to entirely write
> the API docs from scratch, I think that should be expected to
> fall under this proposed contribution policy in general.
> 
> 
> If you write the API docs yourself and ask the LLM to review and
> suggest improvements, that MAY or MAY NOT fall under this policy.
> 
> If the LLM-suggested tweaks were minor enough not to meet the
> threshold to be copyrightable, it would be fine; this is little
> different to a human reviewer suggesting tweaks.

Good.

> If the LLM suggested large-scale rewriting, it would be harder
> to draw the line, but that would tend towards falling under this
> contribution policy.
> 
> So it depends on the scope of what the LLM suggested as a change
> to your docs.
> 
> IOW, LLM-as-sparkling-auto-correct is probably OK, but
> LLM-as-book-editor / LLM-as-ghost-writer is probably NOT OK

OK.

> This is a scenario where the QEMU contributor has to use their
> personal judgement as to whether their use of an LLM in a docs context
> is compliant with this policy, or not. I don't think we should try
> to describe this in the policy given how fuzzy the situation is.

Thank you very much for this detailed explanation!

> 
> NB, this copyrightable/non-copyrightable situation applies to source
> code too, not just docs.
> 
> With regards,
> Daniel


Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Kevin Wolf 6 months, 2 weeks ago
On 03.06.2025 at 16:25, Markus Armbruster wrote:
> +TL;DR:
> +
> +  **Current QEMU project policy is to DECLINE any contributions which are
> +  believed to include or derive from AI generated code. This includes ChatGPT,
> +  CoPilot, Llama and similar tools**

[...]

> +Examples of tools impacted by this policy includes both GitHub's CoPilot,
> +OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
> +well known.

I wonder if the best list of examples is still the same now, a year
after the original version of the document was written. In particular,
maybe including an example of popular vibe coding IDEs like Cursor would
make sense?

But it's only examples anyway, so either way is fine.

Kevin
Re: [PATCH v3 3/3] docs: define policy forbidding use of AI code generators
Posted by Markus Armbruster 6 months, 2 weeks ago
Kevin Wolf <kwolf@redhat.com> writes:

> On 03.06.2025 at 16:25, Markus Armbruster wrote:
>> +TL;DR:
>> +
>> +  **Current QEMU project policy is to DECLINE any contributions which are
>> +  believed to include or derive from AI generated code. This includes ChatGPT,
>> +  CoPilot, Llama and similar tools**
>
> [...]
>
>> +Examples of tools impacted by this policy includes both GitHub's CoPilot,
>> +OpenAI's ChatGPT, and Meta's Code Llama, amongst many others which are less
>> +well known.
>
> I wonder if the best list of examples is still the same now, a year
> after the original version of the document was written. In particular,
> maybe including an example of popular vibe coding IDEs like Cursor would
> make sense?
>
> But it's only examples anyway, so either way is fine.

Stefan suggested a few more, and I'll add them.

Thanks!