[v1] docs: define policy forbidding use of "AI" / LLM code generators

[PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

There has been an explosion of interest in so-called "AI" (LLM)
code generators in the past year or so. Thus far, though, this
has not been matched by a broadly accepted legal interpretation
of the licensing implications for code generator outputs. While
the vendors may claim there is no problem and a free choice of
license is possible, they have an inherent conflict of interest
in promoting this interpretation. More broadly there is, as yet,
no broad consensus on the licensing implications of code generators
trained on inputs under a wide variety of licenses.

The DCO requires contributors to assert they have the right to
contribute under the designated project license. Given the lack
of consensus on the licensing of "AI" (LLM) code generator output,
it is not considered credible to assert compliance with the DCO
clause (b) or (c) where a patch includes such generated code.

This patch thus defines a policy that the QEMU project will not
accept contributions where use of "AI" (LLM) code generators is
either known, or suspected.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
index b4591a2dec..a6e42c6b1b 100644
--- a/docs/devel/code-provenance.rst
+++ b/docs/devel/code-provenance.rst
@@ -195,3 +195,43 @@ example::
   Signed-off-by: Some Person <some.person@example.com>
   [Rebased and added support for 'foo']
   Signed-off-by: New Person <new.person@example.com>
+
+Use of "AI" (LLM) code generators
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+TL;DR:
+
+  **Current QEMU project policy is to DECLINE any contributions
+  which are believed to include or derive from "AI" (LLM)
+  generated code.**
+
+The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
+/ LLM) code generators raises a number of difficult legal questions, a
+number of which impact on Open Source projects. As noted earlier, the
+QEMU community requires that contributors certify their patch submissions
+are made in accordance with the rules of the :ref:`dco` (DCO). When a
+patch contains "AI" generated code this raises difficulties with code
+provenance and thus DCO compliance.
+
+To satisfy the DCO, the patch contributor has to fully understand
+the origins and license of code they are contributing to QEMU. The
+license terms that should apply to the output of an "AI" code generator
+are ill-defined, given that both training data and operation of the
+"AI" are typically opaque to the user. Even where the training data
+is said to all be open source, it will likely be under a wide variety
+of license terms.
+
+While the vendors of "AI" code generators may promote the idea that
+code output can be taken under a free choice of license, this is not
+yet considered to be a generally accepted, nor tested, legal opinion.
+
+With this in mind, the QEMU maintainers do not consider it
+currently possible to comply with DCO terms (b) or (c) for most "AI"
+generated code.
+
+The QEMU maintainers thus require that contributors refrain from using
+"AI" code generators on patches intended to be submitted to the project,
+and will decline any contribution if use of "AI" is known or suspected.
+
+Examples of tools impacted by this policy include both GitHub Copilot
+and ChatGPT, amongst many others which are less well known.
-- 
2.41.0

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Stefan Hajnoczi 2 years, 2 months ago

On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> There has been an explosion of interest in so called "AI" (LLM)
> code generators in the past year or so. Thus far though, this is
> has not been matched by a broadly accepted legal interpretation
> of the licensing implications for code generator outputs. While
> the vendors may claim there is no problem and a free choice of
> license is possible, they have an inherent conflict of interest
> in promoting this interpretation. More broadly there is, as yet,
> no broad consensus on the licensing implications of code generators
> trained on inputs under a wide variety of licenses.
> 
> The DCO requires contributors to assert they have the right to
> contribute under the designated project license. Given the lack
> of consensus on the licensing of "AI" (LLM) code generator output,
> it is not considered credible to assert compliance with the DCO
> clause (b) or (c) where a patch includes such generated code.
> 
> This patch thus defines a policy that the QEMU project will not
> accept contributions where use of "AI" (LLM) code generators is
> either known, or suspected.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)

As open source LLMs mature, it may be possible to curate the training
data so that the output complies with software licenses and can be used
in QEMU.

For the time being, the position in this patch seems reasonable because
it prevents license problems down the road.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.


So you called out these two by name, fine, but given "AI" is in scare
quotes I don't really know what is or is not allowed, and I don't know
how contributors will know.  Is the "AI" that one must not use
necessarily an LLM?  And how do you define LLM even? Wikipedia says
"general-purpose language understanding and generation".

All this seems vague to me.

However, can't we define a simpler, more specific policy?
For example, isn't it true that *any* automatically generated code
can only be included if the scripts producing said code
are also included or otherwise available under GPLv2?




> -- 
> 2.41.0

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Thu, Nov 23, 2023 at 09:35:43AM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > +and ChatGPT, amongst many others which are less well known.
> 
> 
> So you called out these two by name, fine, but given "AI" is in scare
> quotes I don't really know what is or is not allowed and I don't know
> how will contributors know.  Is the "AI" that one must not use
> necessarily an LLM?  And how do you define LLM even? Wikipedia says
> "general-purpose language understanding and generation".

I used "AI" in quotes because I think it can mean different things to
different people. In practical terms it has become a bit of a catch-all
term for a wide variety of tools. Thus I think the quotes serve to
express this as a loose generalization, rather than a precise definition.

The same goes for "LLM": I don't want to try to define it, as it has
also become somewhat of a general term.

> All this seems vague to me.

Deliberately so, as there are a wide variety of tools working in
varying ways, but all with similar caveats around the licensing
of the output "derivative" work.

> However, can't we define a simpler more specific policy?
> For example, isn't it true that *any* automatically generated code
> can only be included if the scripts producing said code
> are also included or otherwise available under GPLv2?

The license of a code generation tool itself is usually considered
to be not a factor in the license of its output.

In most cases the license of the input data will determine the
license of the output data, since the latter is a derivative
work of the former. The person running the tool will typically
know exactly what the input data is, and so have confidence in
the license of the output.

If there are questions about whether the output is a derivative
of the tool's code itself, then the tool author can provide a
disclaimer for this.  Such a disclaimer, though, would not erase
the derivative link between input data and output data. One
example is GCC where the output .o/exe is a derivative of the
input .c.  The output, however, may also link the gcc runtime
library, and so GCC has a license exception saying that this
runtime linkage doesn't affect the license of the output
program. This is OK, since the GCC authors who added this
exception owned copyright over the runtime library they're
adding an exception for.

If we apply this to LLMs, the output of the LLM is a derivative
of the training data. The output is not a derivative of the LLM
code. The LLM copyright holders could make this latter point
explicit since they own copyright of the LLM code, but they do
not own copyright of the training data, and neither does the
person using the LLM, hence the legal uncertainty.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> The license of a code generation tool itself is usually considered
> to be not a factor in the license of its output.

Really? I would find it very surprising if a code generation tool that
is not a language model and so is not understanding the code it's
generating did not include some code snippets going into the output.
It is also possible to unintentionally run afoul of GPL's definition of source
code which is "the preferred form of the work for making modifications to it". 
So even if you have copyright to input, dumping just output and putting
GPL on it might or might not be ok.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> > The license of a code generation tool itself is usually considered
> > to be not a factor in the license of its output.
> 
> Really? I would find it very surprising if a code generation tool that
> is not a language model and so is not understanding the code it's
> generating did not include some code snippets going into the output.
> It is also possible to unintentionally run afoul of GPL's definition of source
> code which is "the preferred form of the work for making modifications to it". 
> So even if you have copyright to input, dumping just output and putting
> GPL on it might or might not be ok.

Consider the C pre-processor. This takes an input .c file and expands
all the macros, to spit out a new .c file.

The license of the output .c file is determined by the license of the
input .c file. The license of the CPP impl (whether OSS or proprietary)
doesn't have any influence on the license of the output file, it cannot
magically force the output file to be proprietary any more than it can
force the output file to be GPL.
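
A toy sketch of this point, assuming a drastically simplified
pre-processor that only handles object-like `#define` macros. The
function name `expand_macros` and the sample input are made up for
illustration; this is not how a real cpp is implemented.

```python
import re


def expand_macros(source: str) -> str:
    """Toy stand-in for cpp: expand object-like #define macros only."""
    macros: dict[str, str] = {}
    output_lines = []
    for line in source.splitlines():
        define = re.match(r"\s*#\s*define\s+(\w+)\s+(.*)", line)
        if define:
            # Remember the macro body; the directive itself is not emitted.
            macros[define.group(1)] = define.group(2).strip()
            continue
        for name, body in macros.items():
            # Whole-word replacement; a lambda avoids backslash escapes in body.
            line = re.sub(rf"\b{re.escape(name)}\b", lambda _m, b=body: b, line)
        output_lines.append(line)
    return "\n".join(output_lines)


src = """#define BUF_SIZE 4096
char buf[BUF_SIZE];"""

print(expand_macros(src))
```

Every token in the result comes from the input text; the tool only
rearranges it, which is why the input license is the one that carries
over.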

With regards,
Daniel

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Alex Bennée 2 years, 2 months ago

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
>> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
>> > The license of a code generation tool itself is usually considered
>> > to be not a factor in the license of its output.
>> 
>> Really? I would find it very surprising if a code generation tool that
>> is not a language model and so is not understanding the code it's
>> generating did not include some code snippets going into the output.
>> It is also possible to unintentionally run afoul of GPL's definition of source
>> code which is "the preferred form of the work for making modifications to it". 
>> So even if you have copyright to input, dumping just output and putting
>> GPL on it might or might not be ok.
>
> Consider the C pre-processor. This takes an input .c file, and expands
> all the macros, to split out a new .c file.
>
> The license of the output .c file is determined by the license of the
> input .c file. The license of the CPP impl (whether OSS or proprietary)
> doesn't have any influence on the license of the output file, it cannot
> magically force the output file to be proprietary any more than it can
> force it to be output file GPL.

LLMs are just a tool like a compiler (albeit with spookier, different
internals). The prompt and the instructions are arguably the more
important part of how to get good results from the LLM transformation.
In fact most of the way I've been using them has been by pasting some
existing code and asking for review or transformation of it.

However, I totally get that when using the various online LLMs you have
very little transparency about what has gone into their training, and
therefore there is a danger of proprietary code being hallucinated out
of their matrices. Conversely, what if I use an LLM like OpenLLaMA:

  https://github.com/openlm-research/open_llama

I have fairly exhaustive definitions of what went into the training data,
of which the most interesting is probably the StarCoder dataset (paper):

  https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view

where there are tools to detect if generated code has been lifted
directly from the dataset or is indeed a transformation.

>
> With regards,
> Daniel

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> >> On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> >> > The license of a code generation tool itself is usually considered
> >> > to be not a factor in the license of its output.
> >> 
> >> Really? I would find it very surprising if a code generation tool that
> >> is not a language model and so is not understanding the code it's
> >> generating did not include some code snippets going into the output.
> >> It is also possible to unintentionally run afoul of GPL's definition of source
> >> code which is "the preferred form of the work for making modifications to it". 
> >> So even if you have copyright to input, dumping just output and putting
> >> GPL on it might or might not be ok.
> >
> > Consider the C pre-processor. This takes an input .c file, and expands
> > all the macros, to split out a new .c file.
> >
> > The license of the output .c file is determined by the license of the
> > input .c file. The license of the CPP impl (whether OSS or proprietary)
> > doesn't have any influence on the license of the output file, it cannot
> > magically force the output file to be proprietary any more than it can
> > force it to be output file GPL.
> 
> LLM's are just a tool like a compiler (albeit with spookier different
> internals). The prompt and the instructions are arguably the more
> important part of how to get good results from the LLM transformation.
> In fact most of the way I've been using them has been by pasting some
> existing code and asking for review or transformation of it.
> 
> However I totally get that using the various online LLMs you have very
> little transparency about what has gone into their training and therefor
> there is a danger of proprietary code being hallucinated out of their
> matricies. Conversely what if I use an LLM like OpenLLaMa:
> 
>   https://github.com/openlm-research/open_llama
> 
> I have fairly exhaustive definitions of what went into the training data
> which of most interest is probably the StarCoder dataset (paper):
> 
>   https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view
> 
> where there are tools to detect if generated code has been lifted
> directly from the dataset or is indeed a transformation.

I've not looked at the links above, but if someone can make a
compelling argument that *specific* tools have sufficient transparency
to be compatible with signing the DCO, then I think we could maintain a
list of exceptions in the policy.

With regards,
Daniel

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 10:21:17AM +0000, Alex Bennée wrote:
> LLM's are just a tool like a compiler (albeit with spookier different
> internals).

We already generally don't accept compiler output in patches since
it is not source code by the definition of GPL.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 09:06:29AM +0000, Daniel P. Berrangé wrote:
> On Thu, Nov 23, 2023 at 05:39:18PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Nov 23, 2023 at 05:58:45PM +0000, Daniel P. Berrangé wrote:
> > > The license of a code generation tool itself is usually considered
> > > to be not a factor in the license of its output.
> > 
> > Really? I would find it very surprising if a code generation tool that
> > is not a language model and so is not understanding the code it's
> > generating did not include some code snippets going into the output.
> > It is also possible to unintentionally run afoul of GPL's definition of source
> > code which is "the preferred form of the work for making modifications to it". 
> > So even if you have copyright to input, dumping just output and putting
> > GPL on it might or might not be ok.
> 
> Consider the C pre-processor. This takes an input .c file, and expands
> all the macros, to split out a new .c file.
> 
> The license of the output .c file is determined by the license of the
> input .c file. The license of the CPP impl (whether OSS or proprietary)
> doesn't have any influence on the license of the output file, it cannot
> magically force the output file to be proprietary any more than it can
> force it to be output file GPL.
> 
> With regards,
> Daniel

Sorry, I don't get how the C preprocessor is relevant here. It does not
generate source code in the GPL sense. We won't accept C preprocessor
output in a patch.

Not being a lawyer I personally am not really interested in discussing
how copyright works, certainly not at this highly abstract and
simplified level.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Manos Pitsidianakis 2 years, 2 months ago

On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
>> +Examples of tools impacted by this policy includes both GitHub CoPilot,
>> +and ChatGPT, amongst many others which are less well known.
>
>
>So you called out these two by name, fine, but given "AI" is in scare
>quotes I don't really know what is or is not allowed and I don't know
>how will contributors know.  Is the "AI" that one must not use
>necessarily an LLM?  And how do you define LLM even? Wikipedia says
>"general-purpose language understanding and generation".
>
>
>All this seems vague to me.
>
>
>However, can't we define a simpler more specific policy?
>For example, isn't it true that *any* automatically generated code
>can only be included if the scripts producing said code
>are also included or otherwise available under GPLv2?

The following definition makes sense to me:

- Automated codegen tool must be idempotent.
- Automated codegen tool must not use statistical modelling.

I'd remove all AI or LLM references. These are non-specific, colloquial 
and in the case of `AI`, non-technical. This policy should apply the 
same to a Markov chain code generator.
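
A rough sketch of how the first criterion could be checked mechanically,
reading "idempotent" as "the same input always gives byte-identical
output". The helper `is_reproducible` and both example generators are
hypothetical, invented here for illustration. The second criterion (no
statistical modelling) is about how the tool works internally and cannot
be tested this way.

```python
import hashlib
import random
from typing import Callable


def is_reproducible(generate: Callable[[str], str], spec: str, runs: int = 3) -> bool:
    """Return True if generate(spec) is byte-identical across all runs."""
    digests = {
        hashlib.sha256(generate(spec).encode()).hexdigest() for _ in range(runs)
    }
    return len(digests) == 1


def template_codegen(spec: str) -> str:
    # Deterministic: the output depends only on the input spec.
    return "".join(f"int get_{name}(void);\n" for name in spec.split())


def sampling_codegen(spec: str) -> str:
    # Stand-in for a generator that samples: output differs between runs.
    return "".join(
        f"int get_{name}_{random.getrandbits(64):016x}(void);\n"
        for name in spec.split()
    )


print(is_reproducible(template_codegen, "width height"))
print(is_reproducible(sampling_codegen, "width height"))
```

A check like this can only reject a tool; passing it says nothing about
where the generated text originally came from.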

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Kevin Wolf 2 years, 2 months ago

Am 23.11.2023 um 15:56 hat Manos Pitsidianakis geschrieben:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > So you called out these two by name, fine, but given "AI" is in scare
> > quotes I don't really know what is or is not allowed and I don't know
> > how will contributors know.  Is the "AI" that one must not use
> > necessarily an LLM?  And how do you define LLM even? Wikipedia says
> > "general-purpose language understanding and generation".
> > 
> > 
> > All this seems vague to me.
> > 
> > 
> > However, can't we define a simpler more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
> 
> The following definition makes sense to me:
> 
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

How are these definitions related to your ability to sign the DCO?

Kevin

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Manos Pitsidianakis 2 years, 2 months ago

On Fri, 24 Nov 2023 12:25, Kevin Wolf <kwolf@redhat.com> wrote:
>On 23.11.2023 at 15:56, Manos Pitsidianakis wrote:
>> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> > 
>> > 
>> > So you called out these two by name, fine, but given "AI" is in scare
>> > quotes I don't really know what is or is not allowed and I don't know
>> > how will contributors know.  Is the "AI" that one must not use
>> > necessarily an LLM?  And how do you define LLM even? Wikipedia says
>> > "general-purpose language understanding and generation".
>> > 
>> > 
>> > All this seems vague to me.
>> > 
>> > 
>> > However, can't we define a simpler more specific policy?
>> > For example, isn't it true that *any* automatically generated code
>> > can only be included if the scripts producing said code
>> > are also included or otherwise available under GPLv2?
>> 
>> The following definition makes sense to me:
>> 
>> - Automated codegen tool must be idempotent.
>> - Automated codegen tool must not use statistical modelling.
>
>How are these definitions related to your ability to sign the DCO?
>
>Kevin

This was a response to Michael's salient observation that AI and LLM are 
very vague and not clearly defined terms. I did not mention DCO at all.

Manos

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 11:25:55AM +0100, Kevin Wolf wrote:
> > - Automated codegen tool must be idempotent.
> > - Automated codegen tool must not use statistical modelling.
> 
> How are these definitions related to your ability to sign the DCO?

Not only that: while it is unclear, at least to me, whether code generated
by e.g. Copilot would be source code by the GPL definition, code generated
by an idempotent automated tool seems highly likely not to satisfy the
GPL definition.
Though I am not a lawyer and do not speak for Red Hat.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > 
> > So you called out these two by name, fine, but given "AI" is in scare
> > quotes I don't really know what is or is not allowed and I don't know
> > how will contributors know.  Is the "AI" that one must not use
> > necessarily an LLM?  And how do you define LLM even? Wikipedia says
> > "general-purpose language understanding and generation".
> > 
> > 
> > All this seems vague to me.
> > 
> > 
> > However, can't we define a simpler more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
> 
> The following definition makes sense to me:
> 
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

As a casual reader, I would find this somewhat unclear to interpret
and relate to.

> I'd remove all AI or LLM references. These are non-specific, colloquial and
> in the case of `AI`, non-technical. This policy should apply the same to a
> Markov chain code generator.

The fact that they are colloquial is, IMHO, a good thing, as it makes
the policy relatable to the casual reader who hears the terms "AI" and
"LLM" in technical press articles/blogs/etc all over the place.

I would have considered "Markov chain code generator" to fall under the
"AI" reference, since "AI" has de facto become a general purpose term
that covers a weird variety of underlying technologies.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Peter Maydell 2 years, 2 months ago

On Thu, 23 Nov 2023 at 18:02, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:
> > On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Thu, Nov 23, 2023 at 11:40:26AM +0000, Daniel P. Berrangé wrote:
> > > > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > > > +and ChatGPT, amongst many others which are less well known.
> > >
> > >
> > > So you called out these two by name, fine, but given "AI" is in scare
> > > quotes I don't really know what is or is not allowed and I don't know
> > > how will contributors know.  Is the "AI" that one must not use
> > > necessarily an LLM?  And how do you define LLM even? Wikipedia says
> > > "general-purpose language understanding and generation".
> > >
> > >
> > > All this seems vague to me.
> > >
> > >
> > > However, can't we define a simpler more specific policy?
> > > For example, isn't it true that *any* automatically generated code
> > > can only be included if the scripts producing said code
> > > are also included or otherwise available under GPLv2?
> >
> > The following definition makes sense to me:
> >
> > - Automated codegen tool must be idempotent.
> > - Automated codegen tool must not use statistical modelling.
>
> As a casual reader, I would find this somewhat unclear to interpret
> and relate to.

It's also not really relevant to what we're trying to rule out.
A non-idempotent codegen tool is fine, if the code it generates
is clearly under a license that's compatible with QEMU's.
A codegen tool that uses statistical modelling is also fine,
if (for example) it's only doing statistical modelling of the
data in the single file it's adding code to and doesn't use
any external data set.
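
To make the second case concrete, here is a minimal Python sketch of a
completion helper whose only "statistical model" is a frequency count over
the file being edited. The function name complete and the sample buffer are
hypothetical, chosen purely for illustration; this is not how any particular
editor implements completion.

```python
import re
from collections import Counter


def complete(prefix, buffer_text):
    """Suggest the identifier most often used in buffer_text that extends prefix.

    The only data consulted is buffer_text itself; no external data set is used.
    """
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", buffer_text)
    counts = Counter(w for w in words if w.startswith(prefix) and w != prefix)
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically so results are stable.
    return min(counts, key=lambda w: (-counts[w], w))


# Hypothetical contents of the file currently being edited.
sample_buffer = """
qemu_mutex_lock(&s->lock);
qemu_mutex_unlock(&s->lock);
qemu_mutex_lock(&t->lock);
"""

print(complete("qemu_m", sample_buffer))
```

Every suggestion such a tool makes is text the author has already written in
that file, which is what separates it from a tool trained on other people's code.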

> > I'd remove all AI or LLM references. These are non-specific, colloquial and
> > in the case of `AI`, non-technical. This policy should apply the same to a
> > Markov chain code generator.
>
> The fact that they are colloquial is, IMHO, a good thing, as it makes
> the policy relatable to the casual reader who hears the terms "AI" and
> "LLM" in technical press articles/blogs/etc all over the place.

Yes, I think that the most important thing about the wording
of this policy (assuming we agree on it) is that it should be
immediately very clear to anybody reading it that ChatGPT,
Copilot, etc type tools aren't permitted. Because in practice
the most likely case is somebody who wants to use those, and we
don't want to make them have to go through "read an abstract
definition of what isn't permitted and apply that abstract
definition to the concrete tool they're using".

thanks
-- PMM

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Alex Bennée 2 years, 2 months ago

Manos Pitsidianakis <manos.pitsidianakis@linaro.org> writes:

> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>
>>So you called out these two by name, fine, but given "AI" is in scare
>>quotes I don't really know what is or is not allowed and I don't know
>>how will contributors know.  Is the "AI" that one must not use
>>necessarily an LLM?  And how do you define LLM even? Wikipedia says
>>"general-purpose language understanding and generation".
>>
>>
>>All this seems vague to me.
>>
>>
>>However, can't we define a simpler more specific policy?
>>For example, isn't it true that *any* automatically generated code
>>can only be included if the scripts producing said code
>>are also included or otherwise available under GPLv2?
>
> The following definition makes sense to me:
>
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.
>
> I'd remove all AI or LLM references. These are non-specific,
> colloquial and in the case of `AI`, non-technical. This policy should
> apply the same to a Markov chain code generator.

I'm fairly sure my Emacs auto-complete would fail by that definition.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Philippe Mathieu-Daudé 2 years, 2 months ago

On 23/11/23 15:56, Manos Pitsidianakis wrote:
> On Thu, 23 Nov 2023 16:35, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>
>> So you called out these two by name, fine, but given "AI" is in scare
>> quotes I don't really know what is or is not allowed and I don't know
>> how will contributors know.  Is the "AI" that one must not use
>> necessarily an LLM?  And how do you define LLM even? Wikipedia says
>> "general-purpose language understanding and generation".
>>
>>
>> All this seems vague to me.
>>
>>
>> However, can't we define a simpler more specific policy?
>> For example, isn't it true that *any* automatically generated code
>> can only be included if the scripts producing said code
>> are also included or otherwise available under GPLv2?
> 
> The following definition makes sense to me:
> 
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.
> 
> I'd remove all AI or LLM references. These are non-specific, colloquial 
> and in the case of `AI`, non-technical. This policy should apply the 
> same to a Markov chain code generator.

This document targets all contributors. Contributions can be typo
fixes, translations, ... and don't have to be technical. Similarly,
contributors aren't expected to be technical experts. To a neophyte,
"AI" makes sense. "Idempotent code generator" or "LLM" don't :)

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> This document targets all contributors. Contributions can be typo
> fix, translations, ... and don't have to be technical. Similarly,
> contributors aren't expected to be technical experts. As a neophyte,
> "AI" makes sense. "Idempotent code generator" or "LLM" don't :)

I don't think there's any big deal in using AI for typo fixes.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michal Suchánek 2 years, 2 months ago

On Thu, Nov 23, 2023 at 12:06:59PM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> > This document targets all contributors. Contributions can be typo
> > fix, translations, ... and don't have to be technical. Similarly,
> > contributors aren't expected to be technical experts. As a neophyte,
> > "AI" makes sense. "Idempotent code generator" or "LLM" don't :)
> 
> I don't think there's any big deal in using AI for typo fixes.

For how many typos is it still OK, and would a deterministic
spellchecker not be preferred?

There are some edge cases where using AI is OK. The problem is that most
of the time it is not clear whether it is OK to use.

Thanks

Michal

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 06:29:38PM +0100, Michal Suchánek wrote:
> On Thu, Nov 23, 2023 at 12:06:59PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Nov 23, 2023 at 04:29:52PM +0100, Philippe Mathieu-Daudé wrote:
> > > This document targets all contributors. Contributions can be typo
> > > fix, translations, ... and don't have to be technical. Similarly,
> > > contributors aren't expected to be technical experts. As a neophyte,
> > > "AI" makes sense. "Idempotent code generator" or "LLM" don't :)
> > 
> > I don't think there's any big deal in using AI for typo fixes.
> 
> For how many typos is it still OK, and would a deterministic
> spellchecker not be preferred?
> 
> There are some edge cases where using AI is OK. The problem is that most
> of the time it is not clear whether it is OK to use.
> 
> Thanks
> 
> Michal

¯\_(ツ)_/¯ I am not a lawyer, and I don't speak for Red Hat.


My point, however, is that even if you are using e.g. a grammar
corrector, you had better make sure that it is not claiming that its
output is a derivative work.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 04:56:28PM +0200, Manos Pitsidianakis wrote:
> > However, can't we define a simpler more specific policy?
> > For example, isn't it true that *any* automatically generated code
> > can only be included if the scripts producing said code
> > are also included or otherwise available under GPLv2?
> 
> The following definition makes sense to me:
> 
> - Automated codegen tool must be idempotent.
> - Automated codegen tool must not use statistical modelling.

Why does it matter so much?

> I'd remove all AI or LLM references. These are non-specific, colloquial and
> in the case of `AI`, non-technical. This policy should apply the same to a
> Markov chain code generator.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Kevin Wolf 2 years, 2 months ago

Am 23.11.2023 um 12:40 hat Daniel P. Berrangé geschrieben:
> There has been an explosion of interest in so called "AI" (LLM)
> code generators in the past year or so. Thus far though, this is
> has not been matched by a broadly accepted legal interpretation
> of the licensing implications for code generator outputs. While
> the vendors may claim there is no problem and a free choice of
> license is possible, they have an inherent conflict of interest
> in promoting this interpretation. More broadly there is, as yet,
> no broad consensus on the licensing implications of code generators
> trained on inputs under a wide variety of licenses.
> 
> The DCO requires contributors to assert they have the right to
> contribute under the designated project license. Given the lack
> of consensus on the licensing of "AI" (LLM) code generator output,
> it is not considered credible to assert compliance with the DCO
> clause (b) or (c) where a patch includes such generated code.
> 
> This patch thus defines a policy that the QEMU project will not
> accept contributions where use of "AI" (LLM) code generators is
> either known, or suspected.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> index b4591a2dec..a6e42c6b1b 100644
> --- a/docs/devel/code-provenance.rst
> +++ b/docs/devel/code-provenance.rst
> @@ -195,3 +195,43 @@ example::
>    Signed-off-by: Some Person <some.person@example.com>
>    [Rebased and added support for 'foo']
>    Signed-off-by: New Person <new.person@example.com>
> +
> +Use of "AI" (LLM) code generators
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +TL;DR:
> +
> +  **Current QEMU project policy is to DECLINE any contributions
> +  which are believed to include or derive from "AI" (LLM)
> +  generated code.**
> +
> +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> +/ LLM) code generators raises a number of difficult legal questions, a
> +number of which impact on Open Source projects. As noted earlier, the
> +QEMU community requires that contributors certify their patch submissions
> +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> +patch contains "AI" generated code this raises difficulties with code
> +provenence and thus DCO compliance.
> +
> +To satisfy the DCO, the patch contributor has to fully understand
> +the origins and license of code they are contributing to QEMU. The
> +license terms that should apply to the output of an "AI" code generator
> +are ill-defined, given that both training data and operation of the
> +"AI" are typically opaque to the user. Even where the training data
> +is said to all be open source, it will likely be under a wide variety
> +of license terms.
> +
> +While the vendor's of "AI" code generators may promote the idea that
> +code output can be taken under a free choice of license, this is not
> +yet considered to be a generally accepted, nor tested, legal opinion.
> +
> +With this in mind, the QEMU maintainers does not consider it is

s/does/do/ or maybe s/maintainers/project/

> +currently possible to comply with DCO terms (b) or (c) for most "AI"
> +generated code.
> +
> +The QEMU maintainers thus require that contributors refrain from using
> +"AI" code generators on patches intended to be submitted to the project,
> +and will decline any contribution if use of "AI" is known or suspected.
> +
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.

Acked-by: Kevin Wolf <kwolf@redhat.com>

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Alex Bennée 2 years, 2 months ago

Daniel P. Berrangé <berrange@redhat.com> writes:

> There has been an explosion of interest in so called "AI" (LLM)
> code generators in the past year or so. Thus far though, this is
> has not been matched by a broadly accepted legal interpretation
> of the licensing implications for code generator outputs. While
> the vendors may claim there is no problem and a free choice of
> license is possible, they have an inherent conflict of interest
> in promoting this interpretation. More broadly there is, as yet,
> no broad consensus on the licensing implications of code generators
> trained on inputs under a wide variety of licenses.
>
> The DCO requires contributors to assert they have the right to
> contribute under the designated project license. Given the lack
> of consensus on the licensing of "AI" (LLM) code generator output,
> it is not considered credible to assert compliance with the DCO
> clause (b) or (c) where a patch includes such generated code.
>
> This patch thus defines a policy that the QEMU project will not
> accept contributions where use of "AI" (LLM) code generators is
> either known, or suspected.
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> index b4591a2dec..a6e42c6b1b 100644
> --- a/docs/devel/code-provenance.rst
> +++ b/docs/devel/code-provenance.rst
> @@ -195,3 +195,43 @@ example::
>    Signed-off-by: Some Person <some.person@example.com>
>    [Rebased and added support for 'foo']
>    Signed-off-by: New Person <new.person@example.com>
> +
> +Use of "AI" (LLM) code generators
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +TL;DR:
> +
> +  **Current QEMU project policy is to DECLINE any contributions
> +  which are believed to include or derive from "AI" (LLM)
> +  generated code.**
> +
> +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> +/ LLM) code generators raises a number of difficult legal questions, a
> +number of which impact on Open Source projects. As noted earlier, the
> +QEMU community requires that contributors certify their patch submissions
> +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> +patch contains "AI" generated code this raises difficulties with code
> +provenence and thus DCO compliance.

I agree this is going to be a field that keeps lawyers well remunerated
for the foreseeable future. However, I suspect this glosses over the main
use case for LLM generators, which is non-novel transformation. One good
example is generating test fixtures, where you write a piece of original
code and then ask the code completion engine to fill out some unit tests
to exercise the code. It's boring mechanical work, but work that an LLM is
very suited to (even if you might tweak the final result).

> +To satisfy the DCO, the patch contributor has to fully understand
> +the origins and license of code they are contributing to QEMU. The
> +license terms that should apply to the output of an "AI" code generator
> +are ill-defined, given that both training data and operation of the
> +"AI" are typically opaque to the user. Even where the training data
> +is said to all be open source, it will likely be under a wide variety
> +of license terms.
> +
> +While the vendor's of "AI" code generators may promote the idea that
> +code output can be taken under a free choice of license, this is not
> +yet considered to be a generally accepted, nor tested, legal opinion.
> +
> +With this in mind, the QEMU maintainers does not consider it is
> +currently possible to comply with DCO terms (b) or (c) for most "AI"
> +generated code.

There is a load of code out there that isn't eligible for copyright
protection because it doesn't demonstrate much originality or creativity.
In the experimentation I've done so far I've not seen much sign of genuine
creativity. LLMs benefit from having access to a wide corpus of
training data and tend to do a better job of inferring solutions from
semi-related posts than, say, a human manually comparing posts after
pasting an error message into Google.

> +
> +The QEMU maintainers thus require that contributors refrain from using
> +"AI" code generators on patches intended to be submitted to the project,
> +and will decline any contribution if use of "AI" is known or suspected.
> +
> +Examples of tools impacted by this policy includes both GitHub CoPilot,
> +and ChatGPT, amongst many others which are less well known.

What if you took an LLM and then fine-tuned it using project
data so it could better help new users in making contributions to the
project? You would be biasing the model towards your own data for the
purpose of helping developers write better QEMU code.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There has been an explosion of interest in so called "AI" (LLM)
> > code generators in the past year or so. Thus far though, this is
> > has not been matched by a broadly accepted legal interpretation
> > of the licensing implications for code generator outputs. While
> > the vendors may claim there is no problem and a free choice of
> > license is possible, they have an inherent conflict of interest
> > in promoting this interpretation. More broadly there is, as yet,
> > no broad consensus on the licensing implications of code generators
> > trained on inputs under a wide variety of licenses.
> >
> > The DCO requires contributors to assert they have the right to
> > contribute under the designated project license. Given the lack
> > of consensus on the licensing of "AI" (LLM) code generator output,
> > it is not considered credible to assert compliance with the DCO
> > clause (b) or (c) where a patch includes such generated code.
> >
> > This patch thus defines a policy that the QEMU project will not
> > accept contributions where use of "AI" (LLM) code generators is
> > either known, or suspected.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > index b4591a2dec..a6e42c6b1b 100644
> > --- a/docs/devel/code-provenance.rst
> > +++ b/docs/devel/code-provenance.rst
> > @@ -195,3 +195,43 @@ example::
> >    Signed-off-by: Some Person <some.person@example.com>
> >    [Rebased and added support for 'foo']
> >    Signed-off-by: New Person <new.person@example.com>
> > +
> > +Use of "AI" (LLM) code generators
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +TL;DR:
> > +
> > +  **Current QEMU project policy is to DECLINE any contributions
> > +  which are believed to include or derive from "AI" (LLM)
> > +  generated code.**
> > +
> > +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> > +/ LLM) code generators raises a number of difficult legal questions, a
> > +number of which impact on Open Source projects. As noted earlier, the
> > +QEMU community requires that contributors certify their patch submissions
> > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > +patch contains "AI" generated code this raises difficulties with code
> > +provenence and thus DCO compliance.
> 
> I agree this is going to be a field that keeps lawyers well re-numerated
> for the foreseeable future. However I suspect this elides over the main
> use case for LLM generators which is non-novel transformation. One good
> example is generating text fixtures where you write a piece of original
> code and then ask the code completion engine to fill out some unit tests
> to exercise the code. It's boring mechanical work but one an LLM is very
> suited to (even if you might tweak the final result).

Yes, I can see how that is helpful, but I think in many cases the
resulting code will be complex enough to be considered copyrightable,
and so even with the original input code, I feel the licensing of the
output is still ill-defined.

> 
> > +To satisfy the DCO, the patch contributor has to fully understand
> > +the origins and license of code they are contributing to QEMU. The
> > +license terms that should apply to the output of an "AI" code generator
> > +are ill-defined, given that both training data and operation of the
> > +"AI" are typically opaque to the user. Even where the training data
> > +is said to all be open source, it will likely be under a wide variety
> > +of license terms.
> > +
> > +While the vendor's of "AI" code generators may promote the idea that
> > +code output can be taken under a free choice of license, this is not
> > +yet considered to be a generally accepted, nor tested, legal opinion.
> > +
> > +With this in mind, the QEMU maintainers does not consider it is
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
> 
> There is a load of code out that isn't eligible for copyright projection
> because it doesn't demonstrate much originality or creativity. In the
> experimentation I've done so far I've not seen much sign of genuine
> creativity. LLM's benefit from having access to a wide corpus of
> training data and tend to do a better job of inferencing solutions from
> semi-related posts than say for example human manually comparing posts
> having pasted an error message in google.

The boundary between what is considered copyrightable and what is not is
itself quite ill-defined, and thus it is hard to express a clear
rule that can be applied.

I think more experienced long-term contributors end up getting somewhat
of a "gut feeling" about what's ok and what's not, but I'm not sure if
that is true for contributors in general.

IOW, while there are likely cases where it is possible to safely use
an AI generator, I'm not sure how best to express that in a way that
makes sense.

Perhaps a loosely worded addendum about a possible exception for
"trivial" output?

> > +The QEMU maintainers thus require that contributors refrain from using
> > +"AI" code generators on patches intended to be submitted to the project,
> > +and will decline any contribution if use of "AI" is known or suspected.
> > +
> > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > +and ChatGPT, amongst many others which are less well known.
> 
> What about if you took an LLM and then fine tuned it by using project
> data so it could better help new users in making contributions to the
> project? You would be biasing the model to your own data for the
> purposes of helping developers write better QEMU code?

It is hard to provide an answer to that question, since I think it is
something that would need to be considered case by case. It hinges
on how much the new QEMU-specific training data influences
the model vs. other pre-existing training (if any).

Perhaps we can finish this policy with a general point to solicit
feedback on possible exceptions?

  "If a contributor believes they can demonstrate that the output of
   a particular tool has deterministic licensing, such that they can
   satisfy the DCO, they should provide such info to the mailing list"

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
> On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > There has been an explosion of interest in so called "AI" (LLM)
> > > code generators in the past year or so. Thus far though, this is
> > > has not been matched by a broadly accepted legal interpretation
> > > of the licensing implications for code generator outputs. While
> > > the vendors may claim there is no problem and a free choice of
> > > license is possible, they have an inherent conflict of interest
> > > in promoting this interpretation. More broadly there is, as yet,
> > > no broad consensus on the licensing implications of code generators
> > > trained on inputs under a wide variety of licenses.
> > >
> > > The DCO requires contributors to assert they have the right to
> > > contribute under the designated project license. Given the lack
> > > of consensus on the licensing of "AI" (LLM) code generator output,
> > > it is not considered credible to assert compliance with the DCO
> > > clause (b) or (c) where a patch includes such generated code.
> > >
> > > This patch thus defines a policy that the QEMU project will not
> > > accept contributions where use of "AI" (LLM) code generators is
> > > either known, or suspected.
> > >
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 40 insertions(+)
> > >
> > > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > > index b4591a2dec..a6e42c6b1b 100644
> > > --- a/docs/devel/code-provenance.rst
> > > +++ b/docs/devel/code-provenance.rst
> > > @@ -195,3 +195,43 @@ example::
> > >    Signed-off-by: Some Person <some.person@example.com>
> > >    [Rebased and added support for 'foo']
> > >    Signed-off-by: New Person <new.person@example.com>
> > > +
> > > +Use of "AI" (LLM) code generators
> > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +TL;DR:
> > > +
> > > +  **Current QEMU project policy is to DECLINE any contributions
> > > +  which are believed to include or derive from "AI" (LLM)
> > > +  generated code.**
> > > +
> > > +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> > > +/ LLM) code generators raises a number of difficult legal questions, a
> > > +number of which impact on Open Source projects. As noted earlier, the
> > > +QEMU community requires that contributors certify their patch submissions
> > > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > > +patch contains "AI" generated code this raises difficulties with code
> > > +provenence and thus DCO compliance.
> > 
> > I agree this is going to be a field that keeps lawyers well re-numerated
> > for the foreseeable future. However I suspect this elides over the main
> > use case for LLM generators which is non-novel transformation. One good
> > example is generating text fixtures where you write a piece of original
> > code and then ask the code completion engine to fill out some unit tests
> > to exercise the code. It's boring mechanical work but one an LLM is very
> > suited to (even if you might tweak the final result).
> 
> Yes, I can see how that is helpful, but I think in many cases the
> resulting code will be complex enough to be considered copyrightable,
> and so even with the original input code, I feel the licensing of the
> output is still ill-defined.
> 
> > 
> > > +To satisfy the DCO, the patch contributor has to fully understand
> > > +the origins and license of code they are contributing to QEMU. The
> > > +license terms that should apply to the output of an "AI" code generator
> > > +are ill-defined, given that both training data and operation of the
> > > +"AI" are typically opaque to the user. Even where the training data
> > > +is said to all be open source, it will likely be under a wide variety
> > > +of license terms.
> > > +
> > > +While the vendor's of "AI" code generators may promote the idea that
> > > +code output can be taken under a free choice of license, this is not
> > > +yet considered to be a generally accepted, nor tested, legal opinion.
> > > +
> > > +With this in mind, the QEMU maintainers does not consider it is
> > > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > > +generated code.
> > 
> > There is a load of code out that isn't eligible for copyright projection
> > because it doesn't demonstrate much originality or creativity. In the
> > experimentation I've done so far I've not seen much sign of genuine
> > creativity. LLM's benefit from having access to a wide corpus of
> > training data and tend to do a better job of inferencing solutions from
> > semi-related posts than say for example human manually comparing posts
> > having pasted an error message in google.
> 
> The boundary between what is considered copyrightable and not, it
> itself quite ill-defined, and thus it is hard to express a clear
> rule that can be applied.
> 
> I think more experience long term contributors end up getting somewhat
> of a "gut feeling" about what's ok and what's not, but I'm not sure if
> that is true for contibutors in general.
> 
> IOW, while there are likely cases where it is possible to safely use
> a AI generator, I'm not sure how to best express that in an way that
> makes sense.
> 
> Perhaps a loosely worded addendum  about possible exception for
> "trivial" output
> 
> > > +The QEMU maintainers thus require that contributors refrain from using
> > > +"AI" code generators on patches intended to be submitted to the project,
> > > +and will decline any contribution if use of "AI" is known or suspected.
> > > +
> > > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > > +and ChatGPT, amongst many others which are less well known.
> > 
> > What about if you took an LLM and then fine tuned it by using project
> > data so it could better help new users in making contributions to the
> > project? You would be biasing the model to your own data for the
> > purposes of helping developers write better QEMU code?
> 
> It is hard to provide an answer to that question, since I think it is
> something that would need to be considered case by case. It hinges
> around how much does the new QEMU specific training data influence
> the model, vs other pre-existing training (if any)
> 
> Perhaps we can finish this policy with a general point to solicit
> feedback on possible exceptions ?
> 
>   "If a contributor believes they can demonstrate that the output of
>    a particular tool has deterministic licensing, such that they can
>    satisfy the DCO, they should provide such info to the mailing list"
> 
> With regards,
> Daniel


But the question is not about what QEMU should accept. We can trust
maintainers to DTRT. The question is the meaning of the DCO. If you want
the DCO to mean "this code was not generated by AI" then you had better
define "AI" in an unambiguous way; otherwise, what is it certifying?

Instead, I propose adding simply this:

	Thus, generally, Signed-off-by from *each* person who has written
	a substantial portion of the patch is required.

	If a substantial portion of the patch was not written by any
	human person but was instead generated automatically (e.g. by an AI such
	as ChatGPT, or a decompiler) then you *must* clearly document
	this in the patch commit message. As a matter of policy, and out of an
	abundance of caution, such contributions will generally be rejected.

	When in doubt whether a specific portion is substantial - assume
	that Signed-off-by is required.





-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Kevin Wolf 2 years, 2 months ago

Am 24.11.2023 um 00:53 hat Michael S. Tsirkin geschrieben:
> On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
> > On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > > 
> > > > There has been an explosion of interest in so called "AI" (LLM)
> > > > code generators in the past year or so. Thus far though, this is
> > > > has not been matched by a broadly accepted legal interpretation
> > > > of the licensing implications for code generator outputs. While
> > > > the vendors may claim there is no problem and a free choice of
> > > > license is possible, they have an inherent conflict of interest
> > > > in promoting this interpretation. More broadly there is, as yet,
> > > > no broad consensus on the licensing implications of code generators
> > > > trained on inputs under a wide variety of licenses.
> > > >
> > > > The DCO requires contributors to assert they have the right to
> > > > contribute under the designated project license. Given the lack
> > > > of consensus on the licensing of "AI" (LLM) code generator output,
> > > > it is not considered credible to assert compliance with the DCO
> > > > clause (b) or (c) where a patch includes such generated code.
> > > >
> > > > This patch thus defines a policy that the QEMU project will not
> > > > accept contributions where use of "AI" (LLM) code generators is
> > > > either known, or suspected.
> > > >
> > > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > > ---
> > > >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 40 insertions(+)
> > > >
> > > > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > > > index b4591a2dec..a6e42c6b1b 100644
> > > > --- a/docs/devel/code-provenance.rst
> > > > +++ b/docs/devel/code-provenance.rst
> > > > @@ -195,3 +195,43 @@ example::
> > > >    Signed-off-by: Some Person <some.person@example.com>
> > > >    [Rebased and added support for 'foo']
> > > >    Signed-off-by: New Person <new.person@example.com>
> > > > +
> > > > +Use of "AI" (LLM) code generators
> > > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > +
> > > > +TL;DR:
> > > > +
> > > > +  **Current QEMU project policy is to DECLINE any contributions
> > > > +  which are believed to include or derive from "AI" (LLM)
> > > > +  generated code.**
> > > > +
> > > > +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> > > > +/ LLM) code generators raises a number of difficult legal questions, a
> > > > +number of which impact on Open Source projects. As noted earlier, the
> > > > +QEMU community requires that contributors certify their patch submissions
> > > > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > > > +patch contains "AI" generated code this raises difficulties with code
> > > > +provenence and thus DCO compliance.
> > > 
> > > I agree this is going to be a field that keeps lawyers well re-numerated
> > > for the foreseeable future. However I suspect this elides over the main
> > > use case for LLM generators which is non-novel transformation. One good
> > > example is generating text fixtures where you write a piece of original
> > > code and then ask the code completion engine to fill out some unit tests
> > > to exercise the code. It's boring mechanical work but one an LLM is very
> > > suited to (even if you might tweak the final result).
> > 
> > Yes, I can see how that is helpful, but I think in many cases the
> > resulting code will be complex enough to be considered copyrightable,
> > and so even with the original input code, I feel the licensing of the
> > output is still ill-defined.
> > 
> > > 
> > > > +To satisfy the DCO, the patch contributor has to fully understand
> > > > +the origins and license of code they are contributing to QEMU. The
> > > > +license terms that should apply to the output of an "AI" code generator
> > > > +are ill-defined, given that both training data and operation of the
> > > > +"AI" are typically opaque to the user. Even where the training data
> > > > +is said to all be open source, it will likely be under a wide variety
> > > > +of license terms.
> > > > +
> > > > +While the vendor's of "AI" code generators may promote the idea that
> > > > +code output can be taken under a free choice of license, this is not
> > > > +yet considered to be a generally accepted, nor tested, legal opinion.
> > > > +
> > > > +With this in mind, the QEMU maintainers does not consider it is
> > > > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > > > +generated code.
> > > 
> > > There is a load of code out that isn't eligible for copyright projection
> > > because it doesn't demonstrate much originality or creativity. In the
> > > experimentation I've done so far I've not seen much sign of genuine
> > > creativity. LLM's benefit from having access to a wide corpus of
> > > training data and tend to do a better job of inferencing solutions from
> > > semi-related posts than say for example human manually comparing posts
> > > having pasted an error message in google.
> > 
> > The boundary between what is considered copyrightable and not, it
> > itself quite ill-defined, and thus it is hard to express a clear
> > rule that can be applied.
> > 
> > I think more experience long term contributors end up getting somewhat
> > of a "gut feeling" about what's ok and what's not, but I'm not sure if
> > that is true for contibutors in general.
> > 
> > IOW, while there are likely cases where it is possible to safely use
> > a AI generator, I'm not sure how to best express that in an way that
> > makes sense.
> > 
> > Perhaps a loosely worded addendum  about possible exception for
> > "trivial" output
> > 
> > > > +The QEMU maintainers thus require that contributors refrain from using
> > > > +"AI" code generators on patches intended to be submitted to the project,
> > > > +and will decline any contribution if use of "AI" is known or suspected.
> > > > +
> > > > +Examples of tools impacted by this policy includes both GitHub CoPilot,
> > > > +and ChatGPT, amongst many others which are less well known.
> > > 
> > > What about if you took an LLM and then fine tuned it by using project
> > > data so it could better help new users in making contributions to the
> > > project? You would be biasing the model to your own data for the
> > > purposes of helping developers write better QEMU code?
> > 
> > It is hard to provide an answer to that question, since I think it is
> > something that would need to be considered case by case. It hinges
> > around how much does the new QEMU specific training data influence
> > the model, vs other pre-existing training (if any)

I suspect fine tuning won't be enough because it doesn't make the
unlicensed original training data go away.

If you could make sure that all of the training data consists only of
code for which you have the right to contribute it to QEMU, that would
be a different case.

> > Perhaps we can finish this policy with a general point to solicit
> > feedback on possible exceptions ?
> > 
> >   "If a contributor believes they can demonstrate that the output of
> >    a particular tool has deterministic licensing, such that they can
> >    satisfy the DCO, they should provide such info to the mailing list"
> > 
> > With regards,
> > Daniel
> 
> 
> But the question is not about what QEMU should accept. We can trust
> maintainers to DTRT. The question is the meaning of DCO.  If you want
> DCO to mean "this code was not generated by AI" then you better define
> "AI" in an unambiguous way otherwise what is it certifying?

That you can state confidently that you have the legal right to
contribute this code.

The problem is not AI per se, the problem is incompatibly licensed - or
really, unlicensed (should I call it "pirated" for effect?) - training
input for the AI.

So if you got the code from ChatGPT, I simply won't believe you even if
you claim that you have the right.

> Instead, I propose adding simply this:
> 
> 	Thus, generally, Signed-off-by from *each* person who has written
> 	a substantial portion of the patch is required.
> 
> 	If a substantial portion of the patch was not written by any
> 	human person but was instead generated automatically (e.g. by an AI such
> 	as ChatGPT, or a decompiler) then you *must* clearly document
> 	this in the patch commit message. As a matter of policy, and out of an
> 	abundance of caution, such contributions will generally be rejected.
> 
> 	When in doubt whether a specific portion is substantial - assume
> 	that Signed-off-by is required.

"generated automatically" is going way too far. There is no problem at
all with code changes generated by Coccinelle if you wrote the rules
yourself or received them under a license that allows their inclusion in
QEMU.

The problem with ChatGPT etc. is that there is no licensing information
attached to the generated code. You know it's based on someone else's
work, but you don't know who it is, if they are willing to give you a
license and under which conditions.

And it's not out of an "abundance of caution" that we reject such patches,
but because you obviously can't actually sign the DCO under such
circumstances, and therefore the S-o-b is wrong.

Kevin

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Alex Bennée 2 years, 2 months ago

Kevin Wolf <kwolf@redhat.com> writes:

> Am 24.11.2023 um 00:53 hat Michael S. Tsirkin geschrieben:
>> On Thu, Nov 23, 2023 at 05:46:16PM +0000, Daniel P. Berrangé wrote:
>> > On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
>> > > Daniel P. Berrangé <berrange@redhat.com> writes:
>> > > 
<snip>
>> > > > +The QEMU maintainers thus require that contributors refrain from using
>> > > > +"AI" code generators on patches intended to be submitted to the project,
>> > > > +and will decline any contribution if use of "AI" is known or suspected.
>> > > > +
>> > > > +Examples of tools impacted by this policy includes both GitHub CoPilot,
>> > > > +and ChatGPT, amongst many others which are less well known.
>> > > 
>> > > What about if you took an LLM and then fine tuned it by using project
>> > > data so it could better help new users in making contributions to the
>> > > project? You would be biasing the model to your own data for the
>> > > purposes of helping developers write better QEMU code?
>> > 
>> > It is hard to provide an answer to that question, since I think it is
>> > something that would need to be considered case by case. It hinges
>> > around how much does the new QEMU specific training data influence
>> > the model, vs other pre-existing training (if any)
>
> I suspect fine tuning won't be enough because it doesn't make the
> unlicensed original training data go away.
>
> If you could make sure that all of the training data consists only of
> code for which you have the right to contribute it to QEMU, that would
> be a different case.

That probably means we can never use even open source LLMs to generate
code for QEMU because while the source data is all open source it won't
necessarily be GPL compatible.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> That probably means we can never use even open source LLMs to generate
> code for QEMU because while the source data is all open source it won't
> necessarily be GPL compatible.

I would probably wait until the dust settles before we start accepting
LLM generated code. If nothing else, generated code quality
in our niche area is at this point still nowhere near being useful.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Peter Maydell 2 years, 2 months ago

On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > That probably means we can never use even open source LLMs to generate
> > code for QEMU because while the source data is all open source it won't
> > necessarily be GPL compatible.
>
> I would probably wait until the dust settles before we start accepting
> LLM generated code.

I think that's pretty much my take on what this policy is:
"say no for now; we can always come back later when the legal
situation seems clearer".

-- PMM

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Daniel P. Berrangé 2 years, 2 months ago

On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > > That probably means we can never use even open source LLMs to generate
> > > code for QEMU because while the source data is all open source it won't
> > > necessarily be GPL compatible.
> >
> > I would probably wait until the dust settles before we start accepting
> > LLM generated code.
> 
> I think that's pretty much my take on what this policy is:
> "say no for now; we can always come back later when the legal
> situation seems clearer".

Yes, those were my thoughts exactly.

And if anyone comes along with a specific LLM/AI code generator that
they believe can be used in a way compatible with the DCO, they can
ask for an exception to the general policy which we can discuss then.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 11:37:15AM +0000, Daniel P. Berrangé wrote:
> On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> > On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > > > That probably means we can never use even open source LLMs to generate
> > > > code for QEMU because while the source data is all open source it won't
> > > > necessarily be GPL compatible.
> > >
> > > I would probably wait until the dust settles before we start accepting
> > > LLM generated code.
> > 
> > I think that's pretty much my take on what this policy is:
> > "say no for now; we can always come back later when the legal
> > situation seems clearer".
> 
> Yes, those were my thoughts exactly.
> 
> And if anyone comes along with a specific LLM/AI code generator that
> they believe can be used in a way compatible with the DCO, they can
> ask for an exception to the general policy which we can discuss then.

Yea. But why do you keep worrying about the LLM/AI mess? Are there code
generators whose output we do allow? What are these?

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 06:39:21AM -0500, Michael S. Tsirkin wrote:
> On Fri, Nov 24, 2023 at 11:37:15AM +0000, Daniel P. Berrangé wrote:
> > On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> > > On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > > > > That probably means we can never use even open source LLMs to generate
> > > > > code for QEMU because while the source data is all open source it won't
> > > > > necessarily be GPL compatible.
> > > >
> > > > I would probably wait until the dust settles before we start accepting
> > > > LLM generated code.
> > > 
> > > I think that's pretty much my take on what this policy is:
> > > "say no for now; we can always come back later when the legal
> > > situation seems clearer".
> > 
> > > Yes, those were my thoughts exactly.
> > 
> > And if anyone comes along with a specific LLM/AI code generator that
> > they believe can be used in a way compatible with the DCO, they can
> > ask for an exception to the general policy which we can discuss then.
> 
> Yea. But why do you keep worrying about the LLM/AI mess? Are there code
> generators whose output we do allow? What are these?

And to clarify I mean source code in the GPL sense so please do not
say "compiler".

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Fri, Nov 24, 2023 at 10:43:05AM +0000, Peter Maydell wrote:
> On Fri, 24 Nov 2023 at 10:42, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 10:33:49AM +0000, Alex Bennée wrote:
> > > That probably means we can never use even open source LLMs to generate
> > > code for QEMU because while the source data is all open source it won't
> > > necessarily be GPL compatible.
> >
> > I would probably wait until the dust settles before we start accepting
> > LLM generated code.
> 
> I think that's pretty much my take on what this policy is:
> "say no for now; we can always come back later when the legal
> situation seems clearer".

Absolutely. So I think we should not try to venture into terminology
such as what AI is, or try to promote legal copyright theories.
ATM there's no good reason for someone who did not write the code
to put their DCO on the code. If it is not clear who wrote the code
because it was generated and not written, then we don't want it.

-- 
MST

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michal Suchánek 2 years, 2 months ago

On Thu, Nov 23, 2023 at 12:57:42PM +0000, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There has been an explosion of interest in so called "AI" (LLM)
> > code generators in the past year or so. Thus far though, this is
> > has not been matched by a broadly accepted legal interpretation
> > of the licensing implications for code generator outputs. While
> > the vendors may claim there is no problem and a free choice of
> > license is possible, they have an inherent conflict of interest
> > in promoting this interpretation. More broadly there is, as yet,
> > no broad consensus on the licensing implications of code generators
> > trained on inputs under a wide variety of licenses.
> >
> > The DCO requires contributors to assert they have the right to
> > contribute under the designated project license. Given the lack
> > of consensus on the licensing of "AI" (LLM) code generator output,
> > it is not considered credible to assert compliance with the DCO
> > clause (b) or (c) where a patch includes such generated code.
> >
> > This patch thus defines a policy that the QEMU project will not
> > accept contributions where use of "AI" (LLM) code generators is
> > either known, or suspected.
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  docs/devel/code-provenance.rst | 40 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/docs/devel/code-provenance.rst b/docs/devel/code-provenance.rst
> > index b4591a2dec..a6e42c6b1b 100644
> > --- a/docs/devel/code-provenance.rst
> > +++ b/docs/devel/code-provenance.rst
> > @@ -195,3 +195,43 @@ example::
> >    Signed-off-by: Some Person <some.person@example.com>
> >    [Rebased and added support for 'foo']
> >    Signed-off-by: New Person <new.person@example.com>
> > +
> > +Use of "AI" (LLM) code generators
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +TL;DR:
> > +
> > +  **Current QEMU project policy is to DECLINE any contributions
> > +  which are believed to include or derive from "AI" (LLM)
> > +  generated code.**
> > +
> > +The existence of "AI" (`Large Language Model <https://en.wikipedia.org/wiki/Large_language_model>`__
> > +/ LLM) code generators raises a number of difficult legal questions, a
> > +number of which impact on Open Source projects. As noted earlier, the
> > +QEMU community requires that contributors certify their patch submissions
> > +are made in accordance with the rules of the :ref:`dco` (DCO). When a
> > +patch contains "AI" generated code this raises difficulties with code
> > +provenence and thus DCO compliance.
> 
> I agree this is going to be a field that keeps lawyers well remunerated
> for the foreseeable future. However I suspect this elides over the main
> use case for LLM generators which is non-novel transformation. One good
> example is generating test fixtures where you write a piece of original
> code and then ask the code completion engine to fill out some unit tests
> to exercise the code. It's boring mechanical work but one an LLM is very
> suited to (even if you might tweak the final result).

It may be suited to produce such code (disputable) but the code is not
suited for inclusion into the project, for legal reasons.

> > +To satisfy the DCO, the patch contributor has to fully understand
> > +the origins and license of code they are contributing to QEMU. The
> > +license terms that should apply to the output of an "AI" code generator
> > +are ill-defined, given that both training data and operation of the
> > +"AI" are typically opaque to the user. Even where the training data
> > +is said to all be open source, it will likely be under a wide variety
> > +of license terms.
> > +
> > +While the vendor's of "AI" code generators may promote the idea that
> > +code output can be taken under a free choice of license, this is not
> > +yet considered to be a generally accepted, nor tested, legal opinion.
> > +
> > +With this in mind, the QEMU maintainers does not consider it is
> > +currently possible to comply with DCO terms (b) or (c) for most "AI"
> > +generated code.
> 
> There is a load of code out there that isn't eligible for copyright protection
> because it doesn't demonstrate much originality or creativity. In the
> experimentation I've done so far I've not seen much sign of genuine
> creativity. LLMs benefit from having access to a wide corpus of
> training data and tend to do a better job of inferring solutions from
> semi-related posts than, say, a human manually comparing posts
> after pasting an error message into Google.

And the license of that corpus of training data is not defined.

If you could erase the copyright on anything by feeding it into a
statistical model and pulling it back out there would be some big
content license holders objecting so it's very unlikely to happen.

Consequently, for all practical purposes the "AI"/LLM output is a
derivative work of the input, with all the legal consequences.

This is, of course, only a problem for *generative* use of AI/LLM where
the output can contain copies of substantial parts of the input.

Thanks

Michal

Re: [PATCH 2/2] docs: define policy forbidding use of "AI" / LLM code generators

Posted by Michael S. Tsirkin 2 years, 2 months ago

On Thu, Nov 23, 2023 at 06:37:47PM +0100, Michal Suchánek wrote:
> If you could erase the copyright on anything by feeding it into a
> statistical model and pulling it back out there would be some big
> content license holders objecting so it's very unlikely to happen.

I won't venture a guess and I think neither should QEMU.  For now, being
on the safe side and rejecting auto-generated code sounds very
reasonable to me, though, in particular because it's often
quite low quality ;).

Not a lawyer, and I don't speak for Red Hat.
-- 
MST