[PATCH] docs/devel/tcg: Expand on multi-threaded TCG

Philippe Mathieu-Daudé posted 1 patch 2 days, 8 hours ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20260528082022.32359-1-philmd@linaro.org
Maintainers: Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>, Richard Henderson <richard.henderson@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>
Posted by Philippe Mathieu-Daudé 2 days, 8 hours ago
Significantly expands the TCG documentation to provide more
comprehensive overview of its internal architecture.

Use more rST anchors to improve cross-referencing across the
documentation.

Clarify front-end / optimization / back-end phases.

Detail a bit memory consistency barriers under MTTCG mode.

Add the following new sections:

 - Register Allocation and Liveness analysis
 - Overviews of the Vector/SIMD internal strategy
 - Deterministic Execution (icount)
 - TCG Plugins
 - Instruction Decoding with decodetree

AI-used-for: docs
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
Based-on: <20260528073412.551117-1-pbonzini@redhat.com>
---
 docs/devel/multi-thread-tcg.rst |  2 +-
 docs/devel/tcg-icount.rst       |  1 +
 docs/devel/tcg.rst              | 89 +++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
index da9a1530c9f..aa0b11ab360 100644
--- a/docs/devel/multi-thread-tcg.rst
+++ b/docs/devel/multi-thread-tcg.rst
@@ -4,7 +4,7 @@
   This work is licensed under the terms of the GNU GPL, version 2 or
   later. See the COPYING file in the top-level directory.
 
-.. _mttcg:
+.. _MTTCG:
 
 ==================
 Multi-threaded TCG
diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
index a1dcd79e0fd..848c19a746f 100644
--- a/docs/devel/tcg-icount.rst
+++ b/docs/devel/tcg-icount.rst
@@ -2,6 +2,7 @@
    Copyright (c) 2020, Linaro Limited
    Written by Alex Bennée
 
+.. _icount:
 
 ========================
 TCG Instruction Counting
diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
index 2786f2f6791..9af06018f6a 100644
--- a/docs/devel/tcg.rst
+++ b/docs/devel/tcg.rst
@@ -13,6 +13,16 @@ performances.
 QEMU's dynamic translation backend is called TCG, for "Tiny Code
 Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
 
+The translation process occurs in several distinct passes:
+
+1. **Front-end**: Guest instructions are parsed (often using the
+   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and converted
+   into target-independent TCG Intermediate Representation (IR) opcodes.
+2. **Optimization**: TCG performs passes such as constant folding, liveness
+   analysis, and dead code elimination on the IR.
+3. **Back-end**: The optimized IR is converted by a host-specific code
+   generator into native instructions for the host CPU.
+
 The following sections outline some notable features and implementation
 details of QEMU's dynamic translator.
 
@@ -44,6 +54,12 @@ translating it from the guest architecture if it isn’t already available
 in memory. Then QEMU proceeds to execute this next TB, starting at the
 prologue and then moving on to the translated instructions.
 
+In :ref:`MTTCG` mode, each guest CPU is emulated by a separate host thread.
+TCG ensures memory consistency by inserting memory barrier (``mb``) opcodes
+for guest instructions with ordering side effects. Direct block chaining
+across page boundaries is restricted to ensure that changes to memory
+mappings in one thread are correctly handled by others.
+
 Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
 callback to be re-evaluated before executing additional instructions.
 It is mandatory to exit this way after any CPU state changes that may
@@ -175,6 +191,12 @@ virtual to physical address translation is done at every memory
 access.
 
 QEMU uses an address translation cache (TLB) to speed up the translation.
+The software MMU partitions accesses into a **TLB fast-path** and a
+**TLB slow-path**. The fast-path handles RAM and ROM areas, where the TLB
+provides the direct offset between guest virtual addresses and host memory.
+If an access does not match a fast-path entry, it falls through to the
+slow-path, which calls C helper functions to handle MMIO device emulation.
+
 In order to avoid flushing the translated code each time the MMU
 mappings change, all caches in QEMU are physically indexed.  This
 means that each basic block is indexed with its physical address.
@@ -190,6 +212,73 @@ memory areas instead calls out to C code for device emulation.
 Finally, the MMU helps tracking dirty pages and pages pointed to by
 translation blocks.
 
+Register Allocation and Liveness
+--------------------------------
+
+During the translation phase, guest instructions are converted into TCG IR
+using an **unlimited number of temporaries (TEMPs)**.
+This allows guest translators to express logic without being constrained
+by the finite register set of the host CPU.
+
+To resolve these TEMPs into physical registers, TCG performs two passes:
+
+1. **Liveness Analysis**: This pass determines the "live range" of each
+   temporary within a basic block. By identifying when a variable
+   becomes "dead" (i.e., its value is no longer needed), TCG can suppress
+   redundant moves and remove instructions that compute unused results.
+2. **Register Allocation**: The Global Register Allocator maps live TEMPs
+   to host physical registers. Fixed globals, such as the pointer
+   to the CPU architecture state (``cpu_env``), are often permanently
+   held in host registers to minimize memory traffic during execution.
+
+Vector/SIMD Internal Strategy
+-----------------------------
+
+TCG supports SIMD operations through a set of generic vector instructions
+(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and element
+size. The length is specified as a ``TCGType`` (V64, V128, or V256), and the
+element size is given in log2 8-bit units.
+
+The internal strategy relies on the backend mapping these generic opcodes
+to native host SIMD instructions, such as x86 AVX or ARM NEON. If the host
+backend does not support a specific vector operation  or length, TCG's
+expansion layer automatically decomposes the opcode into smaller supported
+vector sizes or standard integer operations.
+
+Deterministic Execution (icount)
+--------------------------------
+
+The :ref:`icount` mechanism provides deterministic execution by ensuring
+that each Translation Block executes a fixed number of instructions. This
+is essential for features like record/replay and deterministic virtual time,
+where instruction counts serve as the system clock.
+
+Instrumentation and Plugins
+---------------------------
+
+:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. Opcodes
+like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during translation to
+trigger callbacks in external modules, allowing analysis of instruction
+execution or memory access.
+
+Instruction Decoding (decodetree)
+---------------------------------
+
+The first step of the translation process is converting a raw bitstream of
+guest instructions into a structured format that the translator can process.
+QEMU simplifies this using the ``decodetree.py`` script, which generates C
+code decoders from a domain-specific language defined in ``.decode`` files.
+
+The decodetree tool allows developers to define instruction **patterns**
+based on a bitmask and fixed bits. When a match is found, the generated
+decoder automatically  extracts defined **fields** (such as registers or
+immediates) and passes  them to a manually written translation function.
+
+This declarative approach drastically reduces the amount of error-prone
+manual bit-shifting and nested "if-else" logic required in guest translators.
+
+For detailled implementation see :ref:`decodetree`.
+
 Profiling JITted code
 ---------------------
 
-- 
2.53.0


Re: [PATCH] docs/devel/tcg: Expand on multi-threaded TCG
Posted by Alex Bennée 2 days, 3 hours ago
Philippe Mathieu-Daudé <philmd@linaro.org> writes:

> Significantly expands the TCG documentation to provide more
> comprehensive overview of its internal architecture.
>
> Use more rST anchors to improve cross-referencing across the
> documentation.
>
> Clarify front-end / optimization / back-end phases.
>
> Detail a bit memory consistency barriers under MTTCG mode.
>
> Add the following new sections:
>
>  - Register Allocation and Liveness analysis
>  - Overviews of the Vector/SIMD internal strategy
>  - Deterministic Execution (icount)
>  - TCG Plugins
>  - Instruction Decoding with decodetree
>
> AI-used-for: docs
> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> ---
> Based-on: <20260528073412.551117-1-pbonzini@redhat.com>
> ---
>  docs/devel/multi-thread-tcg.rst |  2 +-
>  docs/devel/tcg-icount.rst       |  1 +
>  docs/devel/tcg.rst              | 89 +++++++++++++++++++++++++++++++++
>  3 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
> index da9a1530c9f..aa0b11ab360 100644
> --- a/docs/devel/multi-thread-tcg.rst
> +++ b/docs/devel/multi-thread-tcg.rst
> @@ -4,7 +4,7 @@
>    This work is licensed under the terms of the GNU GPL, version 2 or
>    later. See the COPYING file in the top-level directory.
>  
> -.. _mttcg:
> +.. _MTTCG:
>  
>  ==================
>  Multi-threaded TCG
> diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
> index a1dcd79e0fd..848c19a746f 100644
> --- a/docs/devel/tcg-icount.rst
> +++ b/docs/devel/tcg-icount.rst
> @@ -2,6 +2,7 @@
>     Copyright (c) 2020, Linaro Limited
>     Written by Alex Bennée
>  
> +.. _icount:
>  
>  ========================
>  TCG Instruction Counting
> diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
> index 2786f2f6791..9af06018f6a 100644
> --- a/docs/devel/tcg.rst
> +++ b/docs/devel/tcg.rst
> @@ -13,6 +13,16 @@ performances.
>  QEMU's dynamic translation backend is called TCG, for "Tiny Code
>  Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
>  
> +The translation process occurs in several distinct passes:
> +
> +1. **Front-end**: Guest instructions are parsed (often using the
> +   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and converted
> +   into target-independent TCG Intermediate Representation (IR) opcodes.
> +2. **Optimization**: TCG performs passes such as constant folding, liveness
> +   analysis, and dead code elimination on the IR.

By the way, not all optimisation is done here: some of the front-end ops
select operations based on TCG_TARGET_HAS_ before we get to the
optimisation pass.
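
As a rough illustration of that point, here is a toy Python model (not QEMU
code) of a front-end helper that chooses which ops to emit at translation
time, before any optimisation pass sees the IR. The ``gen_andc`` helper, the
``host_has_andc`` flag and the tuple-based op list are all hypothetical
stand-ins for a TCG_TARGET_HAS_ style check:

```python
# Toy model only: every name here is hypothetical, not a QEMU API.

def gen_andc(ops, dst, a, b, host_has_andc):
    """Emit IR for dst = a & ~b, picking the ops at translation time."""
    if host_has_andc:
        # The host backend advertises the combined op: emit it directly.
        ops.append(("andc", dst, a, b))
    else:
        # Otherwise expand into two simpler ops that any backend handles.
        ops.append(("not", "tmp", b))
        ops.append(("and", dst, a, "tmp"))
    return ops


print(gen_andc([], "r0", "r1", "r2", host_has_andc=True))
print(gen_andc([], "r0", "r1", "r2", host_has_andc=False))
```

In this model the choice is already baked into the op list by the time a
later pass runs, which is why describing optimisation as a single distinct
phase is a simplification.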

> +3. **Back-end**: The optimized IR is converted by a host-specific code
> +   generator into native instructions for the host CPU.
> +
>  The following sections outline some notable features and implementation
>  details of QEMU's dynamic translator.
>  
> @@ -44,6 +54,12 @@ translating it from the guest architecture if it isn’t already available
>  in memory. Then QEMU proceeds to execute this next TB, starting at the
>  prologue and then moving on to the translated instructions.
>  
> +In :ref:`MTTCG` mode, each guest CPU is emulated by a separate host thread.
> +TCG ensures memory consistency by inserting memory barrier (``mb``) opcodes
> +for guest instructions with ordering side effects. Direct block chaining
> +across page boundaries is restricted to ensure that changes to memory
> +mappings in one thread are correctly handled by others.
> +
>  Exiting from the TB this way will cause the ``cpu_exec_interrupt()``
>  callback to be re-evaluated before executing additional instructions.
>  It is mandatory to exit this way after any CPU state changes that may
> @@ -175,6 +191,12 @@ virtual to physical address translation is done at every memory
>  access.
>  
>  QEMU uses an address translation cache (TLB) to speed up the translation.
> +The software MMU partitions accesses into a **TLB fast-path** and a
> +**TLB slow-path**. The fast-path handles RAM and ROM areas, where the TLB
> +provides the direct offset between guest virtual addresses and host memory.
> +If an access does not match a fast-path entry, it falls through to the
> +slow-path, which calls C helper functions to handle MMIO device emulation.
> +
>  In order to avoid flushing the translated code each time the MMU
>  mappings change, all caches in QEMU are physically indexed.  This
>  means that each basic block is indexed with its physical address.
> @@ -190,6 +212,73 @@ memory areas instead calls out to C code for device emulation.
>  Finally, the MMU helps tracking dirty pages and pages pointed to by
>  translation blocks.
>  
> +Register Allocation and Liveness
> +--------------------------------
> +
> +During the translation phase, guest instructions are converted into TCG IR
> +using an **unlimited number of temporaries (TEMPs)**.
> +This allows guest translators to express logic without being constrained
> +by the finite register set of the host CPU.
> +
> +To resolve these TEMPs into physical registers, TCG performs two passes:
> +
> +1. **Liveness Analysis**: This pass determines the "live range" of each
> +   temporary within a basic block. By identifying when a variable
> +   becomes "dead" (i.e., its value is no longer needed), TCG can suppress
> +   redundant moves and remove instructions that compute unused results.
> +2. **Register Allocation**: The Global Register Allocator maps live TEMPs
> +   to host physical registers. Fixed globals, such as the pointer
> +   to the CPU architecture state (``cpu_env``), are often permanently
> +   held in host registers to minimize memory traffic during execution.
> +
> +Vector/SIMD Internal Strategy
> +-----------------------------
> +
> +TCG supports SIMD operations through a set of generic vector instructions
> +(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and element
> +size. The length is specified as a ``TCGType`` (V64, V128, or V256), and the
> +element size is given in log2 8-bit units.
> +
> +The internal strategy relies on the backend mapping these generic opcodes
> +to native host SIMD instructions, such as x86 AVX or ARM NEON. If the host
> +backend does not support a specific vector operation  or length, TCG's
> +expansion layer automatically decomposes the opcode into smaller supported
> +vector sizes or standard integer operations.
> +
> +Deterministic Execution (icount)
> +--------------------------------
> +
> +The :ref:`icount` mechanism provides deterministic execution by ensuring
> +that each Translation Block executes a fixed number of instructions. This
> +is essential for features like record/replay and deterministic virtual time,
> +where instruction counts serve as the system clock.
> +
> +Instrumentation and Plugins
> +---------------------------
> +
> +:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. Opcodes
> +like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during translation to
> +trigger callbacks in external modules, allowing analysis of instruction
> +execution or memory access.
> +
> +Instruction Decoding (decodetree)
> +---------------------------------
> +
> +The first step of the translation process is converting a raw bitstream of
> +guest instructions into a structured format that the translator can process.
> +QEMU simplifies this using the ``decodetree.py`` script, which generates C
> +code decoders from a domain-specific language defined in ``.decode`` files.
> +
> +The decodetree tool allows developers to define instruction **patterns**
> +based on a bitmask and fixed bits. When a match is found, the generated
> +decoder automatically  extracts defined **fields** (such as registers or
> +immediates) and passes  them to a manually written translation function.
> +
> +This declarative approach drastically reduces the amount of error-prone
> +manual bit-shifting and nested "if-else" logic required in guest translators.
> +
> +For detailled implementation see :ref:`decodetree`.
> +
>  Profiling JITted code
>  ---------------------

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH] docs/devel/tcg: Expand on multi-threaded TCG
Posted by Paolo Bonzini 2 days, 7 hours ago
Before looking at the specifics, I appreciate you being bold and experimenting 
with how to improve our documentation. I can see that this is largely 
unedited LLM output, and honestly, that is actually a good thing for 
this experiment.

On the other hand, it exposes where the tool falls short, and highlights 
very clearly the risks of accepting AI-generated content too leisurely.

On 5/28/26 10:20, Philippe Mathieu-Daudé wrote:
> Significantly expands the TCG documentation to provide more
> comprehensive overview of its internal architecture.
> 
> Use more rST anchors to improve cross-referencing across the
> documentation.
> 
> Clarify front-end / optimization / back-end phases.
> 
> Detail a bit memory consistency barriers under MTTCG mode.
> 
> Add the following new sections:
> 
>   - Register Allocation and Liveness analysis
>   - Overviews of the Vector/SIMD internal strategy
>   - Deterministic Execution (icount)
>   - TCG Plugins
>   - Instruction Decoding with decodetree

This commit message is not really up to the standards.  It is purely a 
"what" which can be obtained just by glancing at the section headers.

It should explain the purpose of tcg.rst and why these new sections were 
singled out.
> +The translation process occurs in several distinct passes:
> +
> +1. **Front-end**: Guest instructions are parsed (often using the
> +   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and converted
> +   into target-independent TCG Intermediate Representation (IR) opcodes.
> +2. **Optimization**: TCG performs passes such as constant folding, liveness
> +   analysis, and dead code elimination on the IR.
> +3. **Back-end**: The optimized IR is converted by a host-specific code
> +   generator into native instructions for the host CPU.

The sections below should be sorted according to these sections, when 
applicable.

Register allocation also fits somewhere, probably in "back-end".

There should be also another sentence for the TCG run-time (accel/tcg).
> +Register Allocation and Liveness
> +--------------------------------
> +
> +During the translation phase, guest instructions are converted into TCG IR
> +using an **unlimited number of temporaries (TEMPs)**.
> +This allows guest translators to express logic without being constrained
> +by the finite register set of the host CPU.
> +
> +To resolve these TEMPs into physical registers, TCG performs two passes:
> +
> +1. **Liveness Analysis**: This pass determines the "live range" of each
> +   temporary within a basic block. By identifying when a variable
> +   becomes "dead" (i.e., its value is no longer needed), TCG can suppress
> +   redundant moves and remove instructions that compute unused results.
> +2. **Register Allocation**: The Global Register Allocator maps live TEMPs
> +   to host physical registers. Fixed globals, such as the pointer
> +   to the CPU architecture state (``cpu_env``), are often permanently
> +   held in host registers to minimize memory traffic during execution.
> +
> +Vector/SIMD Internal Strategy
> +-----------------------------
> +
> +TCG supports SIMD operations through a set of generic vector instructions
> +(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and element
> +size. The length is specified as a ``TCGType`` (V64, V128, or V256), and the
> +element size is given in log2 8-bit units.
> +
> +The internal strategy relies on the backend mapping these generic opcodes
> +to native host SIMD instructions, such as x86 AVX or ARM NEON. If the host
> +backend does not support a specific vector operation  or length, TCG's
> +expansion layer automatically decomposes the opcode into smaller supported
> +vector sizes or standard integer operations.
> +
> +Deterministic Execution (icount)
> +--------------------------------
> +
> +The :ref:`icount` mechanism provides deterministic execution by ensuring
> +that each Translation Block executes a fixed number of instructions.

Hallucination (to put it kindly).  It ensures that QEMU_CLOCK_VIRTUAL is 
a multiple of the number of instructions executed.
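
A minimal sketch of that relationship, as a toy Python model rather than QEMU
code: it assumes each executed instruction accounts for ``2**shift``
nanoseconds of virtual time, in the spirit of the ``shift=N`` parameter of
``-icount``. The function name is hypothetical.

```python
# Toy model only: not QEMU code, and the function name is made up.

def virtual_clock_ns(insns_executed, shift):
    """Virtual time when every instruction accounts for 2**shift ns."""
    return insns_executed << shift


# The clock is a pure function of the instruction count, so it does not
# matter how the instructions were split across translation blocks.
one_run = virtual_clock_ns(1000, shift=3)
two_runs = virtual_clock_ns(600, shift=3) + virtual_clock_ns(400, shift=3)
assert one_run == two_runs
print(one_run)
```

Nothing in this model requires a translation block to contain a fixed number
of instructions; only the running total matters.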

> This
> +is essential for features like record/replay and deterministic virtual time,
> +where instruction counts serve as the system clock.
> +
> +Instrumentation and Plugins
> +---------------------------
> +
> +:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. Opcodes
> +like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during translation to
> +trigger callbacks in external modules, allowing analysis of instruction
> +execution or memory access.
> +
> +Instruction Decoding (decodetree)
> +---------------------------------
> +
> +The first step of the translation process is converting a raw bitstream of
> +guest instructions into a structured format that the translator can process.

Is this true?  Maybe "extracting operands from the raw bitstream of 
guest instructions, for easier processing in the translator"?

> +QEMU simplifies this using the ``decodetree.py`` script, which generates C
> +code decoders from a domain-specific language defined in ``.decode`` files.
> +
> +The decodetree tool allows developers to define instruction **patterns**
> +based on a bitmask and fixed bits. When a match is found, the generated
> +decoder automatically  extracts defined **fields** (such as registers or
> +immediates) and passes  them to a manually written translation function.
> +
> +This declarative approach drastically reduces the amount of error-prone
> +manual bit-shifting and nested "if-else" logic required in guest translators.

I would just say "``decodetree`` simplifies writing and maintaining the 
front-end compared to manual decoding".  Maybe it's worth adding 
something like "Note however that it is mostly applicable to processors 
whose instruction encoding is fixed length, or mostly fixed length.".
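
For readers unfamiliar with the idea, the following toy Python sketch shows
the kind of mask/fixed-bits matching and field extraction that a generated
decoder performs. It is hand-written for illustration and is not output of
``decodetree.py``; the 32-bit encoding, the pattern table and all names are
invented.

```python
# Toy model only: an invented 32-bit encoding, not a real guest ISA and
# not the code that decodetree.py generates.
PATTERNS = [
    # (name, mask, fixed bits, {field: (bit position, bit length)})
    ("add", 0xFF000000, 0x01000000, {"rd": (16, 4), "rn": (8, 4), "rm": (0, 4)}),
    ("movi", 0xFF000000, 0x02000000, {"rd": (16, 4), "imm": (0, 16)}),
]


def extract(insn, pos, length):
    """Return the unsigned bitfield insn[pos + length - 1 : pos]."""
    return (insn >> pos) & ((1 << length) - 1)


def decode(insn):
    """Return (pattern name, extracted fields), or None if nothing matches."""
    for name, mask, fixed, fields in PATTERNS:
        if (insn & mask) == fixed:
            args = {f: extract(insn, pos, ln) for f, (pos, ln) in fields.items()}
            return name, args
    return None


print(decode(0x01020304))
print(decode(0x0205ABCD))
```

The table assumes every instruction is one 32-bit word, which is what makes
this style of matching straightforward and is consistent with the note above
about fixed-length encodings.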

> +For detailled implementation see :ref:`decodetree`.

"detailed".

Honestly, I'm not impressed by the quality of the output.  There's no 
organization, just a bunch of new sections in no order (decodetree comes 
last).  They might be good enough for a glossary, but for developer 
documentation it would just add structural debt(*).  At the very least 
all the "----"-level sections should be split into front-end, 
optimization, back-end and run-time.

Again, this is not about you---I hope you knew that this wasn't going to 
be included as is. :)  Submitting this without manual editing shows the 
baseline capabilities of the LLM and highlights the importance of human 
steering.

Paolo

(*) I have just made this term up, but I think it should be a thing - we 
have a lot of it already in QEMU docs


Re: [PATCH] docs/devel/tcg: Expand on multi-threaded TCG
Posted by Philippe Mathieu-Daudé 2 days, 6 hours ago
On 28/5/26 11:04, Paolo Bonzini wrote:
> Before looking at the specifics, I appreciate you being bold and experimenting 
> with how to improve our documentation. I can see that this is largely 
> unedited LLM output, and honestly, that is actually a good thing for 
> this experiment.
> 
> On the other hand, it exposes where the tool falls short, and highlights 
> very clearly the risks of accepting AI-generated content too leisurely.
> 
> On 5/28/26 10:20, Philippe Mathieu-Daudé wrote:
>> Significantly expands the TCG documentation to provide more
>> comprehensive overview of its internal architecture.
>>
>> Use more rST anchors to improve cross-referencing across the
>> documentation.
>>
>> Clarify front-end / optimization / back-end phases.
>>
>> Detail a bit memory consistency barriers under MTTCG mode.
>>
>> Add the following new sections:
>>
>>   - Register Allocation and Liveness analysis
>>   - Overviews of the Vector/SIMD internal strategy
>>   - Deterministic Execution (icount)
>>   - TCG Plugins
>>   - Instruction Decoding with decodetree
> 
> This commit message is not really up to the standards.  It is purely a 
> "what" which can be obtained just by glancing at the section headers.
> 
> It should explain the purpose of tcg.rst and why these new sections were 
> singled out.
>> +The translation process occurs in several distinct passes:
>> +
>> +1. **Front-end**: Guest instructions are parsed (often using the
>> +   `decodetree <Instruction Decoding (decodetree)_>`_ tool) and 
>> converted
>> +   into target-independent TCG Intermediate Representation (IR) opcodes.
>> +2. **Optimization**: TCG performs passes such as constant folding, 
>> liveness
>> +   analysis, and dead code elimination on the IR.
>> +3. **Back-end**: The optimized IR is converted by a host-specific code
>> +   generator into native instructions for the host CPU.
> 
> The sections below should be sorted according to these sections, when 
> applicable.
> 
> Register allocation also fits somewhere, probably in "back-end".
> 
> There should be also another sentence for the TCG run-time (accel/tcg).
>> +Register Allocation and Liveness
>> +--------------------------------
>> +
>> +During the translation phase, guest instructions are converted into 
>> TCG IR
>> +using an **unlimited number of temporaries (TEMPs)**.
>> +This allows guest translators to express logic without being constrained
>> +by the finite register set of the host CPU.
>> +
>> +To resolve these TEMPs into physical registers, TCG performs two passes:
>> +
>> +1. **Liveness Analysis**: This pass determines the "live range" of each
>> +   temporary within a basic block. By identifying when a variable
>> +   becomes "dead" (i.e., its value is no longer needed), TCG can 
>> suppress
>> +   redundant moves and remove instructions that compute unused results.
>> +2. **Register Allocation**: The Global Register Allocator maps live 
>> TEMPs
>> +   to host physical registers. Fixed globals, such as the pointer
>> +   to the CPU architecture state (``cpu_env``), are often permanently
>> +   held in host registers to minimize memory traffic during execution.
>> +
>> +Vector/SIMD Internal Strategy
>> +-----------------------------
>> +
>> +TCG supports SIMD operations through a set of generic vector 
>> instructions
>> +(e.g., ``add_vec``, ``shli_vec``) parameterized by vector length and 
>> element
>> +size. The length is specified as a ``TCGType`` (V64, V128, or V256), 
>> and the
>> +element size is given in log2 8-bit units.
>> +
>> +The internal strategy relies on the backend mapping these generic 
>> opcodes
>> +to native host SIMD instructions, such as x86 AVX or ARM NEON. If the 
>> host
>> +backend does not support a specific vector operation  or length, TCG's
>> +expansion layer automatically decomposes the opcode into smaller 
>> supported
>> +vector sizes or standard integer operations.
>> +
>> +Deterministic Execution (icount)
>> +--------------------------------
>> +
>> +The :ref:`icount` mechanism provides deterministic execution by ensuring
>> +that each Translation Block executes a fixed number of instructions.
> 
> Hallucination (to put it kindly).  It ensures that QEMU_CLOCK_VIRTUAL is 
> a multiple of the number of instructions executed.
> 
>> This
>> +is essential for features like record/replay and deterministic 
>> virtual time,
>> +where instruction counts serve as the system clock.
>> +
>> +Instrumentation and Plugins
>> +---------------------------
>> +
>> +:ref:`TCG Plugins` provide a mechanism for runtime instrumentation. 
>> Opcodes
>> +like ``plugin_cb`` and ``plugin_mem_cb`` are inserted during 
>> translation to
>> +trigger callbacks in external modules, allowing analysis of instruction
>> +execution or memory access.
>> +
>> +Instruction Decoding (decodetree)
>> +---------------------------------
>> +
>> +The first step of the translation process is converting a raw 
>> bitstream of
>> +guest instructions into a structured format that the translator can 
>> process.
> 
> Is this true?  Maybe "extracting operands from the raw bitstream of 
> guest instructions, for easier processing in the translator"?
> 
>> +QEMU simplifies this using the ``decodetree.py`` script, which 
>> generates C
>> +code decoders from a domain-specific language defined in ``.decode`` 
>> files.
>> +
>> +The decodetree tool allows developers to define instruction **patterns**
>> +based on a bitmask and fixed bits. When a match is found, the generated
>> +decoder automatically  extracts defined **fields** (such as registers or
>> +immediates) and passes  them to a manually written translation function.
>> +
>> +This declarative approach drastically reduces the amount of error-prone
>> +manual bit-shifting and nested "if-else" logic required in guest 
>> translators.
> 
> I would just say "``decodetree`` simplifies writing and maintaining the 
> front-end compared to manual decoding".  Maybe it's worth adding 
> something like "Note however that it is mostly applicable to processors 
> whose instruction encoding is fixed length, or mostly fixed length.".
> 
>> +For detailled implementation see :ref:`decodetree`.
> 
> "detailed".
> 
> Honestly, I'm not impressed by the quality of the output.  There's no 
> organization, just a bunch of new sections in no order (decodetree comes 
> last).  They might be good enough for a glossary, but for developer 
> documentation it would just add structural debt(*).  At the very least 
> all the "----"-level sections should be split into front-end, 
> optimization, back-end and run-time.
> 
> Again, this is not about you---I hope you knew that this wasn't going to 
> be included as is. :)  Submitting this without manual editing shows the 
> baseline capabilities of the LLM and highlights the importance of human 
> steering.

Thanks for the quick feedback. Yeah, I wanted to test the waters with a
patch (and ought to have posted it as an RFC). Some TCG concepts are hard
to grasp, and I indeed trusted the LLM too much to be both fluent in
English and a technical expert. I told it "You are what best understand QEMU
internals, and a TCG expert. You don't make mistakes and do not lie.
You write in a perfectly understandable technical English language."

The blame is on me for being too confident here <:)

> 
> Paolo
> 
> (*) I have just made this term up, but I think it should be a thing - we 
> have a lot of it already in QEMU docs
>