plugins: Introduce Fault Injection framework and API extensions

[RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Ruslan Ruslichenko 2 weeks, 5 days ago

From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>

This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.

Motivation

Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.

Architecture & Key Features

The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
The plugin can be configured statically through an XML file at boot, or controlled dynamically at runtime through a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
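
As a rough illustration of the runtime control path, a test harness could talk to the plugin's UNIX socket along the following lines. This is only a sketch: the cover letter does not describe the wire format, so the newline-terminated request/reply framing, the function name send_fault_request and the socket path are all assumptions made for this example.

```python
import socket


def send_fault_request(sock_path: str, request: str, timeout: float = 2.0) -> str:
    """Send one newline-terminated request and return the reply line.

    The line-based framing is an assumption made for this sketch; it is
    not taken from the series.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(request.encode("utf-8") + b"\n")
        reply = b""
        # Read until the peer finishes a line or closes the connection.
        while not reply.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
    return reply.decode("utf-8").rstrip("\n")
```

A harness would call this once per injection request and check the reply before continuing the test. The actual request syntax has to come from the plugin documentation in patch 9.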

New Plugin API Capabilities:

MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
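
To make the MMIO interception semantics concrete, here is a small stand-alone model of "modify reads or drop writes". It is not QEMU code and does not use the plugin API from this series. ToyMmioBus and the hook signatures are hypothetical names chosen only to show the intended behaviour.

```python
from typing import Callable, Dict, Optional

# A read hook receives (addr, value_from_device) and returns the value the guest sees.
ReadHook = Callable[[int, int], int]
# A write hook receives (addr, value) and returns True to let the write through.
WriteHook = Callable[[int, int], bool]


class ToyMmioBus:
    """Toy register file with optional interception hooks (not QEMU code)."""

    def __init__(self) -> None:
        self.regs: Dict[int, int] = {}
        self.read_hook: Optional[ReadHook] = None
        self.write_hook: Optional[WriteHook] = None

    def read(self, addr: int) -> int:
        value = self.regs.get(addr, 0)
        if self.read_hook is not None:
            # The hook may replace the value returned to the guest.
            value = self.read_hook(addr, value)
        return value

    def write(self, addr: int, value: int) -> None:
        if self.write_hook is not None and not self.write_hook(addr, value):
            return  # The hook asked for this write to be dropped.
        self.regs[addr] = value


bus = ToyMmioBus()
bus.write(0x10, 0xAB)
# Fault 1: reads of register 0x10 return all-ones instead of the stored value.
bus.read_hook = lambda addr, value: 0xFFFFFFFF if addr == 0x10 else value
# Fault 2: writes to register 0x14 are silently dropped.
bus.write_hook = lambda addr, value: addr != 0x14
bus.write(0x14, 0x1)
```

In this model the device state itself is untouched by the read override, which matches the idea of corrupting what the guest observes rather than what the device holds.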

Patch Summary
Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
Patches 2-3 (plugins/api): Expose virtual clock timers and TB cache flushing to the public plugin API.
Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
Patch 9 (docs): Adds documentation and usage examples for the plugin.

Request for Comments & Feedback

Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.

Ruslan Ruslichenko (9):
  target/arm: Add API for dynamic exception injection
  plugins/api: Expose virtual clock timers to plugins
  plugins: Expose Transaction Block cache flush API to plugins
  plugins: Introduce fault injection API and core subsystem
  system/memory: Add plugin callbacks to intercept MMIO accesses
  hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
  hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
  contrib/plugins: Add fault injection plugin
  docs: Add description of fault-injection plugin and subsystem

 contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
 contrib/plugins/meson.build       |   1 +
 docs/fault-injection.txt          | 111 +++++
 hw/arm/smmuv3.c                   |  54 +++
 hw/intc/arm_gic.c                 |  28 ++
 hw/intc/arm_gicv3.c               |  28 ++
 include/plugins/qemu-plugin.h     |  28 ++
 include/qemu/plugin.h             |  39 ++
 plugins/api.c                     |  62 +++
 plugins/core.c                    |  11 +
 plugins/fault.c                   | 116 +++++
 plugins/meson.build               |   1 +
 plugins/plugin.h                  |   2 +
 system/memory.c                   |   8 +
 target/arm/cpu.h                  |   4 +
 target/arm/helper.c               |  55 +++
 16 files changed, 1320 insertions(+)
 create mode 100644 contrib/plugins/fault_injection.c
 create mode 100644 docs/fault-injection.txt
 create mode 100644 plugins/fault.c

-- 
2.43.0

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 2 weeks, 4 days ago

Hi Ruslan,

On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
> [...]

First, thanks for posting your series!

About the general approach: as you noticed, this exposes a lot of QEMU
internals, which is something we tend to avoid. It is also very
architecture-specific, which is another pattern we try to avoid.

For some of your needs (especially IRQ injection and timer injection), 
did you consider writing a custom ad-hoc device and timer generating those?
There is nothing preventing you from writing a plugin that can 
communicate with this specific device (through a socket for instance), 
to request specific injections. I feel that it would scale better than
exposing all this through the QEMU plugin API.

For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu
test device, used with qtest to unit test the SMMU implementation. We
could maybe leverage that on a full machine, combined with the
communication method mentioned above, to generate specific operations at
runtime, all triggered via a plugin.

Exposing qemu_plugin_flush_tb_cache is a hint that we are missing
something on the QEMU side. It is better to fix that than to expose this
very internal function.
The associated TRIGGER_ON_PC is very similar to existing inline 
operations. They could be enhanced to support writing to a given 
register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit 
more complex, but we might enhance inline operations also to support 
hooks on specific register writes.

For MMIO override, the current approach you have is good, and it's 
definitely something we could integrate.

What are your thoughts about this (especially the device-based approach,
in case you tried it first)?

Regards,
Pierrick

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Ruslan Ruslichenko 2 weeks, 3 days ago

Hi Pierrick,

Thank you for the feedback and review!

Our current plan is to put this plugin through our internal workflows to gather
more data on its limitations and performance.
Based on results, we may consider extending or refining the implementation
in the future.

Any further feedback on potential issues is highly appreciated.

On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
<pierrick.bouvier@linaro.org> wrote:
>
> Hi Ruslan,
>
> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
> > [...]
>
> First, thanks for posting your series!
>
> About the general approach: as you noticed, this exposes a lot of QEMU
> internals, which is something we tend to avoid. It is also very
> architecture-specific, which is another pattern we try to avoid.
>
> For some of your needs (especially IRQ injection and timer injection),
> did you consider writing a custom ad-hoc device and timer generating those?
> There is nothing preventing you from writing a plugin that can
> communicate with this specific device (through a socket for instance),
> to request specific injections. I feel that it would scale better than
> exposing all this through the QEMU plugin API.
>
> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu
> test device, used with qtest to unit test the SMMU implementation. We
> could maybe leverage that on a full machine, combined with the
> communication method mentioned above, to generate specific operations at
> runtime, all triggered via a plugin.
>
> Exposing qemu_plugin_flush_tb_cache is a hint that we are missing
> something on the QEMU side. It is better to fix that than to expose this
> very internal function.

This was needed because the plugin may receive a PC trigger configuration
dynamically and then needs to register an instruction callback at runtime.
If the TB for that PC is already translated and cached, the newly registered
callback might not be executed.

If there is a more proper way to force QEMU to re-translate a specific TB,
or to attach a callback to a cached TB, it would be great to reduce the
complexity here.
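
The stale-TB problem described above can be reproduced with a toy translate-once, execute-many model. This is not QEMU code: ToyTranslator, its block length and its flush() method are hypothetical stand-ins, used only to show why a hook registered after translation is missed until the cached block is regenerated.

```python
from typing import Callable, Dict, List, Set


class ToyTranslator:
    """Toy model of a translation cache (not QEMU code)."""

    def __init__(self) -> None:
        self.hooked_pcs: Set[int] = set()  # PCs the "plugin" wants to observe
        self.cache: Dict[int, List[Callable[[], None]]] = {}  # block start PC -> ops
        self.hits: List[int] = []  # PCs whose callback actually ran

    def _translate(self, start_pc: int, length: int) -> List[Callable[[], None]]:
        ops: List[Callable[[], None]] = []
        for pc in range(start_pc, start_pc + length):
            # The instrumentation decision is taken here, at translation time only.
            if pc in self.hooked_pcs:
                ops.append(lambda pc=pc: self.hits.append(pc))
        return ops

    def execute(self, start_pc: int, length: int = 4) -> None:
        # A cached block is reused as-is; hooks added later are not seen.
        if start_pc not in self.cache:
            self.cache[start_pc] = self._translate(start_pc, length)
        for op in self.cache[start_pc]:
            op()

    def flush(self) -> None:
        # Dropping the cache forces re-translation on the next execution.
        self.cache.clear()


cpu = ToyTranslator()
cpu.execute(0x1000)            # block is translated and cached without hooks
cpu.hooked_pcs.add(0x1002)     # trigger arrives at runtime
cpu.execute(0x1000)            # cached block runs, callback is missed
missed = list(cpu.hits)
cpu.flush()
cpu.execute(0x1000)            # block is re-translated, callback now fires
```

In the model, flushing everything is the blunt fix. Invalidating only the block that contains the hooked PC would be the finer-grained alternative asked about above.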

> The associated TRIGGER_ON_PC is very similar to existing inline
> operations. They could be enhanced to support writing to a given
> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
> more complex, but we might enhance inline operations also to support
> hooks on specific register writes.

TRIGGER_ON_PC may also be used to generate other faults. For example,
one use case is to trigger CPU exceptions on specific instructions.
Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a really
interesting direction to explore.

>
> For MMIO override, the current approach you have is good, and it's
> definitely something we could integrate.
>
> What are your thoughts about this (especially the device-based approach,
> in case you tried it first)?

I agree such an approach can work well for IRQs and timers, and would be
a cleaner way to implement this.

However, for SMMU and similar cases, triggering internal state errors is not
easy and requires accessing internal logic. So for those specific cases,
a different approach may be needed.

>
> Regards,
> Pierrick

BR,
Ruslan

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 2 weeks, 3 days ago

On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
> Hi Pierrick,
> 
> Thank you for the feedback and review!
> 
> Our current plan is to put this plugin through our internal workflows to gather
> more data on its limitations and performance.
> Based on results, we may consider extending or refining the implementation
> in the future.
> 
> Any further feedback on potential issues is highly appreciated.
>

By design, the approach of modifying QEMU internals to inject an IRQ,
set a timer, or trigger an SMMU fault has very little chance of being
integrated as is. At the very least, it should be discussed with the
relevant maintainers to see whether they would be open to it.

It's not wrong in itself if you want a downstream solution, but it does
not scale upstream if we have to consider and accept everyone's needs.
The plugin API itself can accept the burden for such things, but it's
harder to justify for internal stuff.

I believe it would be better to rely on ad hoc devices generating this,
with the advantage that even if they don't get accepted upstream, they
will be easier for you to maintain downstream than more intrusive
patches.

> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
> <pierrick.bouvier@linaro.org> wrote:
>>
>> Hi Ruslan,
>>
>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>> [...]
>>
>> First, thanks for posting your series!
>>
>> About the general approach: as you noticed, this exposes a lot of QEMU
>> internals, which is something we tend to avoid. It is also very
>> architecture-specific, which is another pattern we try to avoid.
>>
>> For some of your needs (especially IRQ injection and timer injection),
>> did you consider writing a custom ad-hoc device and timer generating those?
>> There is nothing preventing you from writing a plugin that can
>> communicate with this specific device (through a socket for instance),
>> to request specific injections. I feel that it would scale better than
>> exposing all this through the QEMU plugin API.
>>
>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu
>> test device, used with qtest to unit test the SMMU implementation. We
>> could maybe leverage that on a full machine, combined with the
>> communication method mentioned above, to generate specific operations at
>> runtime, all triggered via a plugin.
>>
>> Exposing qemu_plugin_flush_tb_cache is a hint that we are missing
>> something on the QEMU side. It is better to fix that than to expose this
>> very internal function.
> 
> This was needed because the plugin may receive a PC trigger configuration
> dynamically and then needs to register an instruction callback at runtime.
> If the TB for that PC is already translated and cached, the newly registered
> callback might not be executed.
> 
> If there is a more proper way to force QEMU to re-translate a specific TB,
> or to attach a callback to a cached TB, it would be great to reduce the
> complexity here.
>

I understand better now. The current QEMU plugin implementation is too
limited for this: everything has to be done and known at translation time.
What is your use case for receiving a PC trigger after translation? Do you
have some mechanism to communicate with the plugin for this?

>> The associated TRIGGER_ON_PC is very similar to existing inline
>> operations. They could be enhanced to support writing to a given
>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>> more complex, but we might enhance inline operations also to support
>> hooks on specific register writes.
> 
> TRIGGER_ON_PC may also be used to generate other faults. For example,
> one use case is to trigger CPU exceptions on specific instructions.
> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a really
> interesting direction to explore.
>

In general, having inline operation support for register reads/writes
would be a very nice thing to have (though it might be tricky to implement
correctly), and more efficient than the existing approach, which requires
checking their value every time.

>>
>> For MMIO override, the current approach you have is good, and it's
>> definitely something we could integrate.
>>
>> What are your thoughts about this (especially the device-based approach,
>> in case you tried it first)?
> 
> I agree such an approach can work well for IRQs and timers, and would be
> a cleaner way to implement this.
> 
> However, for SMMU and similar cases, triggering internal state errors is not
> easy and requires accessing internal logic. So for those specific cases,
> a different approach may be needed.
>

Hence the iommu-testdev I mentioned, which could be extended to support this.

>>
>> Regards,
>> Pierrick
> 
> BR,
> Ruslan

Regards,
Pierrick

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Ruslan Ruslichenko 2 weeks, 3 days ago

On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
<pierrick.bouvier@linaro.org> wrote:
>
> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
> > Hi Pierrick,
> >
> > Thank you for the feedback and review!
> >
> > Our current plan is to put this plugin through our internal workflows to gather
> > more data on its limitations and performance.
> > Based on results, we may consider extending or refining the implementation
> > in the future.
> >
> > Any further feedback on potential issues is highly appreciated.
> >
>
> By design, the approach of modifying QEMU internals to inject an IRQ,
> set a timer, or trigger an SMMU fault has very little chance of being
> integrated as is. At the very least, it should be discussed with the
> relevant maintainers to see whether they would be open to it.
>
> It's not wrong in itself if you want a downstream solution, but it does
> not scale upstream if we have to consider and accept everyone's needs.
> The plugin API itself can accept the burden for such things, but it's
> harder to justify for internal stuff.
>
> I believe it would be better to rely on ad hoc devices generating this,
> with the advantage that even if they don't get accepted upstream, they
> will be easier for you to maintain downstream than more intrusive
> patches.
>
> > On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
> > <pierrick.bouvier@linaro.org> wrote:
> >>
> >> Hi Ruslan,
> >>
> >> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
> >>> [...]
> >>
> >> first, thanks for posting your series!
> >>
> >> About the general approach.
> >> As you noticed, this is exposing a lot of QEMU internals, and it's
> >> something we tend to avoid to do. As well, it's very architecture
> >> specific, which is another pattern we try to avoid.
> >>
> >> For some of your needs (especially IRQ injection and timer injection),
> >> did you consider writing a custom ad-hoc device and timer generating those?
> >> There is nothing preventing you from writing a plugin that can
> >> communicate with this specific device (through a socket for instance),
> >> to request specific injections. I feel that it would scale better than
> >> exposing all this to QEMU plugins API.
> >>
> >> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
> >> device, associated to qtest to unit test the smmu implementation. We
> >> could maybe see to leverage that on a full machine, associated with the
> >> communication method mentioned above, to generate specific operations at
> >> runtime, all triggered via a plugin.
> >>
> >> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
> >> on QEMU side. Better to fix it than expose this very internal function.
> >
> > The reason this was needed is that the plugin may receive PC trigger
> > configuration
> > dynamically and need to register instruction callback at runtime.
> > If the TB for that PC is already translated and cached, our newly registered
> > callback might not be executed.
> >
> > If there is a more proper way to force QEMU to re-translate a specific
> > TB or attach
> > a callback to cached TB it would be great to reduce the complexity here.
> >
>
> I understand better. QEMU plugin current implementation is too limited
> for this, and everything has to be done/known at translation time.
> What is your use case for receiving PC trigger after translation? Do you
> have some mechanism to communicate with the plugin for this?

Yes, exactly. If the guest has already executed the target code, the newly
added trigger will be ignored, as the TB is cached.

For runtime configuration, the plugin spawns a background thread that listens
on a socket. An external Python test script connects to this socket to send
dynamically generated XML fault definitions.

There are several scenarios where this might be needed, mainly for faults that
are difficult to define statically at boot time.
Examples include injecting faults after a specific chain of events, or freezing
or overriding system register values at specific execution points (since this
is currently implemented via PC triggers). Supporting environments with KASLR
enabled might be one more case.

>
> >> The associated TRIGGER_ON_PC is very similar to existing inline
> >> operations. They could be enhanced to support writing to a given
> >> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
> >> more complex, but we might enhance inline operations also to support
> >> hooks on specific register writes.
> >
> > TRIGGER_ON_PC may also be used for generating other faults too. For example,
> > one use-case is to trigger CPU exceptions on specific instructions.
> > Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
> > really interesting
> > direction to explore.
> >
>
> In general, having inline operations support on register read/writes
> would be a very nice thing to have (though might be tricky to implement
> correctly), and more efficient than the existing approach, which requires
> checking their value every time.
>
> >>
> >> For MMIO override, the current approach you have is good, and it's
> >> definitely something we could integrate.
> >>
> >> What are your thoughts about this? (especially the device-based approach,
> >> in case you tried it first).
> >
> > I agree such an approach can work well for IRQ's and Timers, and would be
> > more clean way to implement this.
> >
> > However, for SMMU and similar cases, triggering internal state errors is not
> > easy and requires accessing internal logic. So for those specific cases,
> > a different approach may be needed.
> >
>
> Thus the iommu-testdev I mentioned, that could be extended to support this.
>
> >>
> >> Regards,
> >> Pierrick
> >
> > BR,
> > Ruslan
>
> Regards,
> Pierrick

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 2 weeks, 2 days ago

On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
> On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
> <pierrick.bouvier@linaro.org> wrote:
>>
>> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
>>> Hi Pierrick,
>>>
>>> Thank you for the feedback and review!
>>>
>>> Our current plan is to put this plugin through our internal workflows to gather
>>> more data on its limitations and performance.
>>> Based on results, we may consider extending or refining the implementation
>>> in the future.
>>>
>>> Any further feedback on potential issues is highly appreciated.
>>>
>>
>> By design, the approach of modifying QEMU internals to allow to inject
>> IRQ, set a timer, or trigger SMMU has very few chances to be integrated
>> as it is. At least, it should be discussed with the concerned
>> maintainers, and see if they would be open to it or not.
>>
>> It's not wrong in itself, if you want a downstream solution, but it does
>> not scale upstream if we have to consider and accept everyone's needs.
>> The plugin API in itself can accept the burden for such things, but it's
>> harder to justify for internal stuff.
>>
>> I believe it would be better to rely on ad hoc devices generating this,
>> with the advantage that even if they don't get accepted upstream, it
>> will be more easy for you to maintain them downstream compared to more
>> intrusive patches.
>>
>>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
>>> <pierrick.bouvier@linaro.org> wrote:
>>>>
>>>> Hi Ruslan,
>>>>
>>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
>>>>>
>>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
>>>>>
>>>>> Motivation
>>>>>
>>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
>>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
>>>>>
>>>>> Architecture & Key Features
>>>>>
>>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
>>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
>>>>>
>>>>> New Plugin API Capabilities:
>>>>>
>>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
>>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
>>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
>>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
>>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
>>>>>
>>>>> Patch Summary
>>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
>>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
>>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
>>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
>>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
>>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
>>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
>>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
>>>>>
>>>>> Request for Comments & Feedback
>>>>>
>>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
>>>>>
>>>>> Ruslan Ruslichenko (9):
>>>>>      target/arm: Add API for dynamic exception injection
>>>>>      plugins/api: Expose virtual clock timers to plugins
>>>>>      plugins: Expose Transaction Block cache flush API to plugins
>>>>>      plugins: Introduce fault injection API and core subsystem
>>>>>      system/memory: Add plugin callbacks to intercept MMIO accesses
>>>>>      hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
>>>>>      hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
>>>>>      contrib/plugins: Add fault injection plugin
>>>>>      docs: Add description of fault-injection plugin and subsystem
>>>>>
>>>>>     contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
>>>>>     contrib/plugins/meson.build       |   1 +
>>>>>     docs/fault-injection.txt          | 111 +++++
>>>>>     hw/arm/smmuv3.c                   |  54 +++
>>>>>     hw/intc/arm_gic.c                 |  28 ++
>>>>>     hw/intc/arm_gicv3.c               |  28 ++
>>>>>     include/plugins/qemu-plugin.h     |  28 ++
>>>>>     include/qemu/plugin.h             |  39 ++
>>>>>     plugins/api.c                     |  62 +++
>>>>>     plugins/core.c                    |  11 +
>>>>>     plugins/fault.c                   | 116 +++++
>>>>>     plugins/meson.build               |   1 +
>>>>>     plugins/plugin.h                  |   2 +
>>>>>     system/memory.c                   |   8 +
>>>>>     target/arm/cpu.h                  |   4 +
>>>>>     target/arm/helper.c               |  55 +++
>>>>>     16 files changed, 1320 insertions(+)
>>>>>     create mode 100644 contrib/plugins/fault_injection.c
>>>>>     create mode 100644 docs/fault-injection.txt
>>>>>     create mode 100644 plugins/fault.c
>>>>>
>>>>
>>>> first, thanks for posting your series!
>>>>
>>>> About the general approach.
>>>> As you noticed, this is exposing a lot of QEMU internals, and it's
>>>> something we tend to avoid to do. As well, it's very architecture
>>>> specific, which is another pattern we try to avoid.
>>>>
>>>> For some of your needs (especially IRQ injection and timer injection),
>>>> did you consider writing a custom ad-hoc device and timer generating those?
>>>> There is nothing preventing you from writing a plugin that can
>>>> communicate with this specific device (through a socket for instance),
>>>> to request specific injections. I feel that it would scale better than
>>>> exposing all this to QEMU plugins API.
>>>>
>>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
>>>> device, associated to qtest to unit test the smmu implementation. We
>>>> could maybe see to leverage that on a full machine, associated with the
>>>> communication method mentioned above, to generate specific operations at
>>>> runtime, all triggered via a plugin.
>>>>
>>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
>>>> on QEMU side. Better to fix it than expose this very internal function.
>>>
>>> The reason this was needed is that the plugin may receive PC trigger
>>> configuration
>>> dynamically and need to register instruction callback at runtime.
>>> If the TB for that PC is already translated and cached, our newly registered
>>> callback might not be executed.
>>>
>>> If there is a more proper way to force QEMU to re-translate a specific
>>> TB or attach
>>> a callback to cached TB it would be great to reduce the complexity here.
>>>
>>
>> I understand better. QEMU plugin current implementation is too limited
>> for this, and everything has to be done/known at translation time.
>> What is your use case for receiving PC trigger after translation? Do you
>> have some mechanism to communicate with the plugin for this?
> 
> Yes, exactly. If the guest has already executed the target code, the newly
> added trigger will be ignored, as the TB is cached.
> 
> For runtime configuration, the plugin spawns a background thread that listens
> on a socket. External Python test script connects to this socket to send
> dynamically generated XML faults.
>

Ok.

Internally, we have tb_invalidate_phys_range, which invalidates the TBs in a
given address range. It is called when writing to memory at an address
holding code.

Thus, if your plugin writes to the PC address with
qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
this TB. You'll need to read 1 byte and write it back. It should also be
more efficient, since you will only invalidate this TB.

Give it a try and let us know if it works for your needs.
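
The read-one-byte, write-it-back idea above can be sketched as a small
self-contained toy model. Everything here is an assumption made for
illustration: guest_read(), guest_write(), the guest_code array and the
tb_cached flag are hypothetical stand-ins, not QEMU code. In a real plugin
the two accessors would be replaced by the plugin memory read/write calls
(qemu_plugin_write_memory_vaddr is the one named above), and the
invalidation would happen inside QEMU rather than in a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: a tiny "guest code" region and one cached-TB flag. */
#define CODE_BASE 0x1000u
#define CODE_SIZE 16u

static uint8_t guest_code[CODE_SIZE] = { 0x1f, 0x20, 0x03, 0xd5 }; /* AArch64 nop */
static bool tb_cached = true; /* pretend the TB covering the region is cached */

/* True if [vaddr, vaddr + len) lies inside the toy code region. */
static bool in_code(uint64_t vaddr, size_t len)
{
    return len <= CODE_SIZE && vaddr >= CODE_BASE &&
           vaddr - CODE_BASE <= CODE_SIZE - len;
}

/* Hypothetical stand-in for a plugin guest-memory read. */
static bool guest_read(uint64_t vaddr, uint8_t *buf, size_t len)
{
    if (!in_code(vaddr, len)) {
        return false;
    }
    memcpy(buf, &guest_code[vaddr - CODE_BASE], len);
    return true;
}

/* Hypothetical stand-in for a plugin guest-memory write. */
static bool guest_write(uint64_t vaddr, const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (!in_code(vaddr, len)) {
        return false;
    }
    memcpy(&guest_code[vaddr - CODE_BASE], buf, len);
    tb_cached = false; /* models "a write to code invalidates the TB" */
    return true;
}

/* Read one byte at pc and write the same value back, leaving code unchanged. */
static bool touch_code_byte(uint64_t pc)
{
    uint8_t byte;

    return guest_read(pc, &byte, 1) && guest_write(pc, &byte, 1);
}
```

Calling touch_code_byte(CODE_BASE) leaves the instruction bytes as they were
and clears tb_cached, which is the effect the suggestion relies on. Whether
the real write path invalidates the TB in every configuration still has to be
checked against QEMU itself.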

> There are several scenarios where this might be needed, mainly for faults that
> are difficult to define statically at boot time.
> Examples include injecting faults after specific chain of events, freezing or
> overriding system registers values at specific execution points (since this
> is currently implemented via PC triggers). Supporting environments with KASLR
> enabled might be one more case.
>

For system registers, you can (heavy, but it would work) unconditionally
instrument all instructions that touch those registers, so there
would be no need to flush anything. System registers are not accessed
by every instruction, so hopefully it should not impact execution time
too much.
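
A minimal sketch of that split, under stated assumptions: the translation-time
filter looks at the instruction's disassembly text and the run-time check
consults a table that can change while the guest runs. The disassembly format
("mrs x0, midr_el1"), the naive substring match, and all names below
(insn_touches_sysreg, fault_add, fault_matches) are hypothetical and only
illustrate the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Translation-time filter: instrument every AArch64 instruction whose
 * disassembly starts with mrs (sysreg read) or msr (sysreg write).
 */
static bool insn_touches_sysreg(const char *disas)
{
    assert(disas != NULL);
    return strncmp(disas, "mrs ", 4) == 0 || strncmp(disas, "msr ", 4) == 0;
}

/* Run-time table of register names that currently have a fault attached. */
#define MAX_FAULTS 8

struct sysreg_fault {
    char name[32];
    bool active;
};

static struct sysreg_fault faults[MAX_FAULTS];

/* Add a fault at run time; no re-translation is needed for it to apply. */
static bool fault_add(const char *name)
{
    for (size_t i = 0; i < MAX_FAULTS; i++) {
        if (!faults[i].active) {
            snprintf(faults[i].name, sizeof(faults[i].name), "%s", name);
            faults[i].active = true;
            return true;
        }
    }
    return false; /* table full */
}

/* Called from the per-instruction callback: is a fault armed for this insn? */
static bool fault_matches(const char *disas)
{
    for (size_t i = 0; i < MAX_FAULTS; i++) {
        if (faults[i].active && strstr(disas, faults[i].name) != NULL) {
            return true;
        }
    }
    return false;
}
```

Because the callback is attached to every mrs/msr when the TB is first
translated, arming "midr_el1" later only changes the table. A real plugin
would also need locking around the table if it is updated from another
thread, and a stricter match than strstr().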

Together, these two solutions should remove the need to expose tb_flush
through the plugin API.

>>
>>>> The associated TRIGGER_ON_PC is very similar to existing inline
>>>> operations. They could be enhanced to support writing to a given
>>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>>>> more complex, but we might enhance inline operations also to support
>>>> hooks on specific register writes.
>>>
>>> TRIGGER_ON_PC may also be used for generating other faults too. For example,
>>> one use-case is to trigger CPU exceptions on specific instructions.
>>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
>>> really interesting
>>> direction to explore.
>>>
>>
>> In general, having inline operations support on register read/writes
>> would be a very nice thing to have (though might be tricky to implement
>> correctly), and more efficient than the existing approach, which requires
>> checking their value every time.
>>
>>>>
>>>> For MMIO override, the current approach you have is good, and it's
>>>> definitely something we could integrate.
>>>>
>>>> What are your thoughts about this? (especially the device-based approach,
>>>> in case you tried it first).
>>>
>>> I agree such an approach can work well for IRQ's and Timers, and would be
>>> more clean way to implement this.
>>>
>>> However, for SMMU and similar cases, triggering internal state errors is not
>>> easy and requires accessing internal logic. So for those specific cases,
>>> a different approach may be needed.
>>>
>>
>> Thus the iommu-testdev I mentioned, that could be extended to support this.
>>
>>>>
>>>> Regards,
>>>> Pierrick
>>>
>>> BR,
>>> Ruslan
>>
>> Regards,
>> Pierrick

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Ruslan Ruslichenko 1 week, 4 days ago

On Fri, Mar 20, 2026 at 7:08 PM Pierrick Bouvier
<pierrick.bouvier@linaro.org> wrote:
>
> On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
> > On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
> > <pierrick.bouvier@linaro.org> wrote:
> >>
> >> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
> >>> Hi Pierrick,
> >>>
> >>> Thank you for the feedback and review!
> >>>
> >>> Our current plan is to put this plugin through our internal workflows to gather
> >>> more data on its limitations and performance.
> >>> Based on results, we may consider extending or refining the implementation
> >>> in the future.
> >>>
> >>> Any further feedback on potential issues is highly appreciated.
> >>>
> >>
> >> By design, the approach of modifying QEMU internals to allow to inject
> >> IRQ, set a timer, or trigger SMMU has very few chances to be integrated
> >> as it is. At least, it should be discussed with the concerned
> >> maintainers, and see if they would be open to it or not.
> >>
> >> It's not wrong in itself, if you want a downstream solution, but it does
> >> not scale upstream if we have to consider and accept everyone's needs.
> >> The plugin API in itself can accept the burden for such things, but it's
> >> harder to justify for internal stuff.
> >>
> >> I believe it would be better to rely on ad hoc devices generating this,
> >> with the advantage that even if they don't get accepted upstream, it
> >> will be more easy for you to maintain them downstream compared to more
> >> intrusive patches.
> >>
> >>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
> >>> <pierrick.bouvier@linaro.org> wrote:
> >>>>
> >>>> Hi Ruslan,
> >>>>
> >>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
> >>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
> >>>>>
> >>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
> >>>>>
> >>>>> Motivation
> >>>>>
> >>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
> >>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
> >>>>>
> >>>>> Architecture & Key Features
> >>>>>
> >>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
> >>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
> >>>>>
> >>>>> New Plugin API Capabilities:
> >>>>>
> >>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
> >>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
> >>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
> >>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
> >>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
> >>>>>
> >>>>> Patch Summary
> >>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
> >>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
> >>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
> >>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
> >>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
> >>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
> >>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
> >>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
> >>>>>
> >>>>> Request for Comments & Feedback
> >>>>>
> >>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
> >>>>>
> >>>>> Ruslan Ruslichenko (9):
> >>>>>      target/arm: Add API for dynamic exception injection
> >>>>>      plugins/api: Expose virtual clock timers to plugins
> >>>>>      plugins: Expose Transaction Block cache flush API to plugins
> >>>>>      plugins: Introduce fault injection API and core subsystem
> >>>>>      system/memory: Add plugin callbacks to intercept MMIO accesses
> >>>>>      hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
> >>>>>      hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
> >>>>>      contrib/plugins: Add fault injection plugin
> >>>>>      docs: Add description of fault-injection plugin and subsystem
> >>>>>
> >>>>>     contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
> >>>>>     contrib/plugins/meson.build       |   1 +
> >>>>>     docs/fault-injection.txt          | 111 +++++
> >>>>>     hw/arm/smmuv3.c                   |  54 +++
> >>>>>     hw/intc/arm_gic.c                 |  28 ++
> >>>>>     hw/intc/arm_gicv3.c               |  28 ++
> >>>>>     include/plugins/qemu-plugin.h     |  28 ++
> >>>>>     include/qemu/plugin.h             |  39 ++
> >>>>>     plugins/api.c                     |  62 +++
> >>>>>     plugins/core.c                    |  11 +
> >>>>>     plugins/fault.c                   | 116 +++++
> >>>>>     plugins/meson.build               |   1 +
> >>>>>     plugins/plugin.h                  |   2 +
> >>>>>     system/memory.c                   |   8 +
> >>>>>     target/arm/cpu.h                  |   4 +
> >>>>>     target/arm/helper.c               |  55 +++
> >>>>>     16 files changed, 1320 insertions(+)
> >>>>>     create mode 100644 contrib/plugins/fault_injection.c
> >>>>>     create mode 100644 docs/fault-injection.txt
> >>>>>     create mode 100644 plugins/fault.c
> >>>>>
> >>>>
> >>>> first, thanks for posting your series!
> >>>>
> >>>> About the general approach.
> >>>> As you noticed, this is exposing a lot of QEMU internals, and it's
> >>>> something we tend to avoid to do. As well, it's very architecture
> >>>> specific, which is another pattern we try to avoid.
> >>>>
> >>>> For some of your needs (especially IRQ injection and timer injection),
> >>>> did you consider writing a custom ad-hoc device and timer generating those?
> >>>> There is nothing preventing you from writing a plugin that can
> >>>> communicate with this specific device (through a socket for instance),
> >>>> to request specific injections. I feel that it would scale better than
> >>>> exposing all this to QEMU plugins API.
> >>>>
> >>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
> >>>> device, associated to qtest to unit test the smmu implementation. We
> >>>> could maybe see to leverage that on a full machine, associated with the
> >>>> communication method mentioned above, to generate specific operations at
> >>>> runtime, all triggered via a plugin.
> >>>>
> >>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
> >>>> on QEMU side. Better to fix it than expose this very internal function.
> >>>
> >>> The reason this was needed is that the plugin may receive PC trigger
> >>> configuration
> >>> dynamically and need to register instruction callback at runtime.
> >>> If the TB for that PC is already translated and cached, our newly registered
> >>> callback might not be executed.
> >>>
> >>> If there is a more proper way to force QEMU to re-translate a specific
> >>> TB or attach
> >>> a callback to cached TB it would be great to reduce the complexity here.
> >>>
> >>
> >> I understand better. QEMU plugin current implementation is too limited
> >> for this, and everything has to be done/known at translation time.
> >> What is your use case for receiving PC trigger after translation? Do you
> >> have some mechanism to communicate with the plugin for this?
> >
> > Yes, exactly. If the guest has already executed the target code, the newly
> > added trigger will be ignored, as the TB is cached.
> >
> > For runtime configuration, the plugin spawns a background thread that listens
> > on a socket. External Python test script connects to this socket to send
> > dynamically generated XML faults.
> >
>
> Ok.
>
> Internally, we have tb_invalidate_phys_range that will invalidate a
> given range of tb. This is called when writing to memory for a given
> address holding code.
>
> Thus from your plugin, if you write to pc address with
> qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
> this tb. You'll need to read 1 byte, and write it back. As well, it
> should be more efficient, since you will only invalidate this tb.
>
> Give it a try and let us know if it works for your need.
>

Thank you for your suggestion. This is really useful information about the
internals of TB processing.

I set up a test to simulate a scenario where a TB flush is needed
and used the described mechanism. However, there is a threading limitation:
qemu_plugin_write_memory_vaddr() must be called from a CPU thread.
In our current implementation, dynamic faults are received and processed
by a background thread listening on a socket, so we cannot directly
use the API from that context to trigger invalidation.
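
One possible way around this, shown only as an untested sketch, is for the
socket thread to queue the PC and let a callback that already runs on a vCPU
thread perform the write. The queue, request_invalidate(),
drain_invalidations() and record_touch() are hypothetical names; which vCPU
callback would do the draining is left open here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t pending[QUEUE_CAP];
static size_t pending_count;

/* Socket thread side: record a PC whose TB should be invalidated later. */
static bool request_invalidate(uint64_t pc)
{
    bool queued = false;

    pthread_mutex_lock(&queue_lock);
    if (pending_count < QUEUE_CAP) {
        pending[pending_count++] = pc;
        queued = true;
    }
    pthread_mutex_unlock(&queue_lock);
    return queued;
}

/*
 * vCPU thread side: take all queued PCs and hand each one to touch(),
 * which would do the read-and-write-back on guest memory.
 * Returns the number of PCs processed.
 */
static size_t drain_invalidations(void (*touch)(uint64_t pc))
{
    uint64_t batch[QUEUE_CAP];
    size_t n;

    assert(touch != NULL);
    pthread_mutex_lock(&queue_lock);
    n = pending_count;
    for (size_t i = 0; i < n; i++) {
        batch[i] = pending[i];
    }
    pending_count = 0;
    pthread_mutex_unlock(&queue_lock);

    /* Call touch() outside the lock so the socket thread is never blocked. */
    for (size_t i = 0; i < n; i++) {
        touch(batch[i]);
    }
    return n;
}

/* Demo stand-in for the real touch routine: just remembers the last PC. */
static uint64_t last_touched;

static void record_touch(uint64_t pc)
{
    last_touched = pc;
}
```

With this shape, the socket thread only calls request_invalidate(), and the
memory write happens in drain_invalidations() on a CPU thread. The cost is
latency: the new trigger takes effect only after the next drain.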

> > There are several scenarios where this might be needed, mainly for faults that
> > are difficult to define statically at boot time.
> > Examples include injecting faults after specific chain of events, freezing or
> > overriding system registers values at specific execution points (since this
> > is currently implemented via PC triggers). Supporting environments with KASLR
> > enabled might be one more case.
> >
>
> For system registers, you can (heavy but would work) instrument
> unconditionally all instructions that touch those registers, so there
> would be no need to flush anything. System registers are not accessed
> for every instruction, so hopefully, it should not impact too much
> execution time.
>

Agreed, this is a good optimization and indeed simplifies dynamic fault
handling for system register reads.
Thank you for the recommendation!

> With both solutions, it should remove the need to expose tb_flush
> through plugin API.
>
> >>
> >>>> The associated TRIGGER_ON_PC is very similar to existing inline
> >>>> operations. They could be enhanced to support writing to a given
> >>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
> >>>> more complex, but we might enhance inline operations also to support
> >>>> hooks on specific register writes.
> >>>
> >>> TRIGGER_ON_PC may also be used for generating other faults too. For example,
> >>> one use-case is to trigger CPU exceptions on specific instructions.
> >>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
> >>> really interesting
> >>> direction to explore.
> >>>
> >>
> >> In general, having inline operations support on register read/writes
> >> would be a very nice thing to have (though might be tricky to implement
> >> correctly), and more efficient than the existing approach, which requires
> >> checking their value every time.
> >>
> >>>>
> >>>> For MMIO override, the current approach you have is good, and it's
> >>>> definitely something we could integrate.
> >>>>
> >>>> What are your thoughts about this? (especially the device-based approach,
> >>>> in case you tried it first).
> >>>
> >>> I agree such an approach can work well for IRQ's and Timers, and would be
> >>> more clean way to implement this.
> >>>
> >>> However, for SMMU and similar cases, triggering internal state errors is not
> >>> easy and requires accessing internal logic. So for those specific cases,
> >>> a different approach may be needed.
> >>>
> >>
> >> Thus the iommu-testdev I mentioned, that could be extended to support this.
> >>
> >>>>
> >>>> Regards,
> >>>> Pierrick
> >>>
> >>> BR,
> >>> Ruslan
> >>
> >> Regards,
> >> Pierrick
>

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 1 week, 4 days ago

On 3/25/26 4:39 PM, Ruslan Ruslichenko wrote:
> On Fri, Mar 20, 2026 at 7:08 PM Pierrick Bouvier
> <pierrick.bouvier@linaro.org> wrote:
>>
>> On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
>>> On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
>>> <pierrick.bouvier@linaro.org> wrote:
>>>>
>>>> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
>>>>> Hi Pierrick,
>>>>>
>>>>> Thank you for the feedback and review!
>>>>>
>>>>> Our current plan is to put this plugin through our internal workflows to gather
>>>>> more data on its limitations and performance.
>>>>> Based on results, we may consider extending or refining the implementation
>>>>> in the future.
>>>>>
>>>>> Any further feedback on potential issues is highly appreciated.
>>>>>
>>>>
>>>> By design, the approach of modifying QEMU internals to allow to inject
>>>> IRQ, set a timer, or trigger SMMU has very few chances to be integrated
>>>> as it is. At least, it should be discussed with the concerned
>>>> maintainers, and see if they would be open to it or not.
>>>>
>>>> It's not wrong in itself, if you want a downstream solution, but it does
>>>> not scale upstream if we have to consider and accept everyone's needs.
>>>> The plugin API in itself can accept the burden for such things, but it's
>>>> harder to justify for internal stuff.
>>>>
>>>> I believe it would be better to rely on ad hoc devices generating this,
>>>> with the advantage that even if they don't get accepted upstream, it
>>>> will be more easy for you to maintain them downstream compared to more
>>>> intrusive patches.
>>>>
>>>>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
>>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>>
>>>>>> Hi Ruslan,
>>>>>>
>>>>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>>>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
>>>>>>>
>>>>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
>>>>>>>
>>>>>>> Motivation
>>>>>>>
>>>>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
>>>>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
>>>>>>>
>>>>>>> Architecture & Key Features
>>>>>>>
>>>>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
>>>>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
>>>>>>>
>>>>>>> New Plugin API Capabilities:
>>>>>>>
>>>>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
>>>>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
>>>>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
>>>>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
>>>>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
>>>>>>>
>>>>>>> Patch Summary
>>>>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
>>>>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
>>>>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
>>>>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
>>>>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
>>>>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
>>>>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
>>>>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
>>>>>>>
>>>>>>> Request for Comments & Feedback
>>>>>>>
>>>>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
>>>>>>>
>>>>>>> Ruslan Ruslichenko (9):
>>>>>>>       target/arm: Add API for dynamic exception injection
>>>>>>>       plugins/api: Expose virtual clock timers to plugins
>>>>>>>       plugins: Expose Transaction Block cache flush API to plugins
>>>>>>>       plugins: Introduce fault injection API and core subsystem
>>>>>>>       system/memory: Add plugin callbacks to intercept MMIO accesses
>>>>>>>       hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
>>>>>>>       hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
>>>>>>>       contrib/plugins: Add fault injection plugin
>>>>>>>       docs: Add description of fault-injection plugin and subsystem
>>>>>>>
>>>>>>>      contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
>>>>>>>      contrib/plugins/meson.build       |   1 +
>>>>>>>      docs/fault-injection.txt          | 111 +++++
>>>>>>>      hw/arm/smmuv3.c                   |  54 +++
>>>>>>>      hw/intc/arm_gic.c                 |  28 ++
>>>>>>>      hw/intc/arm_gicv3.c               |  28 ++
>>>>>>>      include/plugins/qemu-plugin.h     |  28 ++
>>>>>>>      include/qemu/plugin.h             |  39 ++
>>>>>>>      plugins/api.c                     |  62 +++
>>>>>>>      plugins/core.c                    |  11 +
>>>>>>>      plugins/fault.c                   | 116 +++++
>>>>>>>      plugins/meson.build               |   1 +
>>>>>>>      plugins/plugin.h                  |   2 +
>>>>>>>      system/memory.c                   |   8 +
>>>>>>>      target/arm/cpu.h                  |   4 +
>>>>>>>      target/arm/helper.c               |  55 +++
>>>>>>>      16 files changed, 1320 insertions(+)
>>>>>>>      create mode 100644 contrib/plugins/fault_injection.c
>>>>>>>      create mode 100644 docs/fault-injection.txt
>>>>>>>      create mode 100644 plugins/fault.c
>>>>>>>
>>>>>>
>>>>>> first, thanks for posting your series!
>>>>>>
>>>>>> About the general approach.
>>>>>> As you noticed, this is exposing a lot of QEMU internals, and it's
>>>>>> something we tend to avoid to do. As well, it's very architecture
>>>>>> specific, which is another pattern we try to avoid.
>>>>>>
>>>>>> For some of your needs (especially IRQ injection and timer injection),
>>>>>> did you consider writing a custom ad-hoc device and timer generating those?
>>>>>> There is nothing preventing you from writing a plugin that can
>>>>>> communicate with this specific device (through a socket for instance),
>>>>>> to request specific injections. I feel that it would scale better than
>>>>>> exposing all this to QEMU plugins API.
>>>>>>
>>>>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
>>>>>> device, associated to qtest to unit test the smmu implementation. We
>>>>>> could maybe see to leverage that on a full machine, associated with the
>>>>>> communication method mentioned above, to generate specific operations at
>>>>>> runtime, all triggered via a plugin.
>>>>>>
>>>>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
>>>>>> on QEMU side. Better to fix it than expose this very internal function.
>>>>>
>>>>> The reason this was needed is that the plugin may receive PC trigger
>>>>> configuration
>>>>> dynamically and need to register instruction callback at runtime.
>>>>> If the TB for that PC is already translated and cached, our newly registered
>>>>> callback might not be executed.
>>>>>
>>>>> If there is a more proper way to force QEMU to re-translate a specific
>>>>> TB or attach
>>>>> a callback to cached TB it would be great to reduce the complexity here.
>>>>>
>>>>
>>>> I understand better. QEMU plugin current implementation is too limited
>>>> for this, and everything has to be done/known at translation time.
>>>> What is your use case for receiving PC trigger after translation? Do you
>>>> have some mechanism to communicate with the plugin for this?
>>>
>>> Yes, exactly. If the guest has already executed the target code, the newly
>>> added trigger will be ignored, as the TB is cached.
>>>
>>> For runtime configuration, the plugin spawns a background thread that listens
>>> on a socket. External Python test script connects to this socket to send
>>> dynamically generated XML faults.
>>>
>>
>> Ok.
>>
>> Internally, we have tb_invalidate_phys_range that will invalidate a
>> given range of tb. This is called when writing to memory for a given
>> address holding code.
>>
>> Thus from your plugin, if you write to pc address with
>> qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
>> this tb. You'll need to read 1 byte, and write it back. As well, it
>> should be more efficient, since you will only invalidate this tb.
>>
>> Give it a try and let us know if it works for your need.
>>
> 
> Thank you for your suggestion. This is really useful information regarding
> internals of tb processing.
> 
> I set up a test to simulate a scenario where a TB flush is needed
> and used the described mechanism. However, there is a threading limitation:
> qemu_plugin_write_memory_vaddr() must be called from a CPU thread.
> In our current implementation dynamic faults are received and processed
> by a background thread listening on a socket, so we cannot directly
> use API from that context to trigger invalidation.
>

Indeed, when writing to a virtual address, we need to know the current 
execution context and page table setup to translate it. I have two ideas:
- Register a callback per TB. When hitting a TB that contains the address
  where the fault should be injected, perform the read/write described
  above. You always instrument, and selectively "poke" the code to trigger
  a new translation.
- Simulate a given number of CPU watchpoints (N) by using N conditional
  callbacks on every instruction, comparing the current PC to N addresses.
  I'm afraid it will be too slow.

One thing that could be considered on the API side is to add a way to
invalidate a specific hardware address (not all TBs), based on
tb_invalidate_phys_range. The problem is that the plugin then needs to
keep track of all physical addresses matching the virtual ones you want
to invalidate, which is not convenient.

Otherwise, the easiest way to solve all this is to expose tb_flush, as you
did, but keep this patch downstream for now.
If your final plugin will stay downstream (which I expect, given it has 
its own protocol for injecting faults and no source for it), it's really 
the cheapest solution.

The current design is built around the assumption that instrumentation 
is made at translation time (and not later). So changing it by 
instrumenting after translation brings new constraints we can't solve at 
the moment without exposing internal details.
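
To make the constraint concrete, here is a toy model in C. It is not QEMU
code, and every name in it (ToyCache, toy_translate, toy_invalidate_range,
and so on) is hypothetical. It only assumes what is described above: hooks
are attached when a block is translated, cached blocks are reused as-is,
and dropping a block forces a fresh translation that sees the new trigger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: none of these names exist in QEMU. */
#define TOY_MAX_TB 16
#define TOY_MAX_TRIGGERS 8

typedef struct {
    uint64_t start, end;  /* guest range [start, end) covered by the block */
    bool valid;
    bool hooked;          /* decided once, when the block is translated */
} ToyTB;

typedef struct {
    ToyTB tbs[TOY_MAX_TB];
    uint64_t triggers[TOY_MAX_TRIGGERS];
    int n_triggers;
    int translations;     /* how many times a block was (re)translated */
} ToyCache;

static void toy_init(ToyCache *c) { memset(c, 0, sizeof(*c)); }

/* Register a PC trigger; returns false when the table is full. */
static bool toy_add_trigger(ToyCache *c, uint64_t pc)
{
    if (c->n_triggers == TOY_MAX_TRIGGERS) {
        return false;
    }
    c->triggers[c->n_triggers++] = pc;
    return true;
}

/* Translation is the only place where hooks get attached. */
static ToyTB *toy_translate(ToyCache *c, uint64_t start, uint64_t end)
{
    ToyTB *tb = &c->tbs[0];  /* fall back to evicting slot 0 when full */
    for (int i = 0; i < TOY_MAX_TB; i++) {
        if (!c->tbs[i].valid) {
            tb = &c->tbs[i];
            break;
        }
    }
    tb->start = start;
    tb->end = end;
    tb->valid = true;
    tb->hooked = false;
    for (int i = 0; i < c->n_triggers; i++) {
        if (c->triggers[i] >= start && c->triggers[i] < end) {
            tb->hooked = true;
        }
    }
    c->translations++;
    return tb;
}

/* Run a block, translating on a cache miss; returns true if a hook fired. */
static bool toy_exec(ToyCache *c, uint64_t start, uint64_t end)
{
    for (int i = 0; i < TOY_MAX_TB; i++) {
        if (c->tbs[i].valid && c->tbs[i].start == start) {
            return c->tbs[i].hooked;  /* cached: triggers are not re-checked */
        }
    }
    return toy_translate(c, start, end)->hooked;
}

/* Targeted invalidation: drop only blocks overlapping [start, end). */
static int toy_invalidate_range(ToyCache *c, uint64_t start, uint64_t end)
{
    int dropped = 0;
    for (int i = 0; i < TOY_MAX_TB; i++) {
        ToyTB *tb = &c->tbs[i];
        if (tb->valid && tb->start < end && start < tb->end) {
            tb->valid = false;
            dropped++;
        }
    }
    return dropped;
}

/* Full flush: drop every block, the toy analogue of exposing tb_flush. */
static int toy_flush_all(ToyCache *c)
{
    return toy_invalidate_range(c, 0, UINT64_MAX);
}
```

In this model a trigger added after the first execution of a block stays
invisible until either toy_invalidate_range() or toy_flush_all() drops the
block. The targeted variant leaves unrelated blocks cached, which is the
efficiency argument for the "poke" approach over a full flush.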

>>> There are several scenarios where this might be needed, mainly for faults that
>>> are difficult to define statically at boot time.
>>> Examples include injecting faults after specific chain of events, freezing or
>>> overriding system registers values at specific execution points (since this
>>> is currently implemented via PC triggers). Supporting environments with KASLR
>>> enabled might be one more case.
>>>
>>
>> For system registers, you can (heavy but would work) instrument
>> unconditionally all instructions that touch those registers, so there
>> would be no need to flush anything. System registers are not accessed
>> for every instruction, so hopefully, it should not impact too much
>> execution time.
>>
> 
> Agree, this is a good optimization and indeed simplifies dynamic faults
> handling for system register reads.
> Thank you for the recommendation!
>
>> With both solutions, it should remove the need to expose tb_flush
>> through plugin API.
>>
>>>>
>>>>>> The associated TRIGGER_ON_PC is very similar to existing inline
>>>>>> operations. They could be enhanced to support writing to a given
>>>>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>>>>>> more complex, but we might enhance inline operations also to support
>>>>>> hooks on specific register writes.
>>>>>
>>>>> TRIGGER_ON_PC may also be used for generating other faults too. For example,
>>>>> one use-case is to trigger CPU exceptions on specific instructions.
>>>>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
>>>>> really interesting
>>>>> direction to explore.
>>>>>
>>>>
>>>> In general, having inline operations support on register read/writes
>>>> would be a very nice thing to have (though might be tricky to implement
>>>> correctly), and more efficient than the existing approach that requires
>>>> to check their value every time.
>>>>
>>>>>>
>>>>>> For MMIO override, the current approach you have is good, and it's
>>>>>> definitely something we could integrate.
>>>>>>
>>>>>> What are your thoughts about this? (especially the device-based approach
>>>>>> in case that you maybe tried first).
>>>>>
>>>>> I agree such an approach can work well for IRQ's and Timers, and would be
>>>>> more clean way to implement this.
>>>>>
>>>>> However, for SMMU and similar cases, triggering internal state errors is not
>>>>> easy and requires accessing internal logic. So for those specific cases,
>>>>> a different approach may be needed.
>>>>>
>>>>
>>>> Thus the iommu-testdev I mentioned, that could be extended to support this.
>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Pierrick
>>>>>
>>>>> BR,
>>>>> Ruslan
>>>>
>>>> Regards,
>>>> Pierrick
>>

Regards,
Pierrick

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Alex Bennée 1 week, 4 days ago

(adding Richard to Cc)

Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:

> On 3/25/26 4:39 PM, Ruslan Ruslichenko wrote:
>> On Fri, Mar 20, 2026 at 7:08 PM Pierrick Bouvier
>> <pierrick.bouvier@linaro.org> wrote:
>>>
>>> On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
>>>> On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>
>>>>> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
>>>>>> Hi Pierrick,
>>>>>>
>>>>>> Thank you for the feedback and review!
>>>>>>
>>>>>> Our current plan is to put this plugin through our internal workflows to gather
>>>>>> more data on its limitations and performance.
>>>>>> Based on results, we may consider extending or refining the implementation
>>>>>> in the future.
>>>>>>
>>>>>> Any further feedback on potential issues is highly appreciated.
>>>>>>
>>>>>
>>>>> By design, the approach of modifying QEMU internals to allow to inject
>>>>> IRQ, set a timer, or trigger SMMU has very few chances to be integrated
>>>>> as it is. At least, it should be discussed with the concerned
>>>>> maintainers, and see if they would be open to it or not.
>>>>>
>>>>> It's not wrong in itself, if you want a downstream solution, but it does
>>>>> not scale upstream if we have to consider and accept everyone's needs.
>>>>> The plugin API in itself can accept the burden for such things, but it's
>>>>> harder to justify for internal stuff.
>>>>>
>>>>> I believe it would be better to rely on ad hoc devices generating this,
>>>>> with the advantage that even if they don't get accepted upstream, it
>>>>> will be more easy for you to maintain them downstream compared to more
>>>>> intrusive patches.
>>>>>
>>>>>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
>>>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>>>
>>>>>>> Hi Ruslan,
>>>>>>>
>>>>>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>>>>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
>>>>>>>>
>>>>>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
>>>>>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
>>>>>>>>
>>>>>>>> Architecture & Key Features
>>>>>>>>
>>>>>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
>>>>>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
>>>>>>>>
>>>>>>>> New Plugin API Capabilities:
>>>>>>>>
>>>>>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
>>>>>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
>>>>>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
>>>>>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
>>>>>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
>>>>>>>>
>>>>>>>> Patch Summary
>>>>>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
>>>>>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
>>>>>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
>>>>>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
>>>>>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
>>>>>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
>>>>>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
>>>>>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
>>>>>>>>
>>>>>>>> Request for Comments & Feedback
>>>>>>>>
>>>>>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
>>>>>>>>
>>>>>>>> Ruslan Ruslichenko (9):
>>>>>>>>       target/arm: Add API for dynamic exception injection
>>>>>>>>       plugins/api: Expose virtual clock timers to plugins
>>>>>>>>       plugins: Expose Transaction Block cache flush API to plugins
>>>>>>>>       plugins: Introduce fault injection API and core subsystem
>>>>>>>>       system/memory: Add plugin callbacks to intercept MMIO accesses
>>>>>>>>       hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
>>>>>>>>       hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
>>>>>>>>       contrib/plugins: Add fault injection plugin
>>>>>>>>       docs: Add description of fault-injection plugin and subsystem
>>>>>>>>
>>>>>>>>      contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
>>>>>>>>      contrib/plugins/meson.build       |   1 +
>>>>>>>>      docs/fault-injection.txt          | 111 +++++
>>>>>>>>      hw/arm/smmuv3.c                   |  54 +++
>>>>>>>>      hw/intc/arm_gic.c                 |  28 ++
>>>>>>>>      hw/intc/arm_gicv3.c               |  28 ++
>>>>>>>>      include/plugins/qemu-plugin.h     |  28 ++
>>>>>>>>      include/qemu/plugin.h             |  39 ++
>>>>>>>>      plugins/api.c                     |  62 +++
>>>>>>>>      plugins/core.c                    |  11 +
>>>>>>>>      plugins/fault.c                   | 116 +++++
>>>>>>>>      plugins/meson.build               |   1 +
>>>>>>>>      plugins/plugin.h                  |   2 +
>>>>>>>>      system/memory.c                   |   8 +
>>>>>>>>      target/arm/cpu.h                  |   4 +
>>>>>>>>      target/arm/helper.c               |  55 +++
>>>>>>>>      16 files changed, 1320 insertions(+)
>>>>>>>>      create mode 100644 contrib/plugins/fault_injection.c
>>>>>>>>      create mode 100644 docs/fault-injection.txt
>>>>>>>>      create mode 100644 plugins/fault.c
>>>>>>>>
>>>>>>>
>>>>>>> first, thanks for posting your series!
>>>>>>>
>>>>>>> About the general approach.
>>>>>>> As you noticed, this is exposing a lot of QEMU internals, and it's
>>>>>>> something we tend to avoid to do. As well, it's very architecture
>>>>>>> specific, which is another pattern we try to avoid.
>>>>>>>
>>>>>>> For some of your needs (especially IRQ injection and timer injection),
>>>>>>> did you consider writing a custom ad-hoc device and timer generating those?
>>>>>>> There is nothing preventing you from writing a plugin that can
>>>>>>> communicate with this specific device (through a socket for instance),
>>>>>>> to request specific injections. I feel that it would scale better than
>>>>>>> exposing all this to QEMU plugins API.
>>>>>>>
>>>>>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
>>>>>>> device, associated to qtest to unit test the smmu implementation. We
>>>>>>> could maybe see to leverage that on a full machine, associated with the
>>>>>>> communication method mentioned above, to generate specific operations at
>>>>>>> runtime, all triggered via a plugin.
>>>>>>>
>>>>>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
>>>>>>> on QEMU side. Better to fix it than expose this very internal function.
>>>>>>
>>>>>> The reason this was needed is that the plugin may receive PC trigger
>>>>>> configuration
>>>>>> dynamically and need to register instruction callback at runtime.
>>>>>> If the TB for that PC is already translated and cached, our newly registered
>>>>>> callback might not be executed.
>>>>>>
>>>>>> If there is a more proper way to force QEMU to re-translate a specific
>>>>>> TB or attach
>>>>>> a callback to cached TB it would be great to reduce the complexity here.
>>>>>>
>>>>>
>>>>> I understand better. QEMU plugin current implementation is too limited
>>>>> for this, and everything has to be done/known at translation time.
>>>>> What is your use case for receiving PC trigger after translation? Do you
>>>>> have some mechanism to communicate with the plugin for this?
>>>>
>>>> Yes, exactly. If the guest has already executed the target code, the newly
>>>> added trigger will be ignored, as the TB is cached.
>>>>
>>>> For runtime configuration, the plugin spawns a background thread that listens
>>>> on a socket. External Python test script connects to this socket to send
>>>> dynamically generated XML faults.
>>>>
>>>
>>> Ok.
>>>
>>> Internally, we have tb_invalidate_phys_range that will invalidate a
>>> given range of tb. This is called when writing to memory for a given
>>> address holding code.
>>>
>>> Thus from your plugin, if you write to pc address with
>>> qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
>>> this tb. You'll need to read 1 byte, and write it back. As well, it
>>> should be more efficient, since you will only invalidate this tb.
>>>
>>> Give it a try and let us know if it works for your need.
>>>
>> Thank you for your suggestion. This is really useful information
>> regarding
>> internals of tb processing.
>> I set up a test to simulate a scenario where a TB flush is needed
>> and used the described mechanism. However, there is a threading limitation:
>> qemu_plugin_write_memory_vaddr() must be called from a CPU thread.
>> In our current implementation dynamic faults are received and processed
>> by a background thread listening on a socket, so we cannot directly
>> use API from that context to trigger invalidation.
>>
>
> Indeed, when writing to a virtual address, we need to know the current
> execution context and page table setup to translate it. I have two
> ideas:
> - Register a callback per tb. When hitting a tb containing address
>   where to inject the fault, perform the read/write described above.

You could use a conditional callback with a scoreboard (or possibly
introduce a map feature similar to eBPF). You would track the address
ranges and latch the scoreboard when you want to look at something more
closely.

I wonder if allowing the TB itself to be invalidated conditionally would
be ok? We do try really hard to avoid exposing internal implementation
details to plugins but the concept of a block of instructions is kinda
already baked in. However we want to avoid plugins having to track a lot
of translation state to be useful.
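
As a rough illustration of the latch idea, here is a toy model in C. It
does not call the plugin API (which, as far as I know, already offers
scoreboards and conditional callbacks); all names below are hypothetical.
The assumption is that a control thread, such as the socket listener, only
sets a per-vCPU flag, and the vCPU thread does the real work from its own
context when a cheap inline test sees the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in QEMU. */
#define TOY_MAX_VCPUS 8

typedef struct {
    atomic_uint_fast64_t latch[TOY_MAX_VCPUS];  /* one slot per vCPU */
    unsigned long slow_calls[TOY_MAX_VCPUS];    /* touched only by its vCPU */
} ToyScoreboard;

static void toy_sb_init(ToyScoreboard *sb)
{
    for (unsigned i = 0; i < TOY_MAX_VCPUS; i++) {
        atomic_init(&sb->latch[i], 0);
        sb->slow_calls[i] = 0;
    }
}

/* Control thread: arm the latch for the first n_vcpus vCPUs. */
static void toy_sb_latch(ToyScoreboard *sb, unsigned n_vcpus)
{
    for (unsigned i = 0; i < n_vcpus && i < TOY_MAX_VCPUS; i++) {
        atomic_store_explicit(&sb->latch[i], 1, memory_order_release);
    }
}

/* vCPU thread: the cheap inline condition, a single load and compare. */
static bool toy_sb_armed(ToyScoreboard *sb, unsigned vcpu)
{
    return atomic_load_explicit(&sb->latch[vcpu], memory_order_acquire) != 0;
}

/* vCPU thread: the slow callback; it runs in vCPU context and disarms. */
static void toy_sb_slow_cb(ToyScoreboard *sb, unsigned vcpu)
{
    sb->slow_calls[vcpu]++;
    atomic_store_explicit(&sb->latch[vcpu], 0, memory_order_release);
}

/* vCPU thread: what runs per instrumented block or instruction. */
static bool toy_sb_on_exec(ToyScoreboard *sb, unsigned vcpu)
{
    if (!toy_sb_armed(sb, vcpu)) {
        return false;  /* common case: nothing pending */
    }
    toy_sb_slow_cb(sb, vcpu);
    return true;
}
```

The point of the split is that the slow callback executes on the vCPU
thread, so anything that must run in CPU context could be done there,
while the control thread never does more than a store.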

> You always instrument, and selectively "poke" the code to trigger a
> new translation.
> - Simulate a given number of cpu watchpoints (N) by using N
>   conditional callback on every instruction, comparing current pc to N
>   addresses. I'm afraid it will be too slow.

I think you want at most one conditional check per instruction and then
take the slow path to check.
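
A minimal sketch of that shape, as standalone C rather than plugin code
(WatchSet and the watch_* helpers are hypothetical names, and the 64-bucket
filter is just one possible choice): every instruction pays for a single
bit test, and only PCs that land in an occupied bucket go on to the exact
comparison against the N addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in QEMU. */
#define WATCH_MAX 32

typedef struct {
    uint64_t filter;               /* one bit per hash bucket */
    uint64_t addrs[WATCH_MAX];     /* exact watched PCs */
    size_t n;
    unsigned long slow_path_runs;  /* how often the exact check ran */
} WatchSet;

/* Assumes 4-byte aligned instructions, so the low two bits carry nothing. */
static unsigned watch_bucket(uint64_t pc)
{
    return (unsigned)((pc >> 2) & 63);
}

/* Add a watched PC; returns false when the table is full. */
static bool watch_add(WatchSet *w, uint64_t pc)
{
    if (w->n == WATCH_MAX) {
        return false;
    }
    w->addrs[w->n++] = pc;
    w->filter |= UINT64_C(1) << watch_bucket(pc);
    return true;
}

/* Slow path: exact comparison against every watched address. */
static bool watch_slow_match(WatchSet *w, uint64_t pc)
{
    w->slow_path_runs++;
    for (size_t i = 0; i < w->n; i++) {
        if (w->addrs[i] == pc) {
            return true;
        }
    }
    return false;
}

/* Per-instruction check: one bit test, then the slow path if needed. */
static bool watch_check(WatchSet *w, uint64_t pc)
{
    if (!(w->filter & (UINT64_C(1) << watch_bucket(pc)))) {
        return false;  /* fast reject: no watched PC shares this bucket */
    }
    return watch_slow_match(w, pc);
}
```

The filter can report false positives (two PCs sharing a bucket) but never
false negatives, so the slow path stays correct and the cost per
instruction no longer grows with N.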

>
> One thing that could be considered on API side is to add a possibility
> to invalidate a specific hardware address (not all tb), based on
> tb_invalidate_phys_range. The problem is that the plugin now needs to keep
> track of all physical addresses matching virtual ones you want to
> invalidate, which is not convenient.
>
> Else, the easiest way to solve all this is to expose tb_flush, like
> you did, but keep this patch downstream for now.
> If your final plugin will stay downstream (which I expect, given it
> has its own protocol for injecting faults and no source for it), it's
> really the cheapest solution.
>
> The current design is built around the assumption that instrumentation
> is made at translation time (and not later). So changing it by
> instrumenting after translation brings new constraints we can't solve
> at the moment without exposing internal details.

We should certainly consider automatically triggering tb_flush() on each
qemu_plugin_register_vcpu_tb_trans_cb() so at least the case of
dynamically loading a plugin doesn't miss previous translations. 

>
>>>> There are several scenarios where this might be needed, mainly for faults that
>>>> are difficult to define statically at boot time.
>>>> Examples include injecting faults after specific chain of events, freezing or
>>>> overriding system registers values at specific execution points (since this
>>>> is currently implemented via PC triggers). Supporting environments with KASLR
>>>> enabled might be one more case.
>>>>
>>>
>>> For system registers, you can (heavy but would work) instrument
>>> unconditionally all instructions that touch those registers, so there
>>> would be no need to flush anything. System registers are not accessed
>>> for every instruction, so hopefully, it should not impact too much
>>> execution time.
>>>
>> Agree, this is a good optimization and indeed simplifies dynamic
>> faults
>> handling for system register reads.
>> Thank you for the recommendation!
>>
>>> With both solutions, it should remove the need to expose tb_flush
>>> through plugin API.
>>>
>>>>>
>>>>>>> The associated TRIGGER_ON_PC is very similar to existing inline
>>>>>>> operations. They could be enhanced to support writing to a given
>>>>>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>>>>>>> more complex, but we might enhance inline operations also to support
>>>>>>> hooks on specific register writes.
>>>>>>
>>>>>> TRIGGER_ON_PC may also be used for generating other faults too. For example,
>>>>>> one use-case is to trigger CPU exceptions on specific instructions.
>>>>>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
>>>>>> really interesting
>>>>>> direction to explore.
>>>>>>
>>>>>
>>>>> In general, having inline operations support on register read/writes
>>>>> would be a very nice thing to have (though might be tricky to implement
>>>>> correctly), and more efficient than the existing approach that requires
>>>>> to check their value every time.
>>>>>
>>>>>>>
>>>>>>> For MMIO override, the current approach you have is good, and it's
>>>>>>> definitely something we could integrate.
>>>>>>>
>>>>>>> What are your thoughts about this? (especially the device-based approach
>>>>>>> in case that you maybe tried first).
>>>>>>
>>>>>> I agree such an approach can work well for IRQ's and Timers, and would be
>>>>>> more clean way to implement this.
>>>>>>
>>>>>> However, for SMMU and similar cases, triggering internal state errors is not
>>>>>> easy and requires accessing internal logic. So for those specific cases,
>>>>>> a different approach may be needed.
>>>>>>
>>>>>
>>>>> Thus the iommu-testdev I mentioned, that could be extended to support this.
>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Pierrick
>>>>>>
>>>>>> BR,
>>>>>> Ruslan
>>>>>
>>>>> Regards,
>>>>> Pierrick
>>>
>
> Regards,
> Pierrick

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 1 week, 4 days ago

On 3/26/26 4:45 AM, Alex Bennée wrote:
> 
> (adding Richard to Cc)
> 
> Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:
> 
>> On 3/25/26 4:39 PM, Ruslan Ruslichenko wrote:
>>> On Fri, Mar 20, 2026 at 7:08 PM Pierrick Bouvier
>>> <pierrick.bouvier@linaro.org> wrote:
>>>>
>>>> On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
>>>>> On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
>>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>>
>>>>>> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
>>>>>>> Hi Pierrick,
>>>>>>>
>>>>>>> Thank you for the feedback and review!
>>>>>>>
>>>>>>> Our current plan is to put this plugin through our internal workflows to gather
>>>>>>> more data on its limitations and performance.
>>>>>>> Based on results, we may consider extending or refining the implementation
>>>>>>> in the future.
>>>>>>>
>>>>>>> Any further feedback on potential issues is highly appreciated.
>>>>>>>
>>>>>>
>>>>>> By design, the approach of modifying QEMU internals to allow to inject
>>>>>> IRQ, set a timer, or trigger SMMU has very few chances to be integrated
>>>>>> as it is. At least, it should be discussed with the concerned
>>>>>> maintainers, and see if they would be open to it or not.
>>>>>>
>>>>>> It's not wrong in itself, if you want a downstream solution, but it does
>>>>>> not scale upstream if we have to consider and accept everyone's needs.
>>>>>> The plugin API in itself can accept the burden for such things, but it's
>>>>>> harder to justify for internal stuff.
>>>>>>
>>>>>> I believe it would be better to rely on ad hoc devices generating this,
>>>>>> with the advantage that even if they don't get accepted upstream, it
>>>>>> will be more easy for you to maintain them downstream compared to more
>>>>>> intrusive patches.
>>>>>>
>>>>>>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
>>>>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>>>>
>>>>>>>> Hi Ruslan,
>>>>>>>>
>>>>>>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>>>>>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
>>>>>>>>>
>>>>>>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
>>>>>>>>>
>>>>>>>>> Motivation
>>>>>>>>>
>>>>>>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
>>>>>>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
>>>>>>>>>
>>>>>>>>> Architecture & Key Features
>>>>>>>>>
>>>>>>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
>>>>>>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
>>>>>>>>>
>>>>>>>>> New Plugin API Capabilities:
>>>>>>>>>
>>>>>>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
>>>>>>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
>>>>>>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
>>>>>>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
>>>>>>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
>>>>>>>>>
>>>>>>>>> Patch Summary
>>>>>>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
>>>>>>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
>>>>>>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
>>>>>>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
>>>>>>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
>>>>>>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
>>>>>>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
>>>>>>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
>>>>>>>>>
>>>>>>>>> Request for Comments & Feedback
>>>>>>>>>
>>>>>>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
>>>>>>>>>
>>>>>>>>> Ruslan Ruslichenko (9):
>>>>>>>>>        target/arm: Add API for dynamic exception injection
>>>>>>>>>        plugins/api: Expose virtual clock timers to plugins
>>>>>>>>>        plugins: Expose Transaction Block cache flush API to plugins
>>>>>>>>>        plugins: Introduce fault injection API and core subsystem
>>>>>>>>>        system/memory: Add plugin callbacks to intercept MMIO accesses
>>>>>>>>>        hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
>>>>>>>>>        hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
>>>>>>>>>        contrib/plugins: Add fault injection plugin
>>>>>>>>>        docs: Add description of fault-injection plugin and subsystem
>>>>>>>>>
>>>>>>>>>       contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
>>>>>>>>>       contrib/plugins/meson.build       |   1 +
>>>>>>>>>       docs/fault-injection.txt          | 111 +++++
>>>>>>>>>       hw/arm/smmuv3.c                   |  54 +++
>>>>>>>>>       hw/intc/arm_gic.c                 |  28 ++
>>>>>>>>>       hw/intc/arm_gicv3.c               |  28 ++
>>>>>>>>>       include/plugins/qemu-plugin.h     |  28 ++
>>>>>>>>>       include/qemu/plugin.h             |  39 ++
>>>>>>>>>       plugins/api.c                     |  62 +++
>>>>>>>>>       plugins/core.c                    |  11 +
>>>>>>>>>       plugins/fault.c                   | 116 +++++
>>>>>>>>>       plugins/meson.build               |   1 +
>>>>>>>>>       plugins/plugin.h                  |   2 +
>>>>>>>>>       system/memory.c                   |   8 +
>>>>>>>>>       target/arm/cpu.h                  |   4 +
>>>>>>>>>       target/arm/helper.c               |  55 +++
>>>>>>>>>       16 files changed, 1320 insertions(+)
>>>>>>>>>       create mode 100644 contrib/plugins/fault_injection.c
>>>>>>>>>       create mode 100644 docs/fault-injection.txt
>>>>>>>>>       create mode 100644 plugins/fault.c
>>>>>>>>>
>>>>>>>>
>>>>>>>> first, thanks for posting your series!
>>>>>>>>
>>>>>>>> About the general approach.
>>>>>>>> As you noticed, this exposes a lot of QEMU internals, which is
>>>>>>>> something we tend to avoid. It is also very architecture
>>>>>>>> specific, which is another pattern we try to avoid.
>>>>>>>>
>>>>>>>> For some of your needs (especially IRQ injection and timer injection),
>>>>>>>> did you consider writing a custom ad-hoc device and timer generating those?
>>>>>>>> There is nothing preventing you from writing a plugin that can
>>>>>>>> communicate with this specific device (through a socket for instance),
>>>>>>>> to request specific injections. I feel that it would scale better than
>>>>>>>> exposing all this to QEMU plugins API.
>>>>>>>>
>>>>>>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu
>>>>>>>> test device, associated with qtest, to unit test the smmu implementation.
>>>>>>>> We could maybe leverage that on a full machine, together with the
>>>>>>>> communication method mentioned above, to generate specific operations at
>>>>>>>> runtime, all triggered via a plugin.
>>>>>>>>
>>>>>>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
>>>>>>>> on QEMU side. Better to fix it than expose this very internal function.
>>>>>>>
>>>>>>> The reason this was needed is that the plugin may receive PC trigger
>>>>>>> configuration dynamically and need to register an instruction callback
>>>>>>> at runtime. If the TB for that PC is already translated and cached, our
>>>>>>> newly registered callback might not be executed.
>>>>>>>
>>>>>>> If there is a more proper way to force QEMU to re-translate a specific
>>>>>>> TB or attach a callback to a cached TB, it would be great to reduce the
>>>>>>> complexity here.
>>>>>>>
>>>>>>
>>>>>> I understand better. QEMU plugin current implementation is too limited
>>>>>> for this, and everything has to be done/known at translation time.
>>>>>> What is your use case for receiving PC trigger after translation? Do you
>>>>>> have some mechanism to communicate with the plugin for this?
>>>>>
>>>>> Yes, exactly. If the guest has already executed the target code, the newly
>>>>> added trigger will be ignored, as the TB is cached.
>>>>>
>>>>> For runtime configuration, the plugin spawns a background thread that listens
>>>>> on a socket. An external Python test script connects to this socket to send
>>>>> dynamically generated XML faults.
>>>>>
>>>>
>>>> Ok.
>>>>
>>>> Internally, we have tb_invalidate_phys_range that will invalidate a
>>>> given range of tb. This is called when writing to memory for a given
>>>> address holding code.
>>>>
>>>> Thus from your plugin, if you write to pc address with
>>>> qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
>>>> this tb. You'll need to read 1 byte, and write it back. As well, it
>>>> should be more efficient, since you will only invalidate this tb.
>>>>
>>>> Give it a try and let us know if it works for your need.
>>>>
>>> Thank you for your suggestion. This is really useful information
>>> regarding internals of tb processing.
>>> I set up a test to simulate a scenario where a TB flush is needed
>>> and used the described mechanism. However, there is a threading limitation:
>>> qemu_plugin_write_memory_vaddr() must be called from a CPU thread.
>>> In our current implementation dynamic faults are received and processed
>>> by a background thread listening on a socket, so we cannot directly
>>> use API from that context to trigger invalidation.
>>>
>>
>> Indeed, when writing to a virtual address, we need to know the current
>> execution context and page table setup to translate it. I have two
>> ideas:
>> - Register a callback per tb. When hitting a tb containing address
>>    where to inject the fault, perform the read/write described above.
> 
> You could use a conditional callback with a scoreboard (or possibly
> introduce a map feature similar to ebpf). You would track the address
> ranges and latch the scoreboard when you want to look at something more
> closely.
> 
> I wonder if allowing the TB itself to be invalidated conditionally would
> be ok? We do try really hard to avoid exposing internal implementation
> details to plugins but the concept of a block of instructions is kinda
> already baked in. However we want to avoid plugins having to track a lot
> of translation state to be useful.
>

In general yes. As explained on the thread, this plugin has a specific 
use case of instrumenting code *after* it's translated, which implies 
there is a need to track something to fulfill the need.

>> You always instrument, and selectively "poke" the code to trigger a
>> new translation.
>> - Simulate a given number of cpu watchpoints (N) by using N
>>    conditional callback on every instruction, comparing current pc to N
>>    addresses. I'm afraid it will be too slow.
> 
> I think you want at most one conditional check per instruction and then
> take the slow path to check.
>

Yes, it might be cheaper than having several conditional checks. I still 
think the idea above is better.
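
For illustration, the "one cheap check, then slow path" idea could look roughly like the sketch below. It is plain C that only models the logic; it does not use the plugin API. The names (`add_trigger`, `maybe_trigger`, `is_trigger`), the 64-bit filter, and the `MAX_TRIGGERS` bound are all hypothetical. In a real plugin the cheap test would presumably be expressed as a conditional callback, and the exact lookup would run in the callback body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRIGGERS 16  /* arbitrary bound for this sketch */

static uint64_t trigger_pcs[MAX_TRIGGERS];  /* exact trigger addresses */
static size_t n_triggers;
static uint64_t filter;                     /* one bit per hash bucket */

/* Map a PC to one of 64 filter bits. */
static uint64_t filter_bit(uint64_t pc)
{
    /* AArch64 instructions are 4-byte aligned, so drop the low two bits. */
    return UINT64_C(1) << ((pc >> 2) & 63);
}

/* Register a PC that should fire a trigger. */
void add_trigger(uint64_t pc)
{
    assert(n_triggers < MAX_TRIGGERS);
    trigger_pcs[n_triggers++] = pc;
    filter |= filter_bit(pc);
}

/* Fast path: a single mask test. May report false positives, never false negatives. */
bool maybe_trigger(uint64_t pc)
{
    return (filter & filter_bit(pc)) != 0;
}

/* Slow path: exact comparison, only reached when the fast path matches. */
bool is_trigger(uint64_t pc)
{
    if (!maybe_trigger(pc)) {
        return false;
    }
    for (size_t i = 0; i < n_triggers; i++) {
        if (trigger_pcs[i] == pc) {
            return true;
        }
    }
    return false;
}
```

The cost per instruction stays at one mask test regardless of how many addresses are tracked; the price is occasional false positives that fall through to the exact scan.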

>>
>> One thing that could be considered on API side is to add a possibility
>> to invalidate a specific hardware address (not all tb), based on
>> tb_invalidate_phys_range. The problem is that plugin now need to keep
>> track of all physical addresses matching virtual ones you want to
>> invalidate, which is not convenient.
>>
>> Else, the easiest way to solve all this is to expose tb_flush, like
>> you did, but keep this patch downstream for now.
>> If your final plugin will stay downstream (which I expect, given it
>> has its own protocol for injecting faults and no source for it), it's
>> really the cheapest solution.
>>
>> The current design is built around the assumption that instrumentation
>> is made at translation time (and not later). So changing it by
>> instrumenting after translation brings new constraints we can't solve
>> at the moment without exposing internal details.
> 
> We should certainly consider automatically triggering tb_flush() on each
> qemu_plugin_register_vcpu_tb_trans_cb() so at least the case of
> dynamically loading a plugin doesn't miss previous translations.
>

Sounds fair yes, but not sure it covers the current use case.
It would require the plugin to register several tb_trans to achieve the 
desired effect, which I'm not sure is a good idea. At this point, maybe 
we just accept to expose tb_flush through plugin API and call it a day.

>>
>>>>> There are several scenarios where this might be needed, mainly for faults that
>>>>> are difficult to define statically at boot time.
>>>>> Examples include injecting faults after specific chain of events, freezing or
>>>>> overriding system registers values at specific execution points (since this
>>>>> is currently implemented via PC triggers). Supporting environments with KASLR
>>>>> enabled might be one more case.
>>>>>
>>>>
>>>> For system registers, you can (heavy but would work) instrument
>>>> unconditionally all instructions that touch those registers, so there
>>>> would be no need to flush anything. System registers are not accessed
>>>> for every instruction, so hopefully, it should not impact too much
>>>> execution time.
>>>>
>>> Agree, this is a good optimization and indeed simplifies dynamic
>>> fault handling for system register reads.
>>> Thank you for the recommendation!
>>>
>>>> With both solutions, it should remove the need to expose tb_flush
>>>> through plugin API.
>>>>
>>>>>>
>>>>>>>> The associated TRIGGER_ON_PC is very similar to existing inline
>>>>>>>> operations. They could be enhanced to support writing to a given
>>>>>>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>>>>>>>> more complex, but we might enhance inline operations also to support
>>>>>>>> hooks on specific register writes.
>>>>>>>
>>>>>>> TRIGGER_ON_PC may also be used for generating other faults too. For example,
>>>>>>> one use-case is to trigger CPU exceptions on specific instructions.
>>>>>>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
>>>>>>> really interesting direction to explore.
>>>>>>>
>>>>>>
>>>>>> In general, having inline operations support on register read/writes
>>>>>> would be a very nice thing to have (though might be tricky to implement
>>>>>> correctly), and more efficient than the existing approach that requires
>>>>>> checking their value every time.
>>>>>>
>>>>>>>>
>>>>>>>> For MMIO override, the current approach you have is good, and it's
>>>>>>>> definitely something we could integrate.
>>>>>>>>
>>>>>>>> What are your thoughts about this? (especially the device-based approach,
>>>>>>>> in case you tried it first).
>>>>>>>
>>>>>>> I agree such an approach can work well for IRQs and timers, and would be
>>>>>>> a cleaner way to implement this.
>>>>>>>
>>>>>>> However, for SMMU and similar cases, triggering internal state errors is not
>>>>>>> easy and requires accessing internal logic. So for those specific cases,
>>>>>>> a different approach may be needed.
>>>>>>>
>>>>>>
>>>>>> Thus the iommu-testdev I mentioned, that could be extended to support this.
>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Pierrick
>>>>>>>
>>>>>>> BR,
>>>>>>> Ruslan
>>>>>>
>>>>>> Regards,
>>>>>> Pierrick
>>>>
>>
>> Regards,
>> Pierrick
>

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 1 week, 2 days ago

On 3/26/26 8:59 AM, Pierrick Bouvier wrote:
>>> One thing that could be considered on API side is to add a possibility
>>> to invalidate a specific hardware address (not all tb), based on
>>> tb_invalidate_phys_range. The problem is that plugin now need to keep
>>> track of all physical addresses matching virtual ones you want to
>>> invalidate, which is not convenient.
>>>
>>> Else, the easiest way to solve all this is to expose tb_flush, like
>>> you did, but keep this patch downstream for now.
>>> If your final plugin will stay downstream (which I expect, given it
>>> has its own protocol for injecting faults and no source for it), it's
>>> really the cheapest solution.
>>>
>>> The current design is built around the assumption that instrumentation
>>> is made at translation time (and not later). So changing it by
>>> instrumenting after translation brings new constraints we can't solve
>>> at the moment without exposing internal details.
>>
>> We should certainly consider automatically triggering tb_flush() on each
>> qemu_plugin_register_vcpu_tb_trans_cb() so at least the case of
>> dynamically loading a plugin doesn't miss previous translations.
>>
> 
> Sounds fair yes, but not sure it covers the current use case.
> It would require the plugin to register several tb_trans to achieve the
> desired effect, which I'm not sure is a good idea. At this point, maybe
> we just accept to expose tb_flush through plugin API and call it a day.
>

@Ruslan:

After looking at it again, this is already covered with:
```
void qemu_plugin_reset(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb)
```

which performs a tb_flush as expected, and calls cb once the reset is 
asynchronously done.
In the callback cb, you can reinstall your trans_cb, and it should do 
exactly what you expect.
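
To make the expected ordering concrete, here is a small standalone model of that flow. It is only a sketch: the `model_*` functions are hypothetical stand-ins for `qemu_plugin_reset()` and `qemu_plugin_register_vcpu_tb_trans_cb()`, the state variables are invented for illustration, and the model calls the completion callback synchronously whereas the real reset completes asynchronously.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*simple_cb_t)(int id);

/* Model state standing in for QEMU internals (illustrative only). */
static int cached_tbs = 3;          /* pretend some TBs are already translated */
static int trans_cb_installed = 1;  /* the plugin starts with its hook installed */
static int reset_completions;       /* how many times the completion cb ran */

/* Stand-in for qemu_plugin_register_vcpu_tb_trans_cb(). */
static void model_register_tb_trans_cb(int id)
{
    (void)id;
    trans_cb_installed = 1;
}

/*
 * Stand-in for qemu_plugin_reset(): drops the plugin's callbacks, flushes
 * the translated code, then reports completion through cb.
 */
static void model_plugin_reset(int id, simple_cb_t cb)
{
    assert(cb != NULL);
    trans_cb_installed = 0;
    cached_tbs = 0;
    cb(id);
}

/* Completion callback: the place to reinstall the translation hook. */
static void after_reset(int id)
{
    reset_completions++;
    model_register_tb_trans_cb(id);
}

/* Entry point a plugin would call once a new PC trigger has been stored. */
void apply_new_triggers(int id)
{
    model_plugin_reset(id, after_reset);
}
```

After `apply_new_triggers()` returns in this model, no translated blocks remain and the translation hook is installed again, so every block is re-translated with the updated trigger list in view.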

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Ruslan Ruslichenko 5 days, 20 hours ago

On Fri, Mar 27, 2026 at 7:18 PM Pierrick Bouvier
<pierrick.bouvier@linaro.org> wrote:
>
> @Ruslan:
>
> After looking at it again, this is already covered with:
> ```
> void qemu_plugin_reset(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb)
> ```
>
> which performs a tb_flush as expected, and calls cb once the reset is
> asynchronously done.
> In the callback cb, you can reinstall your trans_cb, and it should do
> exactly what you expect.
>

Hi Pierrick,

Thank you for the suggestion!

I have tried the qemu_plugin_reset() approach, but unfortunately, it
seems to have the same threading limitation: the TB cache flush will only
be performed if it is called from a CPU thread.

Under the hood, plugin_reset_uninstall() checks current_cpu, which is a
thread-local variable specific to vCPU threads. So when called from
background thread current_cpu is NULL and it skips async_safe_run_on_cpu().

Here is the relevant snippet:

void plugin_reset_uninstall(qemu_plugin_id_t id,
                            qemu_plugin_simple_cb_t cb,
                            bool reset)
....
    /*
     * Only flush the code cache if the vCPUs have been created. If so,
     * current_cpu must be non-NULL.
     */
    if (current_cpu) {
        async_safe_run_on_cpu(current_cpu, plugin_flush_destroy,
                              RUN_ON_CPU_HOST_PTR(data));
    } else {
        /*
         * If current_cpu isn't set, then we don't have yet any vCPU threads
         * and we therefore can remove the callbacks synchronously.
         */
        plugin_reset_destroy(data);
    }
...
}

--
BR,
Ruslan

Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions

Posted by Pierrick Bouvier 5 days, 19 hours ago

On 3/31/26 1:23 PM, Ruslan Ruslichenko wrote:
> 
> Hi Pierrick,
> 
> Thank you for the suggestion!
> 
> I have tried the qemu_plugin_reset() approach, but unfortunately, it
> seems to have the same threading limitation: the TB cache flush will only
> be performed if it is called from a CPU thread.
> 
> Under the hood, plugin_reset_uninstall() checks current_cpu, which is a
> thread-local variable specific to vCPU threads. So when called from
> background thread current_cpu is NULL and it skips async_safe_run_on_cpu().
> 
> Here is the relevant snippet:
> 
> void plugin_reset_uninstall(qemu_plugin_id_t id,
>                              qemu_plugin_simple_cb_t cb,
>                              bool reset)
> ....
>      /*
>       * Only flush the code cache if the vCPUs have been created. If so,
>       * current_cpu must be non-NULL.
>       */
>      if (current_cpu) {
>          async_safe_run_on_cpu(current_cpu, plugin_flush_destroy,
>                                RUN_ON_CPU_HOST_PTR(data));
>      } else {
>          /*
>           * If current_cpu isn't set, then we don't have yet any vCPU threads
>           * and we therefore can remove the callbacks synchronously.
>           */
>          plugin_reset_destroy(data);
>      }
> ...
> }
>

Indeed, a lot of complications come from the fact that there is an external 
thread trying to control instrumentation with the current design.

In this case, you will need to queue that reset call on one of the vcpus, 
by either:
- using a tb_exec callback
- waiting for a new block to be translated, if you don't want any overhead 
at runtime

I don't think we really have a choice since we can't force an exclusive 
section (i.e. stopping all vcpus) from an external thread using 
start_exclusive/end_exclusive. We need to execute this from a given 
vcpu, which is what async_safe_run_on_cpu does (run on vcpu from 
exclusive section).
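
A minimal sketch of the first option, assuming the socket thread only records a request and a per-TB execution callback picks it up on whichever vcpu runs next. This is standalone C that models the hand-off; `request_flush`, `on_tb_exec`, `issue_reset_from_vcpu`, and `MAX_VCPUS` are hypothetical names, and the stub stands in for the real `qemu_plugin_reset()` call made from vcpu context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_VCPUS 8  /* arbitrary bound for this sketch */

/* Set by the socket thread, consumed by whichever vcpu executes a TB next. */
static atomic_bool flush_requested;

/* Records which vcpu issued the reset; for illustration only. */
static int resets_by_vcpu[MAX_VCPUS];

/* Stub for calling qemu_plugin_reset() from a vcpu thread. */
static void issue_reset_from_vcpu(unsigned int vcpu_index)
{
    assert(vcpu_index < MAX_VCPUS);
    resets_by_vcpu[vcpu_index]++;
}

/* Called from the background (socket) thread: only records the request. */
void request_flush(void)
{
    atomic_store(&flush_requested, true);
}

/*
 * Body of a per-TB execution callback. It runs on a vcpu thread, so the
 * reset can be queued from here. Returns true if a request was consumed.
 */
bool on_tb_exec(unsigned int vcpu_index)
{
    /* atomic_exchange clears the flag and reports whether it was set. */
    if (atomic_exchange(&flush_requested, false)) {
        issue_reset_from_vcpu(vcpu_index);
        return true;
    }
    return false;
}
```

Several requests arriving before the next TB executes collapse into a single reset, and only one vcpu wins the exchange, so the reset is not issued twice for the same request.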

> --
> BR,
> Ruslan

Regards,
Pierrick