Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions


From: "Alex Bennée" <alex.bennee@linaro.org>
To: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Cc: Ruslan Ruslichenko <ruslichenko.r@gmail.com>,
	 qemu-devel@nongnu.org, qemu-arm@nongnu.org,
	 artem_mygaiev@epam.com, volodymyr_babchuk@epam.com,
	 peter.maydell@linaro.org, philmd@linaro.org,
	 Ruslan_Ruslichenko@epam.com,
	Richard Henderson <richard.henderson@linaro.org>
Subject: Re: [RFC PATCH 0/9] plugins: Introduce Fault Injection framework and API extensions
Date: Thu, 26 Mar 2026 11:45:12 +0000
Message-ID: <87a4vu3jcn.fsf@draig.linaro.org>
In-Reply-To: <005aa40d-4749-44d5-a65c-8f59cbd06d0c@linaro.org> (Pierrick Bouvier's message of "Wed, 25 Mar 2026 17:17:29 -0700")


(adding Richard to Cc)

Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:

> On 3/25/26 4:39 PM, Ruslan Ruslichenko wrote:
>> On Fri, Mar 20, 2026 at 7:08 PM Pierrick Bouvier
>> <pierrick.bouvier@linaro.org> wrote:
>>>
>>> On 3/19/26 3:29 PM, Ruslan Ruslichenko wrote:
>>>> On Thu, Mar 19, 2026 at 8:04 PM Pierrick Bouvier
>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>
>>>>> On 3/19/26 11:20 AM, Ruslan Ruslichenko wrote:
>>>>>> Hi Pierrick,
>>>>>>
>>>>>> Thank you for the feedback and review!
>>>>>>
>>>>>> Our current plan is to put this plugin through our internal workflows to gather
>>>>>> more data on its limitations and performance.
>>>>>> Based on results, we may consider extending or refining the implementation
>>>>>> in the future.
>>>>>>
>>>>>> Any further feedback on potential issues is highly appreciated.
>>>>>>
>>>>>
>>>>> By design, the approach of modifying QEMU internals to allow injecting
>>>>> an IRQ, setting a timer, or triggering the SMMU has very little chance of
>>>>> being integrated as is. At least, it should be discussed with the concerned
>>>>> maintainers, to see if they would be open to it or not.
>>>>>
>>>>> It's not wrong in itself, if you want a downstream solution, but it does
>>>>> not scale upstream if we have to consider and accept everyone's needs.
>>>>> The plugin API in itself can accept the burden for such things, but it's
>>>>> harder to justify for internal stuff.
>>>>>
>>>>> I believe it would be better to rely on ad hoc devices generating this,
>>>>> with the advantage that even if they don't get accepted upstream, it
>>>>> will be easier for you to maintain them downstream compared to more
>>>>> intrusive patches.
>>>>>
>>>>>> On Wed, Mar 18, 2026 at 6:16 PM Pierrick Bouvier
>>>>>> <pierrick.bouvier@linaro.org> wrote:
>>>>>>>
>>>>>>> Hi Ruslan,
>>>>>>>
>>>>>>> On 3/18/26 3:46 AM, Ruslan Ruslichenko wrote:
>>>>>>>> From: Ruslan Ruslichenko <Ruslan_Ruslichenko@epam.com>
>>>>>>>>
>>>>>>>> This patch series is submitted as an RFC to gather early feedback on a Fault Injection (FI) framework built on top of the QEMU TCG plugin subsystem.
>>>>>>>>
>>>>>>>> Motivation
>>>>>>>>
>>>>>>>> Testing guest operating systems, hypervisors (like Xen), and low-level drivers against unexpected hardware failures can be difficult.
>>>>>>>> This series provides an interface to inject faults dynamically without altering QEMU's core emulation source code for every test case.
>>>>>>>>
>>>>>>>> Architecture & Key Features
>>>>>>>>
>>>>>>>> The series introduces the core API extensions and implements a fault injection plugin (contrib/plugins/fault_injection.c) targeting AArch64.
>>>>>>>> The plugin can be controlled statically via XML configurations on boot, or dynamically at runtime via a UNIX socket (enabling integration with automated testing frameworks via Python or GDB).
>>>>>>>>
>>>>>>>> New Plugin API Capabilities:
>>>>>>>>
>>>>>>>> MMIO Interception: Allows plugins to hook into memory_region_dispatch_read/write to modify hardware register reads or drop writes.
>>>>>>>> Asynchronous Timers: Exposes QEMU_CLOCK_VIRTUAL to plugins, allowing callbacks to be scheduled based on guest virtual time.
>>>>>>>> TB Cache Flushing: Exposes qemu_plugin_flush_tb_cache() so plugins can force re-translation when applying dynamic PC-based hooks.
>>>>>>>> Interrupt & Exception Injection: Exposes APIs to raise/pulse hardware IRQs on the primary INTC and inject CPU exceptions (e.g., SErrors).
>>>>>>>> Custom Device Faults: Introduces a registry where device models (e.g., SMMUv3) can expose specific fault handlers (like CMDQ errors) to be triggered externally by plugins.
>>>>>>>>
>>>>>>>> Patch Summary
>>>>>>>> Patch 1 (target/arm): Adds support for asynchronous CPU exception injection.
>>>>>>>> Patch 2-3 (plugins/api): Exposes virtual clock timers and TB cache flushing to the public plugin API.
>>>>>>>> Patch 4 (plugins): Introduces the core fault injection subsystem, IRQ/Exception routing, and the Custom Fault registry.
>>>>>>>> Patch 5 (system/memory): Adds the MMIO override hooks into the memory dispatch path.
>>>>>>>> Patch 6 (hw/intc): Registers the ARM GIC (v2/v3) with the plugin subsystem to enable direct hardware IRQ injection.
>>>>>>>> Patch 7 (hw/arm): Registers the SMMUv3 with the custom fault registry to demonstrate how device models can expose specific errors (like CMDQ faults) to plugins.
>>>>>>>> Patch 8 (contrib/plugins): Implements the actual fault_injection plugin using the new APIs.
>>>>>>>> Patch 9 (docs): Adds documentation and usage examples for the plugin.
>>>>>>>>
>>>>>>>> Request for Comments & Feedback
>>>>>>>>
>>>>>>>> Any suggestions on improvements, potential edge cases, or issues with the current design are highly welcome.
>>>>>>>>
>>>>>>>> Ruslan Ruslichenko (9):
>>>>>>>>       target/arm: Add API for dynamic exception injection
>>>>>>>>       plugins/api: Expose virtual clock timers to plugins
>>>>>>>>       plugins: Expose Transaction Block cache flush API to plugins
>>>>>>>>       plugins: Introduce fault injection API and core subsystem
>>>>>>>>       system/memory: Add plugin callbacks to intercept MMIO accesses
>>>>>>>>       hw/intc/arm_gic: Register primary GIC for plugin IRQ injection
>>>>>>>>       hw/arm/smmuv3: Add plugin fault handler for CMDQ errors
>>>>>>>>       contrib/plugins: Add fault injection plugin
>>>>>>>>       docs: Add description of fault-injection plugin and subsystem
>>>>>>>>
>>>>>>>>      contrib/plugins/fault_injection.c | 772 ++++++++++++++++++++++++++++++
>>>>>>>>      contrib/plugins/meson.build       |   1 +
>>>>>>>>      docs/fault-injection.txt          | 111 +++++
>>>>>>>>      hw/arm/smmuv3.c                   |  54 +++
>>>>>>>>      hw/intc/arm_gic.c                 |  28 ++
>>>>>>>>      hw/intc/arm_gicv3.c               |  28 ++
>>>>>>>>      include/plugins/qemu-plugin.h     |  28 ++
>>>>>>>>      include/qemu/plugin.h             |  39 ++
>>>>>>>>      plugins/api.c                     |  62 +++
>>>>>>>>      plugins/core.c                    |  11 +
>>>>>>>>      plugins/fault.c                   | 116 +++++
>>>>>>>>      plugins/meson.build               |   1 +
>>>>>>>>      plugins/plugin.h                  |   2 +
>>>>>>>>      system/memory.c                   |   8 +
>>>>>>>>      target/arm/cpu.h                  |   4 +
>>>>>>>>      target/arm/helper.c               |  55 +++
>>>>>>>>      16 files changed, 1320 insertions(+)
>>>>>>>>      create mode 100644 contrib/plugins/fault_injection.c
>>>>>>>>      create mode 100644 docs/fault-injection.txt
>>>>>>>>      create mode 100644 plugins/fault.c
>>>>>>>>
>>>>>>>
>>>>>>> First, thanks for posting your series!
>>>>>>>
>>>>>>> About the general approach.
>>>>>>> As you noticed, this is exposing a lot of QEMU internals, and it's
>>>>>>> something we tend to avoid doing. It's also very architecture
>>>>>>> specific, which is another pattern we try to avoid.
>>>>>>>
>>>>>>> For some of your needs (especially IRQ injection and timer injection),
>>>>>>> did you consider writing a custom ad-hoc device and timer generating those?
>>>>>>> There is nothing preventing you from writing a plugin that can
>>>>>>> communicate with this specific device (through a socket for instance),
>>>>>>> to request specific injections. I feel that it would scale better than
>>>>>>> exposing all this to QEMU plugins API.
>>>>>>>
>>>>>>> For SMMU, this is trickier. Tao recently added (6ce361b02c82) an iommu test
>>>>>>> device, associated with qtest, to unit test the smmu implementation. We
>>>>>>> could maybe look at leveraging that on a full machine, combined with the
>>>>>>> communication method mentioned above, to generate specific operations at
>>>>>>> runtime, all triggered via a plugin.
>>>>>>>
>>>>>>> Exposing qemu_plugin_flush_tb_cache is a hint we are missing something
>>>>>>> on QEMU side. Better to fix it than expose this very internal function.
>>>>>>
>>>>>> The reason this was needed is that the plugin may receive PC trigger
>>>>>> configuration dynamically and need to register an instruction callback
>>>>>> at runtime. If the TB for that PC is already translated and cached, our
>>>>>> newly registered callback might not be executed.
>>>>>>
>>>>>> If there is a more proper way to force QEMU to re-translate a specific
>>>>>> TB or attach a callback to a cached TB, it would be great to reduce the
>>>>>> complexity here.
>>>>>>
>>>>>
>>>>> I understand better. The current QEMU plugin implementation is too limited
>>>>> for this, and everything has to be done/known at translation time.
>>>>> What is your use case for receiving a PC trigger after translation? Do you
>>>>> have some mechanism to communicate with the plugin for this?
>>>>
>>>> Yes, exactly. If the guest has already executed the target code, the newly
>>>> added trigger will be ignored, as the TB is cached.
>>>>
>>>> For runtime configuration, the plugin spawns a background thread that listens
>>>> on a socket. An external Python test script connects to this socket to send
>>>> dynamically generated XML faults.
>>>>
>>>
>>> Ok.
>>>
>>> Internally, we have tb_invalidate_phys_range, which will invalidate a
>>> given range of TBs. This is called when writing to memory at an
>>> address holding code.
>>>
>>> Thus from your plugin, if you write to the pc address with
>>> qemu_plugin_write_memory_vaddr, it should trigger a re-translation of
>>> this tb. You'll need to read 1 byte, and write it back. It should
>>> also be more efficient, since you will only invalidate this tb.
>>>
>>> Give it a try and let us know if it works for your need.
>>>
>> Thank you for your suggestion. This is really useful information
>> regarding the internals of TB processing.
>> I set up a test to simulate a scenario where a TB flush is needed
>> and used the described mechanism. However, there is a threading limitation:
>> qemu_plugin_write_memory_vaddr() must be called from a CPU thread.
>> In our current implementation dynamic faults are received and processed
>> by a background thread listening on a socket, so we cannot directly
>> use the API from that context to trigger invalidation.
>>
>
> Indeed, when writing to a virtual address, we need to know the current
> execution context and page table setup to translate it. I have two
> ideas:
> - Register a callback per tb. When hitting a tb containing address
>   where to inject the fault, perform the read/write described above.

You could use a conditional callback with a scoreboard (or possibly
introduce a map feature similar to ebpf). You would track the address
ranges and latch the scoreboard when you want to look at something more
closely.

I wonder if allowing the TB itself to be invalidated conditionally would
be ok? We do try really hard to avoid exposing internal implementation
details to plugins but the concept of a block of instructions is kinda
already baked in. However we want to avoid plugins having to track a lot
of translation state to be useful.
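
To make the scoreboard idea a bit more concrete, here is a rough
standalone model of the latch. It does not call the plugin API; the
names (WatchRange, Scoreboard, on_tb_enter, should_take_callback) and
the fixed MAX_VCPUS are all made up for illustration. The only point is
that the range tracking happens once per block, and the thing a
conditional callback would test is a single per-vCPU value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VCPUS 8   /* assumption: small fixed vCPU count for the model */

/* Hypothetical watched guest address range, [start, end). */
typedef struct {
    uint64_t start;
    uint64_t end;
} WatchRange;

/* Stand-in for a per-vCPU scoreboard: one latch slot per vCPU. */
typedef struct {
    uint64_t latched[MAX_VCPUS];
} Scoreboard;

/* True if the block [tb_start, tb_start + tb_size) overlaps a watched range. */
static bool tb_is_watched(const WatchRange *ranges, size_t n,
                          uint64_t tb_start, uint64_t tb_size)
{
    uint64_t tb_end = tb_start + tb_size;

    for (size_t i = 0; i < n; i++) {
        if (tb_start < ranges[i].end && ranges[i].start < tb_end) {
            return true;
        }
    }
    return false;
}

/* Called when a vCPU enters a block: latch only for watched blocks. */
static void on_tb_enter(Scoreboard *sb, unsigned vcpu,
                        const WatchRange *ranges, size_t n,
                        uint64_t tb_start, uint64_t tb_size)
{
    assert(vcpu < MAX_VCPUS);
    sb->latched[vcpu] = tb_is_watched(ranges, n, tb_start, tb_size) ? 1 : 0;
}

/* The cheap condition a conditional callback would test. */
static bool should_take_callback(const Scoreboard *sb, unsigned vcpu)
{
    assert(vcpu < MAX_VCPUS);
    return sb->latched[vcpu] != 0;
}
```

In a real plugin the ranges would be updated by whatever receives the
fault configuration, so access to them would need to be synchronised
with the vCPU threads; the model ignores that.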

> You always instrument, and selectively "poke" the code to trigger a
> new translation.
> - Simulate a given number of cpu watchpoints (N) by using N
>   conditional callbacks on every instruction, comparing current pc to N
>   addresses. I'm afraid it will be too slow.

I think you want at most one conditional check per instruction and then
take the slow path to check.
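
Roughly what I mean, as a standalone sketch rather than plugin code.
PcTriggers, triggers_init and pc_triggers are hypothetical names, and
the bounding-window test is just one possible choice of cheap check:
the per-instruction cost is a single unsigned compare, and only PCs
that land inside the window pay for the full lookup over the N
addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical trigger table: N watched PCs, sorted ascending. */
typedef struct {
    const uint64_t *pcs;
    size_t count;
    uint64_t lo;            /* lowest watched PC */
    uint64_t span;          /* highest watched PC minus lowest */
    unsigned slow_checks;   /* how often the slow path ran */
} PcTriggers;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* 'sorted' must hold at least one PC, in ascending order. */
static void triggers_init(PcTriggers *t, const uint64_t *sorted, size_t n)
{
    assert(n > 0);
    t->pcs = sorted;
    t->count = n;
    t->lo = sorted[0];
    t->span = sorted[n - 1] - sorted[0];
    t->slow_checks = 0;
}

static bool pc_triggers(PcTriggers *t, uint64_t pc)
{
    /* Fast path: one compare; unsigned wrap-around rejects pc < lo too. */
    if (pc - t->lo > t->span) {
        return false;
    }
    /* Slow path: exact match against the sorted list of watched PCs. */
    t->slow_checks++;
    return bsearch(&pc, t->pcs, t->count, sizeof(pc), cmp_u64) != NULL;
}
```

If the watched addresses are spread widely the window stops filtering
much, at which point a latch per block (as above) is probably the
better cheap check.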

>
> One thing that could be considered on API side is to add a possibility
> to invalidate a specific hardware address (not all tb), based on
> tb_invalidate_phys_range. The problem is that the plugin now needs to keep
> track of all physical addresses matching virtual ones you want to
> invalidate, which is not convenient.
>
> Else, the easiest way to solve all this is to expose tb_flush, like
> you did, but keep this patch downstream for now.
> If your final plugin will stay downstream (which I expect, given it
> has its own protocol for injecting faults and no source for it), it's
> really the cheapest solution.
>
> The current design is built around the assumption that instrumentation
> is made at translation time (and not later). So changing it by
> instrumenting after translation brings new constraints we can't solve
> at the moment without exposing internal details.

We should certainly consider automatically triggering tb_flush() on each
qemu_plugin_register_vcpu_tb_trans_cb() so at least the case of
dynamically loading a plugin doesn't miss previous translations. 
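
A toy model of why that matters, with nothing QEMU-specific in it:
ToyTranslator and its helpers are invented for the example, and the
"translator" just snapshots whether a hook was registered at the moment
a block is first seen. Without a flush on registration, blocks cached
earlier never pick up the hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16

/* A cached block remembers whether it was instrumented when translated. */
typedef struct {
    bool valid;
    uint64_t pc;
    bool instrumented;
} CachedBlock;

/* Hypothetical stand-in for a translator with a block cache. */
typedef struct {
    CachedBlock slots[CACHE_SLOTS];
    bool hook_registered;
    unsigned hook_calls;
} ToyTranslator;

/* Drop every cached block so the next execution re-translates. */
static void toy_flush(ToyTranslator *t)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        t->slots[i].valid = false;
    }
}

/* Register the hook, optionally flushing previously cached blocks. */
static void toy_register_hook(ToyTranslator *t, bool flush_on_register)
{
    t->hook_registered = true;
    if (flush_on_register) {
        toy_flush(t);
    }
}

/* Execute the block at pc, translating it first on a cache miss. */
static void toy_execute(ToyTranslator *t, uint64_t pc)
{
    CachedBlock *b = &t->slots[(pc >> 2) % CACHE_SLOTS];

    if (!b->valid || b->pc != pc) {
        b->valid = true;
        b->pc = pc;
        b->instrumented = t->hook_registered;   /* decided at translation */
    }
    if (b->instrumented) {
        t->hook_calls++;
    }
}
```

Running a block, registering the hook without the flush, and running
the block again leaves hook_calls at 0; with the flush, the second run
re-translates and the hook fires.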

>
>>>> There are several scenarios where this might be needed, mainly for faults that
>>>> are difficult to define statically at boot time.
>>>> Examples include injecting faults after specific chain of events, freezing or
>>>> overriding system registers values at specific execution points (since this
>>>> is currently implemented via PC triggers). Supporting environments with KASLR
>>>> enabled might be one more case.
>>>>
>>>
>>> For system registers, you can (heavy but would work) instrument
>>> unconditionally all instructions that touch those registers, so there
>>> would be no need to flush anything. System registers are not accessed
>>> for every instruction, so hopefully, it should not impact execution
>>> time too much.
>>>
>> Agree, this is a good optimization and indeed simplifies dynamic
>> fault handling for system register reads.
>> Thank you for the recommendation!
>>
>>> With both solutions, it should remove the need to expose tb_flush
>>> through plugin API.
>>>
>>>>>
>>>>>>> The associated TRIGGER_ON_PC is very similar to existing inline
>>>>>>> operations. They could be enhanced to support writing to a given
>>>>>>> register, all the bricks are there. For TRIGGER_ON_SYSREG it's a bit
>>>>>>> more complex, but we might enhance inline operations also to support
>>>>>>> hooks on specific register writes.
>>>>>>
>>>>>> TRIGGER_ON_PC may also be used for generating other faults. For example,
>>>>>> one use-case is to trigger CPU exceptions on specific instructions.
>>>>>> Supporting TRIGGER_ON_SYSREG as an inline operation sounds like a
>>>>>> really interesting direction to explore.
>>>>>>
>>>>>
>>>>> In general, having inline operations support on register read/writes
>>>>> would be a very nice thing to have (though might be tricky to implement
>>>>> correctly), and more efficient than the existing approach that requires
>>>>> checking their value every time.
>>>>>
>>>>>>>
>>>>>>> For MMIO override, the current approach you have is good, and it's
>>>>>>> definitely something we could integrate.
>>>>>>>
>>>>>>> What are your thoughts about this? (especially the device-based approach,
>>>>>>> in case you maybe tried it first).
>>>>>>
>>>>>> I agree such an approach can work well for IRQs and timers, and would be
>>>>>> a cleaner way to implement this.
>>>>>>
>>>>>> However, for SMMU and similar cases, triggering internal state errors is not
>>>>>> easy and requires accessing internal logic. So for those specific cases,
>>>>>> a different approach may be needed.
>>>>>>
>>>>>
>>>>> Thus the iommu-testdev I mentioned, which could be extended to support this.
>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Pierrick
>>>>>>
>>>>>> BR,
>>>>>> Ruslan
>>>>>
>>>>> Regards,
>>>>> Pierrick
>>>
>
> Regards,
> Pierrick

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
