[PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
@ 2025-09-09 20:48 Timothy Pearson
  2025-09-19 18:56 ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Timothy Pearson @ 2025-09-09 20:48 UTC (permalink / raw)
  To: kvm; +Cc: linuxppc-dev

PCI devices prior to PCI 2.3 both use level interrupts and do not support
interrupt masking, leading to a failure when passed through to a KVM guest on
at least the ppc64 platform, which does not utilize the resample IRQFD. This
failure manifests as receiving and acknowledging a single interrupt in the guest
while leaving the host physical device VFIO IRQ pending.

Level interrupts in general require special handling due to their inherently
asynchronous nature; both the host and guest interrupt controller need to
remain in synchronization in order to coordinate mask and unmask operations.
When lazy IRQ masking is used on DisINTx- hardware, the following sequence
occurs:

 * Level IRQ assertion on host
 * IRQ trigger within host interrupt controller, routed to VFIO driver
 * Host EOI with hardware level IRQ still asserted
 * Software mask of interrupt source by VFIO driver
 * Generation of event and IRQ trigger in KVM guest interrupt controller
 * Level IRQ deassertion on host
 * Guest EOI
 * Guest IRQ level deassertion
 * Removal of software mask by VFIO driver

Note that no actual state change occurs within the host interrupt controller,
unlike what would happen with either DisINTx+ hardware or message interrupts.
The host EOI is not fired with the hardware level IRQ deasserted, and the
level interrupt is not re-armed within the host interrupt controller, leading
to an unrecoverable stall of the device.

Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.

---
 drivers/vfio/pci/vfio_pci_intrs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 123298a4dc8f..011169ca7a34 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -304,6 +304,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,

 	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;

+	if (is_intx(vdev) && !vdev->pci_2_3)
+		irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
+
 	ret = request_irq(pdev->irq, vfio_intx_handler,
 			  irqflags, ctx->name, ctx);
 	if (ret) {
@@ -351,6 +354,8 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
 	if (ctx) {
 		vfio_virqfd_disable(&ctx->unmask);
 		vfio_virqfd_disable(&ctx->mask);
+		if (!vdev->pci_2_3)
+			irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
 		free_irq(pdev->irq, ctx);
 		if (ctx->trigger)
 			eventfd_ctx_put(ctx->trigger);
-- 
2.39.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-09 20:48 [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices Timothy Pearson
@ 2025-09-19 18:56 ` Alex Williamson
  2025-09-19 20:51   ` Timothy Pearson
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2025-09-19 18:56 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: kvm, linuxppc-dev

On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

> PCI devices prior to PCI 2.3 both use level interrupts and do not support
> interrupt masking, leading to a failure when passed through to a KVM guest on
> at least the ppc64 platform, which does not utilize the resample IRQFD. This
> failure manifests as receiving and acknowledging a single interrupt in the guest
> while leaving the host physical device VFIO IRQ pending.
> 
> Level interrupts in general require special handling due to their inherently
> asynchronous nature; both the host and guest interrupt controller need to
> remain in synchronization in order to coordinate mask and unmask operations.
> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
> occurs:
>
>  * Level IRQ assertion on host
>  * IRQ trigger within host interrupt controller, routed to VFIO driver
>  * Host EOI with hardware level IRQ still asserted
>  * Software mask of interrupt source by VFIO driver
>  * Generation of event and IRQ trigger in KVM guest interrupt controller
>  * Level IRQ deassertion on host
>  * Guest EOI
>  * Guest IRQ level deassertion
>  * Removal of software mask by VFIO driver
> 
> Note that no actual state change occurs within the host interrupt controller,
> unlike what would happen with either DisINTx+ hardware or message interrupts.
> The host EOI is not fired with the hardware level IRQ deasserted, and the
> level interrupt is not re-armed within the host interrupt controller, leading
> to an unrecoverable stall of the device.
> 
> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.

I'm not really following here.  It's claimed above that no actual state
change occurs within the host interrupt controller, but that's exactly
what disable_irq_nosync() intends to do, mask the interrupt line at the
controller.  The lazy optimization that's being proposed here should
only change the behavior such that the interrupt is masked at the call
to disable_irq_nosync() rather than at a subsequent re-assertion of the
interrupt.  In any case, enable_irq() should mark the line enabled and
reenable the controller if necessary.

Also, contrary to above, when a device supports DisINT+ we're not
manipulating the host controller.  We're able to mask the interrupt at
the device.  MSI is edge triggered, we don't mask it, so it's not
relevant to this discussion afaict.

There may be good reason to disable the lazy masking behavior as you're
proposing, but I'm not able to glean it from this discussion of the
issue.

> 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 123298a4dc8f..011169ca7a34 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -304,6 +304,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device *vdev,
>  
>  	vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX;
>  
> +	if (is_intx(vdev) && !vdev->pci_2_3)

We just set irq_type, which is what is_intx() tests, how could it be
anything other?  Thanks,

Alex

> +		irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
> +
>  	ret = request_irq(pdev->irq, vfio_intx_handler,
>  			  irqflags, ctx->name, ctx);
>  	if (ret) {
> @@ -351,6 +354,8 @@ static void vfio_intx_disable(struct vfio_pci_core_device *vdev)
>  	if (ctx) {
>  		vfio_virqfd_disable(&ctx->unmask);
>  		vfio_virqfd_disable(&ctx->mask);
> +		if (!vdev->pci_2_3)
> +			irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
>  		free_irq(pdev->irq, ctx);
>  		if (ctx->trigger)
>  			eventfd_ctx_put(ctx->trigger);



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-19 18:56 ` Alex Williamson
@ 2025-09-19 20:51   ` Timothy Pearson
  2025-09-19 22:27     ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Timothy Pearson @ 2025-09-19 20:51 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linuxppc-dev

----- Original Message -----
> From: "Alex Williamson" <alex.williamson@redhat.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, September 19, 2025 1:56:03 PM
> Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices

> On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> Timothy Pearson <tpearson@raptorengineering.com> wrote:
> 
>> PCI devices prior to PCI 2.3 both use level interrupts and do not support
>> interrupt masking, leading to a failure when passed through to a KVM guest on
>> at least the ppc64 platform, which does not utilize the resample IRQFD. This
>> failure manifests as receiving and acknowledging a single interrupt in the guest
>> while leaving the host physical device VFIO IRQ pending.
>> 
>> Level interrupts in general require special handling due to their inherently
>> asynchronous nature; both the host and guest interrupt controller need to
>> remain in synchronization in order to coordinate mask and unmask operations.
>> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
>> occurs:
>>
>>  * Level IRQ assertion on host
>>  * IRQ trigger within host interrupt controller, routed to VFIO driver
>>  * Host EOI with hardware level IRQ still asserted
>>  * Software mask of interrupt source by VFIO driver
>>  * Generation of event and IRQ trigger in KVM guest interrupt controller
>>  * Level IRQ deassertion on host
>>  * Guest EOI
>>  * Guest IRQ level deassertion
>>  * Removal of software mask by VFIO driver
>> 
>> Note that no actual state change occurs within the host interrupt controller,
>> unlike what would happen with either DisINTx+ hardware or message interrupts.
>> The host EOI is not fired with the hardware level IRQ deasserted, and the
>> level interrupt is not re-armed within the host interrupt controller, leading
>> to an unrecoverable stall of the device.
>> 
>> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
> 
> I'm not really following here.  It's claimed above that no actual state
> change occurs within the host interrupt controller, but that's exactly
> what disable_irq_nosync() intends to do, mask the interrupt line at the
> controller.

While it seems that way on the surface (and this tripped me up originally), the actual call chain is:

disable_irq_nosync()
__disable_irq_nosync()
__disable_irq()
irq_disable()

Inside void irq_disable(), __irq_disable() is gated on irq_settings_disable_unlazy().  The lazy disable is intended to *not* touch the interrupt controller itself, instead lazy mode masks the interrupt at the device level (DisINT+ registers).  If the IRQ is set up to run in lazy mode, the interrupt is not disabled at the actual interrupt controller by disable_irq_nosync().

> The lazy optimization that's being proposed here should
> only change the behavior such that the interrupt is masked at the call
> to disable_irq_nosync() rather than at a subsequent re-assertion of the
> interrupt.  In any case, enable_irq() should mark the line enabled and
> reenable the controller if necessary.

If the interrupt was not disabled at the controller, then reenabling a level interrupt is not guaranteed to actually do anything (although it *might*).  The hardware in the interrupt controller will still "see" an active level assert for which it fired an interrupt without a prior acknowledge (or disable/enable cycle) from software, and can then decide to not re-raise the IRQ on a platform-specific basis.

The key here is that the interrupt controllers differ somewhat in behavior across various architectures.  On POWER, the controller will only raise the external processor interrupt once for each level interrupt when that interrupt changes state to asserted, and will only re-raise the external processor interrupt once an acknowledge for that interrupt has been sent to the interrupt controller hardware while the level interrupt is deasserted.  As a result, if the interrupt handler executes (acknowledging the interrupt), but does not first clear the interrupt on the device itself, the interrupt controller will never re-raise that interrupt -- from its perspective, it has issued another IRQ (because the device level interrupt was left asserted) and the associated handler has never completed.  Disabling the interrupt causes the controller to reassert the interrupt if the level interrupt is still asserted when the interrupt is reenabled at the controller level.

On other platforms the external processor interrupt itself is disabled until the interrupt handler has finished, and the controller doesn't auto-mask the level interrupts at the hardware level; instead, it will happily re-assert the processor interrupt if the interrupt was not cleared at the device level after IRQ acknowledge.  I suspect on those platforms this bug may be masked at the expense of a bunch of "spurious" / unwanted interrupts if the interrupt handler hasn't acked the interrupt at the device level; as long as the guest interrupt handler is able to somewhat rapidly clear the device interrupt, performance won't be impacted too much by the extra interrupt load, further hiding the bug on these platforms.

Again, this is also a specific and unusual case of an old level-driven interrupt device that doesn't support interrupt masking at the device level (i.e. the device is DisINT-), in combination with the VFIO driver.  Under that *specific* use case, the VFIO driver purposefully acnowledges the interrupt without first clearing the interrupt on the device, which then exposes the platform-specific differences in interrupt controller behavior.

> Also, contrary to above, when a device supports DisINT+ we're not
> manipulating the host controller.  We're able to mask the interrupt at
> the device.  MSI is edge triggered, we don't mask it, so it's not
> relevant to this discussion afaict.

That's correct -- I don't recall saying the opposite. ;)  If I did, I apologize.

> There may be good reason to disable the lazy masking behavior as you're
> proposing, but I'm not able to glean it from this discussion of the
> issue.

Does this help clarify anything?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-19 20:51   ` Timothy Pearson
@ 2025-09-19 22:27     ` Alex Williamson
  2025-09-20 19:25       ` Timothy Pearson
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2025-09-19 22:27 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: kvm, linuxppc-dev

On Fri, 19 Sep 2025 15:51:14 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

> ----- Original Message -----
> > From: "Alex Williamson" <alex.williamson@redhat.com>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Friday, September 19, 2025 1:56:03 PM
> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices  
> 
> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> > Timothy Pearson <tpearson@raptorengineering.com> wrote:
> >   
> >> PCI devices prior to PCI 2.3 both use level interrupts and do not support
> >> interrupt masking, leading to a failure when passed through to a KVM guest on
> >> at least the ppc64 platform, which does not utilize the resample IRQFD. This
> >> failure manifests as receiving and acknowledging a single interrupt in the guest
> >> while leaving the host physical device VFIO IRQ pending.
> >> 
> >> Level interrupts in general require special handling due to their inherently
> >> asynchronous nature; both the host and guest interrupt controller need to
> >> remain in synchronization in order to coordinate mask and unmask operations.
> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
> >> occurs:
> >>
> >>  * Level IRQ assertion on host
> >>  * IRQ trigger within host interrupt controller, routed to VFIO driver
> >>  * Host EOI with hardware level IRQ still asserted
> >>  * Software mask of interrupt source by VFIO driver
> >>  * Generation of event and IRQ trigger in KVM guest interrupt controller
> >>  * Level IRQ deassertion on host
> >>  * Guest EOI
> >>  * Guest IRQ level deassertion
> >>  * Removal of software mask by VFIO driver
> >> 
> >> Note that no actual state change occurs within the host interrupt controller,
> >> unlike what would happen with either DisINTx+ hardware or message interrupts.
> >> The host EOI is not fired with the hardware level IRQ deasserted, and the
> >> level interrupt is not re-armed within the host interrupt controller, leading
> >> to an unrecoverable stall of the device.
> >> 
> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.  
> > 
> > I'm not really following here.  It's claimed above that no actual state
> > change occurs within the host interrupt controller, but that's exactly
> > what disable_irq_nosync() intends to do, mask the interrupt line at the
> > controller.  
> 
> While it seems that way on the surface (and this tripped me up
> originally), the actual call chain is:
> 
> disable_irq_nosync()
> __disable_irq_nosync()
> __disable_irq()
> irq_disable()
> 
> Inside void irq_disable(), __irq_disable() is gated on
> irq_settings_disable_unlazy().  The lazy disable is intended to *not*
> touch the interrupt controller itself, instead lazy mode masks the
> interrupt at the device level (DisINT+ registers).  If the IRQ is set
> up to run in lazy mode, the interrupt is not disabled at the actual
> interrupt controller by disable_irq_nosync().

What chip handler are you using?  The comment above irq_disable
reiterates the behavior, yes if the chip doesn't support irq_disable it
marks the interrupt masked but leaves the hardware unmasked.  It does
not describe using DisINTx to mask the device, which would be at a
different level from the chip.  In this case __irq_disable() just calls
irq_state_set_disabled().  Only with the change proposed here would we
also call mask_irq().
 
> > The lazy optimization that's being proposed here should
> > only change the behavior such that the interrupt is masked at the
> > call to disable_irq_nosync() rather than at a subsequent
> > re-assertion of the interrupt.  In any case, enable_irq() should
> > mark the line enabled and reenable the controller if necessary.  
> 
> If the interrupt was not disabled at the controller, then reenabling
> a level interrupt is not guaranteed to actually do anything (although
> it *might*).  The hardware in the interrupt controller will still
> "see" an active level assert for which it fired an interrupt without
> a prior acknowledge (or disable/enable cycle) from software, and can
> then decide to not re-raise the IRQ on a platform-specific basis.
> 
> The key here is that the interrupt controllers differ somewhat in
> behavior across various architectures.  On POWER, the controller will
> only raise the external processor interrupt once for each level
> interrupt when that interrupt changes state to asserted, and will
> only re-raise the external processor interrupt once an acknowledge
> for that interrupt has been sent to the interrupt controller hardware
> while the level interrupt is deasserted.  As a result, if the
> interrupt handler executes (acknowledging the interrupt), but does
> not first clear the interrupt on the device itself, the interrupt
> controller will never re-raise that interrupt -- from its
> perspective, it has issued another IRQ (because the device level
> interrupt was left asserted) and the associated handler has never
> completed.  Disabling the interrupt causes the controller to reassert
> the interrupt if the level interrupt is still asserted when the
> interrupt is reenabled at the controller level.

This sounds a lot more like the problem than the previous description.
Is the actual scenario something like the irq is marked disabled, the
eventfd is delivered to userspace, userspace handles the device, the
interrupt is de-asserted at the device, but then the device re-asserts
the interrupt before the unmask ioctl, causing the interrupt chip to
mask the interrupt, then enable_irq() from the unmask ioctl doesn't
reassert the interrupt?

> On other platforms the external processor interrupt itself is
> disabled until the interrupt handler has finished, and the controller
> doesn't auto-mask the level interrupts at the hardware level;
> instead, it will happily re-assert the processor interrupt if the
> interrupt was not cleared at the device level after IRQ acknowledge.
> I suspect on those platforms this bug may be masked at the expense of
> a bunch of "spurious" / unwanted interrupts if the interrupt handler
> hasn't acked the interrupt at the device level; as long as the guest
> interrupt handler is able to somewhat rapidly clear the device
> interrupt, performance won't be impacted too much by the extra
> interrupt load, further hiding the bug on these platforms.

It seems this is the trade off the lazy handling makes intentionally,
we risk some spurious interrupts while the line is disabled to avoid
poking the hardware.  So long as we're not triggering the spurious
interrupt handler to permanently disabling the interrupt line, this is a
valid choice.

That's also a consideration for this patch, we're slowing down all
non-PCI2.3 INTx handling for the behavior of this platform.  Is there
some behavior of the interrupt controller we can query to know which
path to use?  Can this issue be seen with the irqfd INTx handling
disabled on other architectures?  Do we actually care whether we're
making old INTx devices slower?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-19 22:27     ` Alex Williamson
@ 2025-09-20 19:25       ` Timothy Pearson
  2025-09-22 16:01         ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Timothy Pearson @ 2025-09-20 19:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linuxppc-dev



----- Original Message -----
> From: "Alex Williamson" <alex.williamson@redhat.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Friday, September 19, 2025 5:27:21 PM
> Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices

> On Fri, 19 Sep 2025 15:51:14 -0500 (CDT)
> Timothy Pearson <tpearson@raptorengineering.com> wrote:
> 
>> ----- Original Message -----
>> > From: "Alex Williamson" <alex.williamson@redhat.com>
>> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> > Sent: Friday, September 19, 2025 1:56:03 PM
>> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
>> 
>> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
>> > Timothy Pearson <tpearson@raptorengineering.com> wrote:
>> >   
>> >> PCI devices prior to PCI 2.3 both use level interrupts and do not support
>> >> interrupt masking, leading to a failure when passed through to a KVM guest on
>> >> at least the ppc64 platform, which does not utilize the resample IRQFD. This
>> >> failure manifests as receiving and acknowledging a single interrupt in the guest
>> >> while leaving the host physical device VFIO IRQ pending.
>> >> 
>> >> Level interrupts in general require special handling due to their inherently
>> >> asynchronous nature; both the host and guest interrupt controller need to
>> >> remain in synchronization in order to coordinate mask and unmask operations.
>> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
>> >> occurs:
>> >>
>> >>  * Level IRQ assertion on host
>> >>  * IRQ trigger within host interrupt controller, routed to VFIO driver
>> >>  * Host EOI with hardware level IRQ still asserted
>> >>  * Software mask of interrupt source by VFIO driver
>> >>  * Generation of event and IRQ trigger in KVM guest interrupt controller
>> >>  * Level IRQ deassertion on host
>> >>  * Guest EOI
>> >>  * Guest IRQ level deassertion
>> >>  * Removal of software mask by VFIO driver
>> >> 
>> >> Note that no actual state change occurs within the host interrupt controller,
>> >> unlike what would happen with either DisINTx+ hardware or message interrupts.
>> >> The host EOI is not fired with the hardware level IRQ deasserted, and the
>> >> level interrupt is not re-armed within the host interrupt controller, leading
>> >> to an unrecoverable stall of the device.
>> >> 
>> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
>> > 
>> > I'm not really following here.  It's claimed above that no actual state
>> > change occurs within the host interrupt controller, but that's exactly
>> > what disable_irq_nosync() intends to do, mask the interrupt line at the
>> > controller.
>> 
>> While it seems that way on the surface (and this tripped me up
>> originally), the actual call chain is:
>> 
>> disable_irq_nosync()
>> __disable_irq_nosync()
>> __disable_irq()
>> irq_disable()
>> 
>> Inside void irq_disable(), __irq_disable() is gated on
>> irq_settings_disable_unlazy().  The lazy disable is intended to *not*
>> touch the interrupt controller itself, instead lazy mode masks the
>> interrupt at the device level (DisINT+ registers).  If the IRQ is set
>> up to run in lazy mode, the interrupt is not disabled at the actual
>> interrupt controller by disable_irq_nosync().
> 
> What chip handler are you using?  The comment above irq_disable
> reiterates the behavior, yes if the chip doesn't support irq_disable it
> marks the interrupt masked but leaves the hardware unmasked.  It does
> not describe using DisINTx to mask the device, which would be at a
> different level from the chip.  In this case __irq_disable() just calls
> irq_state_set_disabled().  Only with the change proposed here would we
> also call mask_irq().

This is all tested on a POWER XIVE controller, so arch/powerpc/sysdev/xive (CONFIG_PPC_XIVE_NATIVE=y)

>> > The lazy optimization that's being proposed here should
>> > only change the behavior such that the interrupt is masked at the
>> > call to disable_irq_nosync() rather than at a subsequent
>> > re-assertion of the interrupt.  In any case, enable_irq() should
>> > mark the line enabled and reenable the controller if necessary.
>> 
>> If the interrupt was not disabled at the controller, then reenabling
>> a level interrupt is not guaranteed to actually do anything (although
>> it *might*).  The hardware in the interrupt controller will still
>> "see" an active level assert for which it fired an interrupt without
>> a prior acknowledge (or disable/enable cycle) from software, and can
>> then decide to not re-raise the IRQ on a platform-specific basis.
>> 
>> The key here is that the interrupt controllers differ somewhat in
>> behavior across various architectures.  On POWER, the controller will
>> only raise the external processor interrupt once for each level
>> interrupt when that interrupt changes state to asserted, and will
>> only re-raise the external processor interrupt once an acknowledge
>> for that interrupt has been sent to the interrupt controller hardware
>> while the level interrupt is deasserted.  As a result, if the
>> interrupt handler executes (acknowledging the interrupt), but does
>> not first clear the interrupt on the device itself, the interrupt
>> controller will never re-raise that interrupt -- from its
>> perspective, it has issued another IRQ (because the device level
>> interrupt was left asserted) and the associated handler has never
>> completed.  Disabling the interrupt causes the controller to reassert
>> the interrupt if the level interrupt is still asserted when the
>> interrupt is reenabled at the controller level.
> 
> This sounds a lot more like the problem than the previous description.

Apologies for that.  There's a lot of moving parts here and I guess I muddled it all up in the first description.

> Is the actual scenario something like the irq is marked disabled, the
> eventfd is delivered to userspace, userspace handles the device, the
> interrupt is de-asserted at the device, but then the device re-asserts
> the interrupt before the unmask ioctl, causing the interrupt chip to
> mask the interrupt, then enable_irq() from the unmask ioctl doesn't
> reassert the interrupt?

That is exactly it, yes.  This particular scenario only occurs on old pre-PCI 2.3 devices that advertise DisINT- ; newer devices that advertise DisINT+ don't trip the fault since the host interrupt handler sets the device interrupt mask (thus deasserting the IRQ at the interrupt controller) before exiting.

>> On other platforms the external processor interrupt itself is
>> disabled until the interrupt handler has finished, and the controller
>> doesn't auto-mask the level interrupts at the hardware level;
>> instead, it will happily re-assert the processor interrupt if the
>> interrupt was not cleared at the device level after IRQ acknowledge.
>> I suspect on those platforms this bug may be masked at the expense of
>> a bunch of "spurious" / unwanted interrupts if the interrupt handler
>> hasn't acked the interrupt at the device level; as long as the guest
>> interrupt handler is able to somewhat rapidly clear the device
>> interrupt, performance won't be impacted too much by the extra
>> interrupt load, further hiding the bug on these platforms.
> 
> It seems this is the trade off the lazy handling makes intentionally,
> we risk some spurious interrupts while the line is disabled to avoid
> poking the hardware.  So long as we're not triggering the spurious
> interrupt handler to permanently disabling the interrupt line, this is a
> valid choice.

At least for some platforms, yes, though it's not exactly clear in the documentation that this is intentional or that the side effect exists at all.

> That's also a consideration for this patch, we're slowing down all
> non-PCI2.3 INTx handling for the behavior of this platform.  Is there
> some behavior of the interrupt controller we can query to know which
> path to use?  Can this issue be seen with the irqfd INTx handling
> disabled on other architectures?  Do we actually care whether we're
> making old INTx devices slower?  Thanks,

I honestly don't know.  In reality, we're talking about amd64 and ppc64el as the only two platforms that I know of that both support VFIO and such old PCIe INTX pre-2.3 devices, and their respective interrupt controller handling in a spurious INTX IRQ context is very different from one another.

Personally, I'd argue that such old devices were intended to work with much slower host systems, therefore the slowdown probably doesn't matter vs. being more correct in terms of interrupt handling.  In terms of general kernel design, my understanding has always been is that best practice is to always mask, disable, or clear a level interrupt before exiting the associated IRQ handler, and the current design seems to violate that rule.  In that context, I'd personally want to see an argument as to why echewing this traditional IRQ handler design is beneficial enough to justify making the VFIO driver dependent on platform-specific behavior.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-20 19:25       ` Timothy Pearson
@ 2025-09-22 16:01         ` Alex Williamson
  2025-09-22 16:34           ` Timothy Pearson
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2025-09-22 16:01 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: kvm, linuxppc-dev

On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

> ----- Original Message -----
> > From: "Alex Williamson" <alex.williamson@redhat.com>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Friday, September 19, 2025 5:27:21 PM
> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices  
> 
> > On Fri, 19 Sep 2025 15:51:14 -0500 (CDT)
> > Timothy Pearson <tpearson@raptorengineering.com> wrote:
> >   
> >> ----- Original Message -----  
> >> > From: "Alex Williamson" <alex.williamson@redhat.com>
> >> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> >> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> >> > Sent: Friday, September 19, 2025 1:56:03 PM
> >> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices  
> >>   
> >> > On Tue, 9 Sep 2025 15:48:46 -0500 (CDT)
> >> > Timothy Pearson <tpearson@raptorengineering.com> wrote:
> >> >     
> >> >> PCI devices prior to PCI 2.3 both use level interrupts and do not support
> >> >> interrupt masking, leading to a failure when passed through to a KVM guest on
> >> >> at least the ppc64 platform, which does not utilize the resample IRQFD. This
> >> >> failure manifests as receiving and acknowledging a single interrupt in the guest
> >> >> while leaving the host physical device VFIO IRQ pending.
> >> >> 
> >> >> Level interrupts in general require special handling due to their inherently
> >> >> asynchronous nature; both the host and guest interrupt controller need to
> >> >> remain in synchronization in order to coordinate mask and unmask operations.
> >> >> When lazy IRQ masking is used on DisINTx- hardware, the following sequence
> >> >> occurs:
> >> >>
> >> >>  * Level IRQ assertion on host
> >> >>  * IRQ trigger within host interrupt controller, routed to VFIO driver
> >> >>  * Host EOI with hardware level IRQ still asserted
> >> >>  * Software mask of interrupt source by VFIO driver
> >> >>  * Generation of event and IRQ trigger in KVM guest interrupt controller
> >> >>  * Level IRQ deassertion on host
> >> >>  * Guest EOI
> >> >>  * Guest IRQ level deassertion
> >> >>  * Removal of software mask by VFIO driver
> >> >> 
> >> >> Note that no actual state change occurs within the host interrupt controller,
> >> >> unlike what would happen with either DisINTx+ hardware or message interrupts.
> >> >> The host EOI is not fired with the hardware level IRQ deasserted, and the
> >> >> level interrupt is not re-armed within the host interrupt controller, leading
> >> >> to an unrecoverable stall of the device.
> >> >> 
> >> >> Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.  
> >> > 
> >> > I'm not really following here.  It's claimed above that no actual state
> >> > change occurs within the host interrupt controller, but that's exactly
> >> > what disable_irq_nosync() intends to do, mask the interrupt line at the
> >> > controller.  
> >> 
> >> While it seems that way on the surface (and this tripped me up
> >> originally), the actual call chain is:
> >> 
> >> disable_irq_nosync()
> >> __disable_irq_nosync()
> >> __disable_irq()
> >> irq_disable()
> >> 
> >> Inside void irq_disable(), __irq_disable() is gated on
> >> irq_settings_disable_unlazy().  The lazy disable is intended to *not*
> >> touch the interrupt controller itself, instead lazy mode masks the
> >> interrupt at the device level (DisINT+ registers).  If the IRQ is set
> >> up to run in lazy mode, the interrupt is not disabled at the actual
> >> interrupt controller by disable_irq_nosync().  
> > 
> > What chip handler are you using?  The comment above irq_disable
> > reiterates the behavior, yes if the chip doesn't support irq_disable it
> > marks the interrupt masked but leaves the hardware unmasked.  It does
> > not describe using DisINTx to mask the device, which would be at a
> > different level from the chip.  In this case __irq_disable() just calls
> > irq_state_set_disabled().  Only with the change proposed here would we
> > also call mask_irq().  
> 
> This is all tested on a POWER XIVE controller, so arch/powerpc/sysdev/xive (CONFIG_PPC_XIVE_NATIVE=y)
> 
> >> > The lazy optimization that's being proposed here should
> >> > only change the behavior such that the interrupt is masked at the
> >> > call to disable_irq_nosync() rather than at a subsequent
> >> > re-assertion of the interrupt.  In any case, enable_irq() should
> >> > mark the line enabled and reenable the controller if necessary.  
> >> 
> >> If the interrupt was not disabled at the controller, then reenabling
> >> a level interrupt is not guaranteed to actually do anything (although
> >> it *might*).  The hardware in the interrupt controller will still
> >> "see" an active level assert for which it fired an interrupt without
> >> a prior acknowledge (or disable/enable cycle) from software, and can
> >> then decide to not re-raise the IRQ on a platform-specific basis.
> >> 
> >> The key here is that the interrupt controllers differ somewhat in
> >> behavior across various architectures.  On POWER, the controller will
> >> only raise the external processor interrupt once for each level
> >> interrupt when that interrupt changes state to asserted, and will
> >> only re-raise the external processor interrupt once an acknowledge
> >> for that interrupt has been sent to the interrupt controller hardware
> >> while the level interrupt is deasserted.  As a result, if the
> >> interrupt handler executes (acknowledging the interrupt), but does
> >> not first clear the interrupt on the device itself, the interrupt
> >> controller will never re-raise that interrupt -- from its
> >> perspective, it has issued another IRQ (because the device level
> >> interrupt was left asserted) and the associated handler has never
> >> completed.  Disabling the interrupt causes the controller to reassert
> >> the interrupt if the level interrupt is still asserted when the
> >> interrupt is reenabled at the controller level.  
> > 
> > This sounds a lot more like the problem than the previous description.  
> 
> Apologies for that.  There's a lot of moving parts here and I guess I muddled it all up in the first description.
> 
> > Is the actual scenario something like the irq is marked disabled, the
> > eventfd is delivered to userspace, userspace handles the device, the
> > interrupt is de-asserted at the device, but then the device re-asserts
> > the interrupt before the unmask ioctl, causing the interrupt chip to
> > mask the interrupt, then enable_irq() from the unmask ioctl doesn't
> > reassert the interrupt?  
> 
> That is exactly it, yes.  This particular scenario only occurs on old
> pre-PCI 2.3 devices that advertise DisINT- ; newer devices that
> advertise DisINT+ don't trip the fault since the host interrupt
> handler sets the device interrupt mask (thus deasserting the IRQ at
> the interrupt controller) before exiting.
> 
> >> On other platforms the external processor interrupt itself is
> >> disabled until the interrupt handler has finished, and the
> >> controller doesn't auto-mask the level interrupts at the hardware
> >> level; instead, it will happily re-assert the processor interrupt
> >> if the interrupt was not cleared at the device level after IRQ
> >> acknowledge. I suspect on those platforms this bug may be masked
> >> at the expense of a bunch of "spurious" / unwanted interrupts if
> >> the interrupt handler hasn't acked the interrupt at the device
> >> level; as long as the guest interrupt handler is able to somewhat
> >> rapidly clear the device interrupt, performance won't be impacted
> >> too much by the extra interrupt load, further hiding the bug on
> >> these platforms.  
> > 
> > It seems this is the trade off the lazy handling makes
> > intentionally, we risk some spurious interrupts while the line is
> > disabled to avoid poking the hardware.  So long as we're not
> > triggering the spurious interrupt handler to permanently disabling
> > the interrupt line, this is a valid choice.  
> 
> At least for some platforms, yes, though it's not exactly clear in
> the documentation that this is intentional or that the side effect
> exists at all.
> 
> > That's also a consideration for this patch, we're slowing down all
> > non-PCI2.3 INTx handling for the behavior of this platform.  Is
> > there some behavior of the interrupt controller we can query to
> > know which path to use?  Can this issue be seen with the irqfd INTx
> > handling disabled on other architectures?  Do we actually care
> > whether we're making old INTx devices slower?  Thanks,  
> 
> I honestly don't know.  In reality, we're talking about amd64 and
> ppc64el as the only two platforms that I know of that both support
> VFIO and such old PCIe INTX pre-2.3 devices, and their respective
> interrupt controller handling in a spurious INTX IRQ context is very
> different from one another.
> 
> Personally, I'd argue that such old devices were intended to work
> with much slower host systems, therefore the slowdown probably
> doesn't matter vs. being more correct in terms of interrupt handling.
>  In terms of general kernel design, my understanding has always been
> is that best practice is to always mask, disable, or clear a level
> interrupt before exiting the associated IRQ handler, and the current
> design seems to violate that rule.  In that context, I'd personally
> want to see an argument as to why echewing this traditional IRQ
> handler design is beneficial enough to justify making the VFIO driver
> dependent on platform-specific behavior.

Yep, I kind of agree.  The unlazy flag seems to provide the more
intended behavior.  It moves the irq chip masking into the fast path,
whereas it would have been asynchronous on a subsequent interrupt
previously, but the impact is only to ancient devices operating in INTx
mode, so as long as we can verify those still work on both ppc and x86,
I don't think it's worth complicating the code to make setting the
unlazy flag conditional on anything other than the device support.

Care to send out a new version documenting the actual sequence fixed by
this change and updating the code based on this thread?  Note that we
can test non-pci2.3 mode for any device/driver that supports INTx using
the nointxmask=1 option for vfio-pci and booting a linux guest with
pci=nomsi.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-22 16:01         ` Alex Williamson
@ 2025-09-22 16:34           ` Timothy Pearson
  2025-09-22 16:46             ` Alex Williamson
  0 siblings, 1 reply; 9+ messages in thread
From: Timothy Pearson @ 2025-09-22 16:34 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Timothy Pearson, kvm, linuxppc-dev



----- Original Message -----
> From: "Alex Williamson" <alex.williamson@redhat.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, September 22, 2025 11:01:43 AM
> Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices

> On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
> Timothy Pearson <tpearson@raptorengineering.com> wrote:
>> Personally, I'd argue that such old devices were intended to work
>> with much slower host systems, therefore the slowdown probably
>> doesn't matter vs. being more correct in terms of interrupt handling.
>>  In terms of general kernel design, my understanding has always been
>> is that best practice is to always mask, disable, or clear a level
>> interrupt before exiting the associated IRQ handler, and the current
>> design seems to violate that rule.  In that context, I'd personally
>> want to see an argument as to why echewing this traditional IRQ
>> handler design is beneficial enough to justify making the VFIO driver
>> dependent on platform-specific behavior.
> 
> Yep, I kind of agree.  The unlazy flag seems to provide the more
> intended behavior.  It moves the irq chip masking into the fast path,
> whereas it would have been asynchronous on a subsequent interrupt
> previously, but the impact is only to ancient devices operating in INTx
> mode, so as long as we can verify those still work on both ppc and x86,
> I don't think it's worth complicating the code to make setting the
> unlazy flag conditional on anything other than the device support.
> 
> Care to send out a new version documenting the actual sequence fixed by
> this change and updating the code based on this thread?  Note that we
> can test non-pci2.3 mode for any device/driver that supports INTx using
> the nointxmask=1 option for vfio-pci and booting a linux guest with
> pci=nomsi.  Thanks,
> 
> Alex

Sure, I can update the commit message easily enough, but I must have missed something in regard to a needed code update.  The existing patch only sets unlazy for non-PCI 2.3 INTX devices, and as I understand it that's the behavior we have both agreed on at this point?

I've tested this on ppc64el and it works quite well, repairing the broken behavior where the guest would receive exactly one interrupt on the legacy PCI device per boot.  I don't have amd64 systems available to test on, however.

Thanks!


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-22 16:34           ` Timothy Pearson
@ 2025-09-22 16:46             ` Alex Williamson
  2025-09-22 17:23               ` Timothy Pearson
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2025-09-22 16:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: kvm, linuxppc-dev

On Mon, 22 Sep 2025 11:34:23 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

> ----- Original Message -----
> > From: "Alex Williamson" <alex.williamson@redhat.com>
> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> > Sent: Monday, September 22, 2025 11:01:43 AM
> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices  
> 
> > On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
> > Timothy Pearson <tpearson@raptorengineering.com> wrote:  
> >> Personally, I'd argue that such old devices were intended to work
> >> with much slower host systems, therefore the slowdown probably
> >> doesn't matter vs. being more correct in terms of interrupt handling.
> >>  In terms of general kernel design, my understanding has always been
> >> is that best practice is to always mask, disable, or clear a level
> >> interrupt before exiting the associated IRQ handler, and the current
> >> design seems to violate that rule.  In that context, I'd personally
> >> want to see an argument as to why echewing this traditional IRQ
> >> handler design is beneficial enough to justify making the VFIO driver
> >> dependent on platform-specific behavior.  
> > 
> > Yep, I kind of agree.  The unlazy flag seems to provide the more
> > intended behavior.  It moves the irq chip masking into the fast path,
> > whereas it would have been asynchronous on a subsequent interrupt
> > previously, but the impact is only to ancient devices operating in INTx
> > mode, so as long as we can verify those still work on both ppc and x86,
> > I don't think it's worth complicating the code to make setting the
> > unlazy flag conditional on anything other than the device support.
> > 
> > Care to send out a new version documenting the actual sequence fixed by
> > this change and updating the code based on this thread?  Note that we
> > can test non-pci2.3 mode for any device/driver that supports INTx using
> > the nointxmask=1 option for vfio-pci and booting a linux guest with
> > pci=nomsi.  Thanks,
> > 
> > Alex  
> 
> Sure, I can update the commit message easily enough, but I must have
> missed something in regard to a needed code update.  The existing
> patch only sets unlazy for non-PCI 2.3 INTX devices, and as I
> understand it that's the behavior we have both agreed on at this
> point?

I had commented[1] that testing the interrupt type immediately after
setting the interrupt type is redundant.  Also, looking again, if we
set the flag before request_irq, it seems logical that we'd clear the
flag after free_irq.  I think there are also some unaccounted error
paths where we can set the flag without clearing it that need to be
considered.

> I've tested this on ppc64el and it works quite well, repairing the
> broken behavior where the guest would receive exactly one interrupt
> on the legacy PCI device per boot.  I don't have amd64 systems
> available to test on, however.

Noted, I'll incorporate some targeted testing here.  Thanks,

Alex

[1]https://lore.kernel.org/all/20250919125603.08f600ac.alex.williamson@redhat.com/



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
  2025-09-22 16:46             ` Alex Williamson
@ 2025-09-22 17:23               ` Timothy Pearson
  0 siblings, 0 replies; 9+ messages in thread
From: Timothy Pearson @ 2025-09-22 17:23 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linuxppc-dev



----- Original Message -----
> From: "Alex Williamson" <alex.williamson@redhat.com>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
> Sent: Monday, September 22, 2025 11:46:58 AM
> Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices

> On Mon, 22 Sep 2025 11:34:23 -0500 (CDT)
> Timothy Pearson <tpearson@raptorengineering.com> wrote:
> 
>> ----- Original Message -----
>> > From: "Alex Williamson" <alex.williamson@redhat.com>
>> > To: "Timothy Pearson" <tpearson@raptorengineering.com>
>> > Cc: "kvm" <kvm@vger.kernel.org>, "linuxppc-dev" <linuxppc-dev@lists.ozlabs.org>
>> > Sent: Monday, September 22, 2025 11:01:43 AM
>> > Subject: Re: [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices
>> 
>> > On Sat, 20 Sep 2025 14:25:03 -0500 (CDT)
>> > Timothy Pearson <tpearson@raptorengineering.com> wrote:
>> >> Personally, I'd argue that such old devices were intended to work
>> >> with much slower host systems, therefore the slowdown probably
>> >> doesn't matter vs. being more correct in terms of interrupt handling.
>> >>  In terms of general kernel design, my understanding has always been
>> >> is that best practice is to always mask, disable, or clear a level
>> >> interrupt before exiting the associated IRQ handler, and the current
>> >> design seems to violate that rule.  In that context, I'd personally
>> >> want to see an argument as to why echewing this traditional IRQ
>> >> handler design is beneficial enough to justify making the VFIO driver
>> >> dependent on platform-specific behavior.
>> > 
>> > Yep, I kind of agree.  The unlazy flag seems to provide the more
>> > intended behavior.  It moves the irq chip masking into the fast path,
>> > whereas it would have been asynchronous on a subsequent interrupt
>> > previously, but the impact is only to ancient devices operating in INTx
>> > mode, so as long as we can verify those still work on both ppc and x86,
>> > I don't think it's worth complicating the code to make setting the
>> > unlazy flag conditional on anything other than the device support.
>> > 
>> > Care to send out a new version documenting the actual sequence fixed by
>> > this change and updating the code based on this thread?  Note that we
>> > can test non-pci2.3 mode for any device/driver that supports INTx using
>> > the nointxmask=1 option for vfio-pci and booting a linux guest with
>> > pci=nomsi.  Thanks,
>> > 
>> > Alex
>> 
>> Sure, I can update the commit message easily enough, but I must have
>> missed something in regard to a needed code update.  The existing
>> patch only sets unlazy for non-PCI 2.3 INTX devices, and as I
>> understand it that's the behavior we have both agreed on at this
>> point?
> 
> I had commented[1] that testing the interrupt type immediately after
> setting the interrupt type is redundant.  Also, looking again, if we
> set the flag before request_irq, it seems logical that we'd clear the
> flag after free_irq.  I think there are also some unaccounted error
> paths where we can set the flag without clearing it that need to be
> considered.

Gotcha, I missed that the first time around.  I also did a quick check for any other exit paths and only saw the MSI exit handlers, which wouln't be relevant here.

An interesting quirk I found while debugging is that the guest will receive quite a few spurious interrupts.  Changing to unlazy IRQ doesn't fix that, it's just how VFIO works with legacy non-PCI 2.3 INTX devices.  Since the host kernel doesn't know how to clear a pending interrupt on the device, it also doesn't know how to check if the asserted interrupt is actually valid or is the result of the deferred eventfd handling flow we've discussed in this thread.  Therefore, it will always send the IRQ to the guest, which has the somewhat annoying but harmelss effect of incrementing the spurious IRQ counters in the guest with certain drivers.

>> I've tested this on ppc64el and it works quite well, repairing the
>> broken behavior where the guest would receive exactly one interrupt
>> on the legacy PCI device per boot.  I don't have amd64 systems
>> available to test on, however.
> 
> Noted, I'll incorporate some targeted testing here.  Thanks,

Sounds good, thanks.  V2 sent.


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-09-22 17:23 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-09 20:48 [PATCH] vfio/pci: Fix INTx handling on legacy DisINTx- PCI devices Timothy Pearson
2025-09-19 18:56 ` Alex Williamson
2025-09-19 20:51   ` Timothy Pearson
2025-09-19 22:27     ` Alex Williamson
2025-09-20 19:25       ` Timothy Pearson
2025-09-22 16:01         ` Alex Williamson
2025-09-22 16:34           ` Timothy Pearson
2025-09-22 16:46             ` Alex Williamson
2025-09-22 17:23               ` Timothy Pearson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox