Re: [patch 1/2] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Thomas Gleixner <tglx@kernel.org>
To: Luigi Rizzo <lrizzo@google.com>
Cc: Marc Zyngier <maz@kernel.org>,
	bhelgaas@google.com, linux-kernel@vger.kernel.org
Subject: Re: [patch 1/2] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag
Date: Fri, 09 Jan 2026 18:01:53 +0100	[thread overview]
Message-ID: <87jyxqiuta.ffs@tglx> (raw)
In-Reply-To: <CAMOZA0JEia2Yz2HC=xz+LbsCPNHfx_KPZx41ohHns0zguQd+rw@mail.gmail.com>

On Fri, Jan 09 2026 at 14:00, Luigi Rizzo wrote:
> On Fri, Jan 9, 2026 at 1:20 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> That's not the point. The point is that _after_ the mask has reached the
>> device it is guaranteed that no further interrupts are generated until
>> unmask() is invoked. That's what is required e.g. for writing the MSI
>> message to the device.

Actually I have to correct myself. We stopped doing the readback on
mask() in pci_write_msg_msix() because the writes are guaranteed to be
ordered, so the device will see them in correct order:

         PCI_MSIX_ENTRY_VECTOR_CTRL  // mask
         PCI_MSIX_ENTRY_LOWER_ADDR   // MSI message address low
         PCI_MSIX_ENTRY_UPPER_ADDR   // MSI message address high
         PCI_MSIX_ENTRY_DATA         // MSI message data
         PCI_MSIX_ENTRY_VECTOR_CTRL  // unmask

which in turn satisfies the specification requirement that the MSIX
entry has to be masked in order to change the message:

  "If software changes the Address or Data value of an entry while the
   entry is unmasked, the result is undefined."

So yes, for the affinity change case, the read back is not required and
not done anymore. The read back was there in the initial implementation
and it turned out to be for paranoia sake.

It's still required to mask, but that's achieved through ordering.

> I keep hearing about this guarantee, which is surely true, but I argue that
> it is useless both during normal activity (when interrupts may be
> occasionally masked during affinity changes) and at shutdown
> (when there is a lot more additional state cleanup done on
> the device, which insures that there are no more interrupts,
> save the one(s) happening between the mask() write and readback,
> which we have no way to block, anyways (and may fire on the CPU at
> some unspecified time, even after the readback, because that does not
> control the pipeline from the device to the interrupt controller etc.),
> and that the kernel stack is well equipped to handle and ignore.

There is a difference and it has nothing to do with other device
cleanups at all. That's a pure interrupt management problem:

free_irq()
   1)   mask()
   2)   synchronize_irq()
   3)   invalidate vector

So if #1 does not guarantee that the interrupt is masked, then #2 is not
guaranteed to see a pending interrupt and #3 can invalidate the vector
before the interrupt becomes pending and raised in a CPU, which then
results in a spurious interrupt warning or in the worst case an IOMMU
exception. While that's "handled" by the kernel somewhat gracefully,
it's not ignored because it's invalid state.

The readback after mask() guarantees that an interrupt which was raised
_before_ the mask write reached the device has arrived at its
destination and is therefore detectable as pending or in progress
_before_ the vector is invalidated. That's simply because the interrupt
message preceeds the read reply and the CPU is stalled until the read
comes back.

> The only, weak, justification that Gemini gives for masking is that

Groan....

> Again, I have seen no explanation so far on why the mask requires a
> guarantee, and as I have tried to explain, the unavoidable stray
> interrupts generated before the readback are as bad (or actually as
> harmless) as the ones that we avoid with the readback.

Stray interrupts are only harmless when the interrupt is in a functional
state, but _not_ when the required resources are not longer available.

So instead of this unholy per CPU state and penalizing the VIRT case,
which Marc is concerned off, something like the uncompiled below should
just work. Then you can use these new callbacks for your mitigation muck
and everything else is unaffected.

Though this mitigation stuff will have serious implications when running
in a VM because the MMIO write will be intercepted, but that's a
different story.

What might be worth to investigate is whether the irq_mask() callback
for MSI-X can use the lazy mechanism (without readback). But that's an
orthogonal issue and needs some thoughts and testing.

Thanks,

        tglx
---
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -162,6 +162,11 @@ static void pci_irq_mask_msix(struct irq
 	pci_msix_mask(irq_data_get_msi_desc(data));
 }
 
+static void pci_irq_mask_msix_lazy(struct irq_data *data)
+{
+	pci_msix_mask_lazy(irq_data_get_msi_desc(data));
+}
+
 static void pci_irq_unmask_msix(struct irq_data *data)
 {
 	pci_msix_unmask(irq_data_get_msi_desc(data));
@@ -182,7 +187,9 @@ static const struct msi_domain_template
 		.irq_startup		= pci_irq_startup_msix,
 		.irq_shutdown		= pci_irq_shutdown_msix,
 		.irq_mask		= pci_irq_mask_msix,
+		.irq_mask_dev		= pci_irq_mask_msix_lazy,
 		.irq_unmask		= pci_irq_unmask_msix,
+		.irq_unmask_dev		= pci_irq_unmask_msix,
 		.irq_write_msi_msg	= pci_msi_domain_write_msg,
 		.flags			= IRQCHIP_ONESHOT_SAFE,
 	},
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -40,10 +40,15 @@ static inline void pci_msix_write_vector
 		writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
 
-static inline void pci_msix_mask(struct msi_desc *desc)
+static inline void pci_msix_mask_lazy(struct msi_desc *desc)
 {
 	desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
+}
+
+static inline void pci_msix_mask(struct msi_desc *desc)
+{
+	pci_msix_mask_lazy(desc);
 	/* Flush write to device */
 	readl(desc->pci.mask_base);
 }
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -451,7 +451,11 @@ static inline irq_hw_number_t irqd_to_hw
  * @irq_ack:		start of a new interrupt
  * @irq_mask:		mask an interrupt source
  * @irq_mask_ack:	ack and mask an interrupt source
+ * @irq_mask_dev:	Optional callback to mask at the device level for
+ *			interrupt moderation purposes. Only valid for the outermost
+ *			interrupt chip in a hierarchy.
  * @irq_unmask:		unmask an interrupt source
+ * @irq_unmask_dev:	Counterpart to @irq_mask_dev (required when @irq_mask_dev is not NULL)
  * @irq_eoi:		end of interrupt
  * @irq_set_affinity:	Set the CPU affinity on SMP machines. If the force
  *			argument is true, it tells the driver to
@@ -499,7 +503,9 @@ struct irq_chip {
 	void		(*irq_ack)(struct irq_data *data);
 	void		(*irq_mask)(struct irq_data *data);
 	void		(*irq_mask_ack)(struct irq_data *data);
+	void		(*irq_mask_dev)(struct irq_data *data);
 	void		(*irq_unmask)(struct irq_data *data);
+	void		(*irq_unmask_dev)(struct irq_data *data);
 	void		(*irq_eoi)(struct irq_data *data);
 
 	int		(*irq_set_affinity)(struct irq_data *data, const struct cpumask *dest, bool force);

next prev parent reply	other threads:[~2026-01-09 17:01 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-17 10:30 [PATCH] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag Marc Zyngier
2025-05-17 19:59 ` Thomas Gleixner
2025-05-23  9:06   ` Marc Zyngier
2025-06-30  8:59     ` Thomas Gleixner
2025-09-03 14:04       ` [patch 0/2] PCI/MSI: Avoid PCI level masking during normal operation if requested Thomas Gleixner
2025-09-03 14:04         ` [patch 1/2] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag Thomas Gleixner
2025-09-09 12:47           ` [tip: irq/drivers] " tip-bot2 for Marc Zyngier
2025-12-20 19:31           ` [patch 1/2] " Luigi Rizzo
2025-12-21 11:55             ` Marc Zyngier
2025-12-21 12:41               ` Luigi Rizzo
2025-12-22 16:16                 ` Marc Zyngier
2026-01-08 21:32                   ` Thomas Gleixner
2026-01-08 21:55                     ` Luigi Rizzo
2026-01-09 12:20                       ` Thomas Gleixner
2026-01-09 13:00                         ` Luigi Rizzo
2026-01-09 17:01                           ` Thomas Gleixner [this message]
2025-09-03 14:04         ` [patch 2/2] PCI/MSI: Remove the conditional parent [un]mask logic Thomas Gleixner
2025-09-03 17:38           ` Bjorn Helgaas
2025-09-09 12:47           ` [tip: irq/drivers] " tip-bot2 for Thomas Gleixner
2025-09-09 10:21         ` [patch 0/2] PCI/MSI: Avoid PCI level masking during normal operation if requested Marc Zyngier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jyxqiuta.ffs@tglx \
    --to=tglx@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lrizzo@google.com \
    --cc=maz@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox