From: Thomas Gleixner
To: Luigi Rizzo
Cc: Marc Zyngier, bhelgaas@google.com, linux-kernel@vger.kernel.org
Subject: Re: [patch 1/2] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag
References: <20250903135433.380783272@linutronix.de>
 <20251220193120.3339162-1-lrizzo@google.com>
 <87tsxkdp6s.wl-maz@kernel.org> <86ms3amqzm.wl-maz@kernel.org>
 <871pjzu6wl.ffs@tglx> <878qe7rn9c.ffs@tglx>
Date: Fri, 09 Jan 2026 18:01:53 +0100
Message-ID: <87jyxqiuta.ffs@tglx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 09 2026 at 14:00, Luigi Rizzo wrote:
> On Fri, Jan 9, 2026 at 1:20 PM Thomas Gleixner wrote:
>> That's not the point. The point is that _after_ the mask has reached
>> the device it is guaranteed that no further interrupts are generated
>> until unmask() is invoked. That's what is required e.g. for writing
>> the MSI message to the device.

Actually I have to correct myself. We stopped doing the readback on
mask() in pci_write_msg_msix() because the writes are guaranteed to be
ordered, so the device will see them in the correct order:

     PCI_MSIX_ENTRY_VECTOR_CTRL    // mask
     PCI_MSIX_ENTRY_LOWER_ADDR     // MSI message address low
     PCI_MSIX_ENTRY_UPPER_ADDR     // MSI message address high
     PCI_MSIX_ENTRY_DATA           // MSI message data
     PCI_MSIX_ENTRY_VECTOR_CTRL    // unmask

which in turn satisfies the specification requirement that the MSI-X
entry has to be masked in order to change the message:

  "If software changes the Address or Data value of an entry while the
   entry is unmasked, the result is undefined."

So yes, for the affinity change case, the readback is not required and
is not done anymore.
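To make that ordering concrete, here is a tiny userspace model of one MSI-X table entry. The struct, the helper and the MASKBIT constant are made up for illustration only; just the register names mirror the PCI_MSIX_ENTRY_* offsets above. Real hardware sees these as ordered MMIO writes:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative model of one MSI-X table entry; not kernel code. */
struct msix_entry_regs {
	uint32_t addr_lo;	/* PCI_MSIX_ENTRY_LOWER_ADDR  */
	uint32_t addr_hi;	/* PCI_MSIX_ENTRY_UPPER_ADDR  */
	uint32_t data;		/* PCI_MSIX_ENTRY_DATA        */
	uint32_t vector_ctrl;	/* PCI_MSIX_ENTRY_VECTOR_CTRL */
};

#define MSIX_CTRL_MASKBIT	0x1u	/* per-vector mask bit */

/*
 * Write a new message while the entry is masked, in exactly the
 * ordered sequence from the mail: mask, address low/high, data, unmask.
 * Because the writes are ordered, the device never observes an
 * unmasked entry with a half-updated message.
 */
static void msix_write_msg(struct msix_entry_regs *e,
			   uint32_t lo, uint32_t hi, uint32_t data)
{
	e->vector_ctrl |= MSIX_CTRL_MASKBIT;	/* mask   */
	e->addr_lo = lo;
	e->addr_hi = hi;
	e->data = data;
	e->vector_ctrl &= ~MSIX_CTRL_MASKBIT;	/* unmask */
}
```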
The readback was there in the initial implementation and it turned out
to be there purely for paranoia's sake. Masking is still required, but
that's achieved through the write ordering.

> I keep hearing about this guarantee, which is surely true, but I argue
> that it is useless both during normal activity (when interrupts may be
> occasionally masked during affinity changes) and at shutdown (when
> there is a lot more additional state cleanup done on the device, which
> ensures that there are no more interrupts, save the one(s) happening
> between the mask() write and readback, which we have no way to block
> anyway (and which may fire on the CPU at some unspecified time, even
> after the readback, because that does not control the pipeline from
> the device to the interrupt controller etc.), and that the kernel
> stack is well equipped to handle and ignore.

There is a difference, and it has nothing to do with other device
cleanups at all. That's a pure interrupt management problem:

     free_irq()
        1) mask()
        2) synchronize_irq()
        3) invalidate vector

So if #1 does not guarantee that the interrupt is masked, then #2 is
not guaranteed to see a pending interrupt, and #3 can invalidate the
vector before the interrupt becomes pending and is raised in a CPU,
which then results in a spurious interrupt warning or, in the worst
case, an IOMMU exception. While that's "handled" by the kernel somewhat
gracefully, it's not ignored, because it's invalid state.

The readback after mask() guarantees that an interrupt which was raised
_before_ the mask write reached the device has arrived at its
destination and is therefore detectable as pending or in progress
_before_ the vector is invalidated. That's simply because the interrupt
message precedes the read reply and the CPU is stalled until the read
comes back.

> The only, weak, justification that Gemini gives for masking is that

Groan....
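Coming back to the free_irq() ordering: a toy single-threaded model of that race, with all names invented for illustration. "in_flight" stands for a message the device sent just before the mask write reached it, and the readback is modeled as flushing that message to "pending" before the read returns, which is exactly why step #2 is then guaranteed to see it:

```c
#include <stdbool.h>
#include <assert.h>

/* Toy model of the free_irq() teardown race; not kernel code. */
struct toy_irq {
	bool masked;
	bool in_flight;		/* message posted by device, not yet delivered */
	bool pending;		/* message arrived at the interrupt controller */
	bool vector_valid;
};

static void toy_mask(struct toy_irq *irq)
{
	irq->masked = true;	/* posted MMIO write, no completion required */
}

static void toy_readback(struct toy_irq *irq)
{
	/*
	 * A read reply cannot overtake a prior message from the device,
	 * so anything in flight lands before the read comes back and the
	 * CPU is unstalled.
	 */
	if (irq->in_flight) {
		irq->in_flight = false;
		irq->pending = true;
	}
}

/* Returns whether the in-flight interrupt was caught before invalidation */
static bool toy_free_irq(struct toy_irq *irq)
{
	bool saw_pending;

	toy_mask(irq);			/* 1) mask()                     */
	toy_readback(irq);		/*    flush: in-flight -> pending */
	saw_pending = irq->pending;	/* 2) synchronize_irq()          */
	irq->pending = false;
	irq->vector_valid = false;	/* 3) invalidate vector          */
	return saw_pending;
}
```

Without the toy_readback() step, the in-flight message would still be undelivered at step #3 and would hit an already invalidated vector.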
> Again, I have seen no explanation so far on why the mask requires a
> guarantee, and as I have tried to explain, the unavoidable stray
> interrupts generated before the readback are as bad (or actually as
> harmless) as the ones that we avoid with the readback.

Stray interrupts are only harmless when the interrupt is in a
functional state, but _not_ when the required resources are no longer
available.

So instead of this unholy per-CPU state and penalizing the VIRT case,
which Marc is concerned about, something like the uncompiled patch
below should just work. Then you can use these new callbacks for your
mitigation muck and everything else is unaffected.

Though this mitigation stuff will have serious implications when
running in a VM because the MMIO write will be intercepted, but that's
a different story.

What might be worth investigating is whether the irq_mask() callback
for MSI-X can use the lazy mechanism (without readback). But that's an
orthogonal issue and needs some thought and testing.
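For clarity, this is roughly how a caller could prefer such optional device-level callbacks and fall back to the ordinary mask() when a chip does not provide them. Cut-down illustrative structs and stubs, not the actual irq core code:

```c
#include <stddef.h>
#include <assert.h>

/* Cut down to the two relevant members; illustrative only. */
struct toy_chip {
	void (*irq_mask)(int *reg);	/* ordinary mask, flushes to device */
	void (*irq_mask_dev)(int *reg);	/* optional lazy device-level mask  */
};

static void mask_flush(int *reg)
{
	*reg |= 1;	/* a real chip would follow up with a readl() flush */
}

static void mask_lazy(int *reg)
{
	*reg |= 1;	/* posted write only, no readback */
}

/* Prefer the cheap device-level callback for moderation, else fall back */
static void toy_mask_for_moderation(const struct toy_chip *chip, int *reg)
{
	if (chip->irq_mask_dev)
		chip->irq_mask_dev(reg);
	else
		chip->irq_mask(reg);
}
```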
Thanks,

        tglx
---
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -162,6 +162,11 @@ static void pci_irq_mask_msix(struct irq
 	pci_msix_mask(irq_data_get_msi_desc(data));
 }
 
+static void pci_irq_mask_msix_lazy(struct irq_data *data)
+{
+	pci_msix_mask_lazy(irq_data_get_msi_desc(data));
+}
+
 static void pci_irq_unmask_msix(struct irq_data *data)
 {
 	pci_msix_unmask(irq_data_get_msi_desc(data));
@@ -182,7 +187,9 @@ static const struct msi_domain_template
 		.irq_startup		= pci_irq_startup_msix,
 		.irq_shutdown		= pci_irq_shutdown_msix,
 		.irq_mask		= pci_irq_mask_msix,
+		.irq_mask_dev		= pci_irq_mask_msix_lazy,
 		.irq_unmask		= pci_irq_unmask_msix,
+		.irq_unmask_dev		= pci_irq_unmask_msix,
 		.irq_write_msi_msg	= pci_msi_domain_write_msg,
 		.flags			= IRQCHIP_ONESHOT_SAFE,
 	},
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -40,10 +40,15 @@ static inline void pci_msix_write_vector
 	writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
 }
 
-static inline void pci_msix_mask(struct msi_desc *desc)
+static inline void pci_msix_mask_lazy(struct msi_desc *desc)
 {
 	desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
+}
+
+static inline void pci_msix_mask(struct msi_desc *desc)
+{
+	pci_msix_mask_lazy(desc);
 	/* Flush write to device */
 	readl(desc->pci.mask_base);
 }
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -451,7 +451,11 @@ static inline irq_hw_number_t irqd_to_hw
  * @irq_ack:		start of a new interrupt
  * @irq_mask:		mask an interrupt source
  * @irq_mask_ack:	ack and mask an interrupt source
+ * @irq_mask_dev:	Optional callback to mask at the device level for
+ *			interrupt moderation purposes. Only valid for the
+ *			outermost interrupt chip in a hierarchy.
  * @irq_unmask:		unmask an interrupt source
+ * @irq_unmask_dev:	Counterpart to @irq_mask_dev (required when
+ *			@irq_mask_dev is not NULL)
  * @irq_eoi:		end of interrupt
  * @irq_set_affinity:	Set the CPU affinity on SMP machines. If the force
  *			argument is true, it tells the driver to
@@ -499,7 +503,9 @@ struct irq_chip {
 	void		(*irq_ack)(struct irq_data *data);
 	void		(*irq_mask)(struct irq_data *data);
 	void		(*irq_mask_ack)(struct irq_data *data);
+	void		(*irq_mask_dev)(struct irq_data *data);
 	void		(*irq_unmask)(struct irq_data *data);
+	void		(*irq_unmask_dev)(struct irq_data *data);
 	void		(*irq_eoi)(struct irq_data *data);
 
 	int		(*irq_set_affinity)(struct irq_data *data, const struct cpumask *dest, bool force);