* [PATCH] amd/iommu: do not split domain flushes when flushing the entire range
@ 2026-03-04 21:30 Josef Bacik
2026-03-12 13:40 ` Jason Gunthorpe
2026-03-24 20:14 ` Josef Bacik
0 siblings, 2 replies; 4+ messages in thread
From: Josef Bacik @ 2026-03-04 21:30 UTC (permalink / raw)
To: joro, iommu, linux-kernel; +Cc: stable
We are hitting the following soft lockup in production on v6.6 and
v6.12, but the bug exists in all versions:
watchdog: BUG: soft lockup - CPU#24 stuck for 31s! [tokio-runtime-w:1274919]
CPU: 24 PID: 1274919 Comm: tokio-runtime-w Not tainted 6.6.105+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:__raw_spin_unlock_irqrestore+0x21/0x30
Call Trace:
<TASK>
amd_iommu_attach_device+0x69/0x450
__iommu_device_set_domain+0x7b/0x190
__iommu_group_set_core_domain+0x61/0xd0
iommu_detach_group+0x27/0x40
vfio_iommu_type1_detach_group+0x157/0x780 [vfio_iommu_type1]
vfio_group_detach_container+0x59/0x160 [vfio]
vfio_group_fops_release+0x4d/0x90 [vfio]
__fput+0x95/0x2a0
task_work_run+0x93/0xc0
do_exit+0x321/0x950
do_group_exit+0x7f/0xa0
get_signal+0x77d/0x780
</TASK>
This occurs because we're a VM (so amd_iommu_np_cache is set), and the
special size CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
amd_iommu_domain_flush_tlb_pde() is being split into a long series of
smaller, naturally aligned flushes. Each one of those flushes traps
into the host, all while we hold the domain lock with IRQs disabled.
Fix this by not splitting up this special size and instead sending the
whole command in one go, so perhaps the host will decide to be gracious
and not spend 7 business years doing the flush.
cc: stable@vger.kernel.org
Fixes: a270be1b3fdf ("iommu/amd: Use only natural aligned flushes in a VM")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
drivers/iommu/amd/iommu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 81c4d7733872..f0d3e06734ef 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1769,7 +1769,8 @@ void amd_iommu_domain_flush_pages(struct protection_domain *domain,
 {
 	lockdep_assert_held(&domain->lock);
 
-	if (likely(!amd_iommu_np_cache)) {
+	if (likely(!amd_iommu_np_cache) ||
+	    size == CMD_INV_IOMMU_ALL_PAGES_ADDRESS) {
 		__domain_flush_pages(domain, address, size);
 
 		/* Wait until IOMMU TLB and all device IOTLB flushes are complete */
--
2.53.0
* Re: [PATCH] amd/iommu: do not split domain flushes when flushing the entire range
2026-03-04 21:30 [PATCH] amd/iommu: do not split domain flushes when flushing the entire range Josef Bacik
@ 2026-03-12 13:40 ` Jason Gunthorpe
2026-03-14 18:24 ` Josef Bacik
2026-03-24 20:14 ` Josef Bacik
1 sibling, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2026-03-12 13:40 UTC (permalink / raw)
To: Josef Bacik; +Cc: joro, iommu, linux-kernel, stable
On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> We are hitting the following soft lockup in production on v6.6 and
> v6.12, but the bug exists in all versions
>
> watchdog: BUG: soft lockup - CPU#24 stuck for 31s! [tokio-runtime-w:1274919]
> CPU: 24 PID: 1274919 Comm: tokio-runtime-w Not tainted 6.6.105+ #1
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
> RIP: 0010:__raw_spin_unlock_irqrestore+0x21/0x30
> Call Trace:
> <TASK>
> amd_iommu_attach_device+0x69/0x450
> __iommu_device_set_domain+0x7b/0x190
> __iommu_group_set_core_domain+0x61/0xd0
> iommu_detach_group+0x27/0x40
> vfio_iommu_type1_detach_group+0x157/0x780 [vfio_iommu_type1]
> vfio_group_detach_container+0x59/0x160 [vfio]
> vfio_group_fops_release+0x4d/0x90 [vfio]
> __fput+0x95/0x2a0
> task_work_run+0x93/0xc0
> do_exit+0x321/0x950
> do_group_exit+0x7f/0xa0
> get_signal+0x77d/0x780
> </TASK>
>
> This occurs because we're a VM and we're splitting up the size
> CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
This function doesn't exist in the upstream kernel anymore, and the
new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
all, AFAIK.
Your patch makes sense, but it needs to go to stable only somehow.
Jason
* Re: [PATCH] amd/iommu: do not split domain flushes when flushing the entire range
2026-03-12 13:40 ` Jason Gunthorpe
@ 2026-03-14 18:24 ` Josef Bacik
0 siblings, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2026-03-14 18:24 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: joro, iommu, linux-kernel, stable
On Thu, Mar 12, 2026 at 9:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> > We are hitting the following soft lockup in production on v6.6 and
> > v6.12, but the bug exists in all versions
> >
> > watchdog: BUG: soft lockup - CPU#24 stuck for 31s! [tokio-runtime-w:1274919]
> > CPU: 24 PID: 1274919 Comm: tokio-runtime-w Not tainted 6.6.105+ #1
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
> > RIP: 0010:__raw_spin_unlock_irqrestore+0x21/0x30
> > Call Trace:
> > <TASK>
> > amd_iommu_attach_device+0x69/0x450
> > __iommu_device_set_domain+0x7b/0x190
> > __iommu_group_set_core_domain+0x61/0xd0
> > iommu_detach_group+0x27/0x40
> > vfio_iommu_type1_detach_group+0x157/0x780 [vfio_iommu_type1]
> > vfio_group_detach_container+0x59/0x160 [vfio]
> > vfio_group_fops_release+0x4d/0x90 [vfio]
> > __fput+0x95/0x2a0
> > task_work_run+0x93/0xc0
> > do_exit+0x321/0x950
> > do_group_exit+0x7f/0xa0
> > get_signal+0x77d/0x780
> > </TASK>
> >
> > This occurs because we're a VM and we're splitting up the size
> > CMD_INV_IOMMU_ALL_PAGES_ADDRESS we get from
> > amd_iommu_domain_flush_tlb_pde() into a bunch of smaller flushes.
>
> This function doesn't exist in the upstream kernel anymore, and the
> new code doesn't generate CMD_INV_IOMMU_ALL_PAGES_ADDRESS flushes at
> all, AFAIK.
This was based on linus/master as of March 4th, and we get here via
amd_iommu_flush_tlb_all, which definitely still exists, so what
specifically are you talking about? Thanks,
Josef
* Re: [PATCH] amd/iommu: do not split domain flushes when flushing the entire range
2026-03-04 21:30 [PATCH] amd/iommu: do not split domain flushes when flushing the entire range Josef Bacik
2026-03-12 13:40 ` Jason Gunthorpe
@ 2026-03-24 20:14 ` Josef Bacik
1 sibling, 0 replies; 4+ messages in thread
From: Josef Bacik @ 2026-03-24 20:14 UTC (permalink / raw)
To: joro, iommu, linux-kernel, suravee.suthikulpanit; +Cc: stable
On Wed, Mar 04, 2026 at 04:30:03PM -0500, Josef Bacik wrote:
> We are hitting the following soft lockup in production on v6.6 and
> v6.12, but the bug exists in all versions
>
Can I get this reviewed/merged? I'm hitting this softlockup hundreds of times a
day in production and I need it in stable so I can have it backported to our
kernels. Thanks,
Josef