* [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
@ 2025-07-09 6:28 Lu Baolu
2025-07-09 15:29 ` Dave Hansen
` (3 more replies)
0 siblings, 4 replies; 38+ messages in thread
From: Lu Baolu @ 2025-07-09 6:28 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, Lu Baolu, stable
The vmalloc() and vfree() functions manage virtually contiguous, but not
necessarily physically contiguous, kernel memory regions. When vfree()
unmaps such a region, it tears down the associated kernel page table
entries and frees the physical pages.
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. Architectures like x86 share
static kernel address mappings across all user page tables, allowing the
IOMMU to access the kernel portion of these tables.
Modern IOMMUs often cache page table entries to optimize walk performance,
even for intermediate page table levels. If kernel page table mappings are
changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
entries, a use-after-free (UAF) condition arises. If these
freed page table pages are reallocated for a different purpose, potentially
by an attacker, the IOMMU could misinterpret the new data as valid page
table entries. This allows the IOMMU to walk into attacker-controlled
memory, leading to arbitrary physical memory DMA access or privilege
escalation.
To mitigate this, introduce a new iommu interface to flush IOMMU caches
and fence pending page table walks when kernel page mappings are updated.
This interface should be invoked from architecture-specific code that
manages combined user and kernel page tables.
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Reported-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
---
arch/x86/mm/tlb.c | 2 ++
drivers/iommu/iommu-sva.c | 34 +++++++++++++++++++++++++++++++++-
include/linux/iommu.h | 4 ++++
3 files changed, 39 insertions(+), 1 deletion(-)
Change log:
v2:
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the
critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..a41499dfdc3f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <linux/task_work.h>
#include <linux/mmu_notifier.h>
#include <linux/mmu_context.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
kernel_tlb_flush_range(info);
put_flush_tlb_info();
+ iommu_sva_invalidate_kva_range(start, end);
}
/*
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..fd76aefa5a88 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,9 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static DEFINE_STATIC_KEY_FALSE(iommu_sva_present);
+static LIST_HEAD(iommu_sva_mms);
+static DEFINE_SPINLOCK(iommu_mms_lock);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +45,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
+ if (list_empty(&iommu_sva_mms))
+ static_branch_enable(&iommu_sva_present);
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ static_branch_disable(&iommu_sva_present);
+ }
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct iommu_mm_data *iommu_mm;
+
+ if (!static_branch_unlikely(&iommu_sva_present))
+ return;
+
+ guard(spinlock_irqsave)(&iommu_mms_lock);
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..31330c12b8ee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1090,7 +1090,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1571,6 +1573,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1595,6 +1598,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
--
2.43.0
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-09 6:28 [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu
@ 2025-07-09 15:29 ` Dave Hansen
2025-07-10 2:14 ` Baolu Lu
2025-07-10 3:02 ` Tian, Kevin
` (2 subsequent siblings)
3 siblings, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-07-09 15:29 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, stable
On 7/8/25 23:28, Lu Baolu wrote:
> Modern IOMMUs often cache page table entries to optimize walk performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
The approach here is certainly conservative and simple. It's also not
going to cause big problems on systems without fancy IOMMUs.
But I am a _bit_ worried that it's _too_ conservative. The changelog
talks about page table page freeing, but the actual code:
> @@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> kernel_tlb_flush_range(info);
>
> put_flush_tlb_info();
> + iommu_sva_invalidate_kva_range(start, end);
> }
is in a very generic TLB flushing spot that's used for a lot more than
just freeing page tables.
If the problem is truly limited to freeing page tables, it needs to be
commented appropriately.
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-09 15:29 ` Dave Hansen
@ 2025-07-10 2:14 ` Baolu Lu
2025-07-10 2:55 ` Tian, Kevin
2025-07-10 12:53 ` Dave Hansen
0 siblings, 2 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-10 2:14 UTC (permalink / raw)
To: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, stable
On 7/9/25 23:29, Dave Hansen wrote:
> On 7/8/25 23:28, Lu Baolu wrote:
>> Modern IOMMUs often cache page table entries to optimize walk performance,
>> even for intermediate page table levels. If kernel page table mappings are
>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>> entries, Use-After-Free (UAF) vulnerability condition arises. If these
>> freed page table pages are reallocated for a different purpose, potentially
>> by an attacker, the IOMMU could misinterpret the new data as valid page
>> table entries. This allows the IOMMU to walk into attacker-controlled
>> memory, leading to arbitrary physical memory DMA access or privilege
>> escalation.
>
> The approach here is certainly conservative and simple. It's also not
> going to cause big problems on systems without fancy IOMMUs.
>
> But I am a _bit_ worried that it's _too_ conservative. The changelog
> talks about page table page freeing, but the actual code:
>
>> @@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> kernel_tlb_flush_range(info);
>>
>> put_flush_tlb_info();
>> + iommu_sva_invalidate_kva_range(start, end);
>> }
>
> is in a very generic TLB flushing spot that's used for a lot more than
> just freeing page tables.
>
> If the problem is truly limited to freeing page tables, it needs to be
> commented appropriately.
Yeah, good comments. It should not be limited to freeing page tables;
freeing page tables is just a real case that we can see in the vmalloc/
vfree paths. Theoretically, whenever a kernel page table update is done
and the CPU TLB needs to be flushed, the secondary TLB (i.e., the caches
on the IOMMU) should be flushed accordingly. It's assumed that this
happens in flush_tlb_kernel_range().
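As a rough sketch of that rule (illustrative only; vunmap_range_noflush()
is the internal mm/vmalloc.c helper that tears down kernel PTEs, and the
real vfree() call chain is more involved than this):

/*
 * Simplified shape of the vunmap path: a kernel page table update
 * followed by the CPU TLB flush. With this patch, the same
 * flush_tlb_kernel_range() call also invalidates the IOMMU's
 * paging-structure caches via iommu_sva_invalidate_kva_range().
 */
static void kva_unmap_sketch(unsigned long addr, unsigned long size)
{
	/* Tear down the kernel PTEs for this region. */
	vunmap_range_noflush(addr, addr + size);

	/* CPU TLB flush; now also fans out to every SVA-bound mm. */
	flush_tlb_kernel_range(addr, addr + size);
}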
Thanks,
baolu
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 2:14 ` Baolu Lu
@ 2025-07-10 2:55 ` Tian, Kevin
2025-07-10 12:53 ` Dave Hansen
1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2025-07-10 2:55 UTC (permalink / raw)
To: Baolu Lu, Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Thursday, July 10, 2025 10:15 AM
>
> On 7/9/25 23:29, Dave Hansen wrote:
> > On 7/8/25 23:28, Lu Baolu wrote:
> >> Modern IOMMUs often cache page table entries to optimize walk
> performance,
> >> even for intermediate page table levels. If kernel page table mappings are
> >> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> >> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> >> freed page table pages are reallocated for a different purpose, potentially
> >> by an attacker, the IOMMU could misinterpret the new data as valid page
> >> table entries. This allows the IOMMU to walk into attacker-controlled
> >> memory, leading to arbitrary physical memory DMA access or privilege
> >> escalation.
> >
> > The approach here is certainly conservative and simple. It's also not
> > going to cause big problems on systems without fancy IOMMUs.
> >
> > But I am a _bit_ worried that it's _too_ conservative. The changelog
> > talks about page table page freeing, but the actual code:
> >
> >> @@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start,
> unsigned long end)
> >> kernel_tlb_flush_range(info);
> >>
> >> put_flush_tlb_info();
> >> + iommu_sva_invalidate_kva_range(start, end);
> >> }
> >
> > is in a very generic TLB flushing spot that's used for a lot more than
> > just freeing page tables.
> >
> > If the problem is truly limited to freeing page tables, it needs to be
> > commented appropriately.
>
> Yeah, good comments. It should not be limited to freeing page tables;
> freeing page tables is just a real case that we can see in the vmalloc/
> vfree paths. Theoretically, whenever a kernel page table update is done
> and the CPU TLB needs to be flushed, the secondary TLB (i.e., the caches
> on the IOMMU) should be flushed accordingly. It's assumed that this
> happens in flush_tlb_kernel_range().
>
This conservative approach sounds safer - even if we overlooked some
threat beyond page table freeing, doing the invalidation in this
function is sufficient to mitigate it.
But as Dave suggested, let's add a comment above it to clarify the
motivation for doing so.
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-09 6:28 [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu
2025-07-09 15:29 ` Dave Hansen
@ 2025-07-10 3:02 ` Tian, Kevin
2025-07-10 8:11 ` Yu Zhang
2025-07-10 13:54 ` Peter Zijlstra
2025-07-16 10:54 ` Yi Liu
3 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2025-07-10 3:02 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Wednesday, July 9, 2025 2:28 PM
>
> The vmalloc() and vfree() functions manage virtually contiguous, but not
> necessarily physically contiguous, kernel memory regions. When vfree()
> unmaps such a region, it tears down the associated kernel page table
> entries and frees the physical pages.
>
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU
> hardware
> shares and walks the CPU's page tables. Architectures like x86 share
> static kernel address mappings across all user page tables, allowing the
I'd remove 'static'
> IOMMU to access the kernel portion of these tables.
>
> Modern IOMMUs often cache page table entries to optimize walk
> performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
This lacks the background that, currently, the iommu driver is notified
only of changes to user VA mappings, so the IOMMU's internal caches
may retain stale entries for kernel VA.
>
> To mitigate this, introduce a new iommu interface to flush IOMMU caches
> and fence pending page table walks when kernel page mappings are updated.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables.
this also needs some words about the fact that new flushes are triggered
not just for freeing page tables.
>
> static DEFINE_MUTEX(iommu_sva_lock);
> +static DEFINE_STATIC_KEY_FALSE(iommu_sva_present);
> +static LIST_HEAD(iommu_sva_mms);
> +static DEFINE_SPINLOCK(iommu_mms_lock);
s/iommu_mms_lock/iommu_mm_lock/
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 3:02 ` Tian, Kevin
@ 2025-07-10 8:11 ` Yu Zhang
2025-07-10 8:15 ` Tian, Kevin
0 siblings, 1 reply; 38+ messages in thread
From: Yu Zhang @ 2025-07-10 8:11 UTC (permalink / raw)
To: Tian, Kevin
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
On Thu, Jul 10, 2025 at 03:02:07AM +0000, Tian, Kevin wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> > Sent: Wednesday, July 9, 2025 2:28 PM
> >
> > The vmalloc() and vfree() functions manage virtually contiguous, but not
> > necessarily physically contiguous, kernel memory regions. When vfree()
> > unmaps such a region, it tears down the associated kernel page table
> > entries and frees the physical pages.
> >
> > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU
> > hardware
> > shares and walks the CPU's page tables. Architectures like x86 share
> > static kernel address mappings across all user page tables, allowing the
>
> I'd remove 'static'
>
> > IOMMU to access the kernel portion of these tables.
> >
> > Modern IOMMUs often cache page table entries to optimize walk
> > performance,
> > even for intermediate page table levels. If kernel page table mappings are
> > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> > entries, Use-After-Free (UAF) vulnerability condition arises. If these
> > freed page table pages are reallocated for a different purpose, potentially
> > by an attacker, the IOMMU could misinterpret the new data as valid page
> > table entries. This allows the IOMMU to walk into attacker-controlled
> > memory, leading to arbitrary physical memory DMA access or privilege
> > escalation.
>
> this lacks of a background that currently the iommu driver is notified
> only for changes of user VA mappings, so the IOMMU's internal caches
> may retain stale entries for kernel VA.
>
> >
> > To mitigate this, introduce a new iommu interface to flush IOMMU caches
> > and fence pending page table walks when kernel page mappings are updated.
> > This interface should be invoked from architecture-specific code that
> > manages combined user and kernel page tables.
>
> this also needs some words about the fact that new flushes are triggered
> not just for freeing page tables.
>
Thank you, Kevin. A question about the background of this issue:
My understanding of the attack scenario is: a malicious user application
could initiate DMAs to some vmalloc'ed address, causing the paging-structure
cache to be loaded and then possibly used after that paging structure
is freed (it may be allocated to some other user later).
If that is the case, we only need to do the flush when the paging structures
are freed. I mean, the IOTLB entries may not be loaded at all when the
permission check fails. Did I miss anything? :)
B.R.
Yu
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 8:11 ` Yu Zhang
@ 2025-07-10 8:15 ` Tian, Kevin
2025-07-10 9:37 ` Yu Zhang
0 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2025-07-10 8:15 UTC (permalink / raw)
To: Yu Zhang
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Yu Zhang <zhangyu1@linux.microsoft.com>
> Sent: Thursday, July 10, 2025 4:11 PM
>
> On Thu, Jul 10, 2025 at 03:02:07AM +0000, Tian, Kevin wrote:
> > > From: Lu Baolu <baolu.lu@linux.intel.com>
> > > Sent: Wednesday, July 9, 2025 2:28 PM
> > >
> > > The vmalloc() and vfree() functions manage virtually contiguous, but not
> > > necessarily physically contiguous, kernel memory regions. When vfree()
> > > unmaps such a region, it tears down the associated kernel page table
> > > entries and frees the physical pages.
> > >
> > > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU
> > > hardware
> > > shares and walks the CPU's page tables. Architectures like x86 share
> > > static kernel address mappings across all user page tables, allowing the
> >
> > I'd remove 'static'
> >
> > > IOMMU to access the kernel portion of these tables.
> > >
> > > Modern IOMMUs often cache page table entries to optimize walk
> > > performance,
> > > even for intermediate page table levels. If kernel page table mappings are
> > > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> > > entries, Use-After-Free (UAF) vulnerability condition arises. If these
> > > freed page table pages are reallocated for a different purpose, potentially
> > > by an attacker, the IOMMU could misinterpret the new data as valid page
> > > table entries. This allows the IOMMU to walk into attacker-controlled
> > > memory, leading to arbitrary physical memory DMA access or privilege
> > > escalation.
> >
> > this lacks of a background that currently the iommu driver is notified
> > only for changes of user VA mappings, so the IOMMU's internal caches
> > may retain stale entries for kernel VA.
> >
> > >
> > > To mitigate this, introduce a new iommu interface to flush IOMMU caches
> > > and fence pending page table walks when kernel page mappings are
> updated.
> > > This interface should be invoked from architecture-specific code that
> > > manages combined user and kernel page tables.
> >
> > this also needs some words about the fact that new flushes are triggered
> > not just for freeing page tables.
> >
> Thank you, Kevin. A question about the background of this issue:
>
> My understanding of the attacking scenario is, a malicious user application
> could initiate DMAs to some vmalloced address, causing the paging structure
> cache being loaded and then possibly being used after that paging structure
> is freed(may be allocated to some other users later).
>
> If that is the case, only when the paging structures are freed, do we need
> to do the flush. I mean, the IOTLB entries may not be loaded at all when the
> permission check failes. Did I miss anything? :)
>
It's about the paging structure cache instead of IOTLB.
You may look at the discussion in v1 for more background, especially
the latest reply from Baolu about a detailed example:
https://lore.kernel.org/linux-iommu/2080aaea-0d6e-418e-8391-ddac9b39c109@linux.intel.com/
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 8:15 ` Tian, Kevin
@ 2025-07-10 9:37 ` Yu Zhang
0 siblings, 0 replies; 38+ messages in thread
From: Yu Zhang @ 2025-07-10 9:37 UTC (permalink / raw)
To: Tian, Kevin
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
On Thu, Jul 10, 2025 at 08:15:27AM +0000, Tian, Kevin wrote:
> > From: Yu Zhang <zhangyu1@linux.microsoft.com>
> > Sent: Thursday, July 10, 2025 4:11 PM
> >
> > On Thu, Jul 10, 2025 at 03:02:07AM +0000, Tian, Kevin wrote:
> > > > From: Lu Baolu <baolu.lu@linux.intel.com>
> > > > Sent: Wednesday, July 9, 2025 2:28 PM
> > > >
> > > > The vmalloc() and vfree() functions manage virtually contiguous, but not
> > > > necessarily physically contiguous, kernel memory regions. When vfree()
> > > > unmaps such a region, it tears down the associated kernel page table
> > > > entries and frees the physical pages.
> > > >
> > > > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU
> > > > hardware
> > > > shares and walks the CPU's page tables. Architectures like x86 share
> > > > static kernel address mappings across all user page tables, allowing the
> > >
> > > I'd remove 'static'
> > >
> > > > IOMMU to access the kernel portion of these tables.
> > > >
> > > > Modern IOMMUs often cache page table entries to optimize walk
> > > > performance,
> > > > even for intermediate page table levels. If kernel page table mappings are
> > > > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> > > > entries, Use-After-Free (UAF) vulnerability condition arises. If these
> > > > freed page table pages are reallocated for a different purpose, potentially
> > > > by an attacker, the IOMMU could misinterpret the new data as valid page
> > > > table entries. This allows the IOMMU to walk into attacker-controlled
> > > > memory, leading to arbitrary physical memory DMA access or privilege
> > > > escalation.
> > >
> > > this lacks of a background that currently the iommu driver is notified
> > > only for changes of user VA mappings, so the IOMMU's internal caches
> > > may retain stale entries for kernel VA.
> > >
> > > >
> > > > To mitigate this, introduce a new iommu interface to flush IOMMU caches
> > > > and fence pending page table walks when kernel page mappings are
> > updated.
> > > > This interface should be invoked from architecture-specific code that
> > > > manages combined user and kernel page tables.
> > >
> > > this also needs some words about the fact that new flushes are triggered
> > > not just for freeing page tables.
> > >
> > Thank you, Kevin. A question about the background of this issue:
> >
> > My understanding of the attacking scenario is, a malicious user application
> > could initiate DMAs to some vmalloced address, causing the paging structure
> > cache being loaded and then possibly being used after that paging structure
> > is freed(may be allocated to some other users later).
> >
> > If that is the case, only when the paging structures are freed, do we need
> > to do the flush. I mean, the IOTLB entries may not be loaded at all when the
> > permission check failes. Did I miss anything? :)
> >
>
> It's about the paging structure cache instead of IOTLB.
>
> You may look at the discussion in v1 for more background, especially
> the latest reply from Baolu about a detailed example:
>
> https://lore.kernel.org/linux-iommu/2080aaea-0d6e-418e-8391-ddac9b39c109@linux.intel.com/
>
Thank you, Kevin. This really helps.
And about your comment that "this also needs some words about the fact that
new flushes are triggered not just for freeing page tables": do you mean this
fix is not optimal, because iommu_sva_invalidate_kva_range() is
triggered for each unmapping of a vmalloc'ed address?
Do we have any choice, e.g., not triggering the flush when no page
table (or directory, etc.) is freed?
B.R.
Yu
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 2:14 ` Baolu Lu
2025-07-10 2:55 ` Tian, Kevin
@ 2025-07-10 12:53 ` Dave Hansen
2025-07-10 13:22 ` Jason Gunthorpe
1 sibling, 1 reply; 38+ messages in thread
From: Dave Hansen @ 2025-07-10 12:53 UTC (permalink / raw)
To: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, stable
On 7/9/25 19:14, Baolu Lu wrote:
>> If the problem is truly limited to freeing page tables, it needs to be
>> commented appropriately.
>
> Yeah, good comments. It should not be limited to freeing page tables;
> freeing page tables is just a real case that we can see in the vmalloc/
> vfree paths. Theoretically, whenever a kernel page table update is done
> and the CPU TLB needs to be flushed, the secondary TLB (i.e., the caches
> on the IOMMU) should be flushed accordingly. It's assumed that this
> happens in flush_tlb_kernel_range().
Could you elaborate on this a bit further?
I thought that the IOMMU was only ever doing "user" permission walks and
never "supervisor". That's why we had the whole discussion about whether
the IOMMU would stop if it saw an upper-level entry with _PAGE_USER clear.
The reason this matters is that leaf kernel page table entries always
have _PAGE_USER clear. That means an IOMMU doing "user" permission walks
should never be able to do anything with them. Even if an IOTLB entry
was created* for a _PAGE_USER=0 PTE, the "user" permission walk will
stop when it finds that entry.
It doesn't matter if the entry is stale or not. The "user" permission
IOMMU walk can't do anything with the supervisor mapping.
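To illustrate the property being relied on here (a sketch, not code from
the patch): leaf kernel PTEs are created without _PAGE_USER, so a
user-permission walk has to refuse them whether or not the cached
intermediate entries are stale.

/*
 * Sketch: what a user-permission walk effectively requires of a leaf
 * PTE. Kernel mappings never set _PAGE_USER, so this fails for them
 * even if a stale paging-structure-cache entry led the walk there.
 */
static inline bool user_walk_may_use(pte_t pte)
{
	return pte_present(pte) && (pte_flags(pte) & _PAGE_USER);
}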
Why does this matter? We flush the CPU TLB in a bunch of different ways,
_especially_ when it's being done for kernel mappings. For example,
__flush_tlb_all() is a non-ranged kernel flush which has a completely
parallel implementation with flush_tlb_kernel_range(). Call sites that
use _it_ are unaffected by the patch here.
Basically, if we're only worried about vmalloc/vfree freeing page
tables, then this patch is OK. If the problem is bigger than that, then
we need a more comprehensive patch.
* I'm not sure if the IOMMU will even create an IOTLB entry for
a supervisor-permission mapping while doing a user-permission walk.
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 12:53 ` Dave Hansen
@ 2025-07-10 13:22 ` Jason Gunthorpe
2025-07-10 15:26 ` Dave Hansen
2025-07-11 2:49 ` Tian, Kevin
0 siblings, 2 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2025-07-10 13:22 UTC (permalink / raw)
To: Dave Hansen
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On Thu, Jul 10, 2025 at 05:53:16AM -0700, Dave Hansen wrote:
> On 7/9/25 19:14, Baolu Lu wrote:
> >> If the problem is truly limited to freeing page tables, it needs to be
> >> commented appropriately.
> >
> > Yeah, good comments. It should not be limited to freeing page tables;
> > freeing page tables is just a real case that we can see in the vmalloc/
> > vfree paths. Theoretically, whenever a kernel page table update is done
> > and the CPU TLB needs to be flushed, the secondary TLB (i.e., the caches
> > on the IOMMU) should be flushed accordingly. It's assumed that this
> > happens in flush_tlb_kernel_range().
>
> Could you elaborate on this a bit further?
>
> I thought that the IOMMU was only ever doing "user" permission walks and
> never "supervisor". That's why we had the whole discussion about whether
> the IOMMU would stop if it saw an upper-level entry with _PAGE_USER clear.
Yes
> The reason this matters is that leaf kernel page table entries always
> have _PAGE_USER clear. That means an IOMMU doing "user" permission walks
> should never be able to do anything with them. Even if an IOTLB entry
> was created* for a _PAGE_USER=0 PTE, the "user" permission walk will
> stop when it finds that entry.
Yes, if the IOTLB caches a supervisor entry it doesn't matter since if
the cache hits the S/U check will still fail and it will never get
used. It is just wasting IOTLB cache space. No need to invalidate
IOTLB.
> Why does this matter? We flush the CPU TLB in a bunch of different ways,
> _especially_ when it's being done for kernel mappings. For example,
> __flush_tlb_all() is a non-ranged kernel flush which has a completely
> parallel implementation with flush_tlb_kernel_range(). Call sites that
> use _it_ are unaffected by the patch here.
>
> Basically, if we're only worried about vmalloc/vfree freeing page
> tables, then this patch is OK. If the problem is bigger than that, then
> we need a more comprehensive patch.
I think we are worried about any place that frees page tables.
> * I'm not sure if the IOMMU will even create an IOTLB entry for
> a supervisor-permission mapping while doing a user-permission walk.
It doesn't matter if it does or doesn't, it doesn't change the
argument that we don't have to invalidate on PTEs being changed.
Jason
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-09 6:28 [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu
2025-07-09 15:29 ` Dave Hansen
2025-07-10 3:02 ` Tian, Kevin
@ 2025-07-10 13:54 ` Peter Zijlstra
2025-07-10 15:53 ` Peter Zijlstra
2025-07-11 3:00 ` Baolu Lu
2025-07-16 10:54 ` Yi Liu
3 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2025-07-10 13:54 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> The vmalloc() and vfree() functions manage virtually contiguous, but not
> necessarily physically contiguous, kernel memory regions. When vfree()
> unmaps such a region, it tears down the associated kernel page table
> entries and frees the physical pages.
>
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. Architectures like x86 share
> static kernel address mappings across all user page tables, allowing the
> IOMMU to access the kernel portion of these tables.
>
> Modern IOMMUs often cache page table entries to optimize walk performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
>
> To mitigate this, introduce a new iommu interface to flush IOMMU caches
> and fence pending page table walks when kernel page mappings are updated.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables.
I must say I liked the kPTI based idea better. Having to iterate and
invalidate an unspecified number of IOMMUs from non-preemptible context
seems 'unfortunate'.
Why was this approach chosen over the kPTI one, where we keep a
page-table root that simply does not include the kernel bits, and
therefore the IOMMU will never see them (change) and we'll never have to
invalidate?
> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> if (ret)
> goto out_free_domain;
> domain->users = 1;
> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + if (list_empty(&iommu_sva_mms))
> + static_branch_enable(&iommu_sva_present);
> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> + }
> + }
> + list_add(&domain->next, &iommu_mm->sva_domains);
> out:
> refcount_set(&handle->users, 1);
> mutex_unlock(&iommu_sva_lock);
> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> list_del(&domain->next);
> iommu_domain_free(domain);
> }
> +
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + list_del(&iommu_mm->mm_list_elm);
> + if (list_empty(&iommu_sva_mms))
> + static_branch_disable(&iommu_sva_present);
> + }
> + }
> +
> mutex_unlock(&iommu_sva_lock);
> kfree(handle);
> }
This seems an odd coding style choice; why the extra unneeded
indentation? That is, what's wrong with:
if (list_empty()) {
guard(spinlock_irqsave)(&iommu_mms_lock);
list_del();
if (list_empty()
static_branch_disable();
}
> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>
> return domain;
> }
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> + struct iommu_mm_data *iommu_mm;
> +
> + if (!static_branch_unlikely(&iommu_sva_present))
> + return;
> +
> + guard(spinlock_irqsave)(&iommu_mms_lock);
> + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
> +}
This is absolutely the wrong way to use static_branch. You want them in
inline functions guarding the function call, not inside the function
call.
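For example, something along these lines (a sketch of the usual pattern,
reusing the names from this patch, not a finished interface):

/* include/linux/iommu.h -- sketch only */
DECLARE_STATIC_KEY_FALSE(iommu_sva_present);
void __iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);

static inline void iommu_sva_invalidate_kva_range(unsigned long start,
						  unsigned long end)
{
	/* Callers skip the out-of-line call when no SVA mm exists. */
	if (static_branch_unlikely(&iommu_sva_present))
		__iommu_sva_invalidate_kva_range(start, end);
}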
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 13:22 ` Jason Gunthorpe
@ 2025-07-10 15:26 ` Dave Hansen
2025-07-11 2:46 ` Tian, Kevin
` (3 more replies)
2025-07-11 2:49 ` Tian, Kevin
1 sibling, 4 replies; 38+ messages in thread
From: Dave Hansen @ 2025-07-10 15:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On 7/10/25 06:22, Jason Gunthorpe wrote:
>> Why does this matter? We flush the CPU TLB in a bunch of different ways,
>> _especially_ when it's being done for kernel mappings. For example,
>> __flush_tlb_all() is a non-ranged kernel flush which has a completely
>> parallel implementation with flush_tlb_kernel_range(). Call sites that
>> use _it_ are unaffected by the patch here.
>>
>> Basically, if we're only worried about vmalloc/vfree freeing page
>> tables, then this patch is OK. If the problem is bigger than that, then
>> we need a more comprehensive patch.
> I think we are worried about any place that frees page tables.
The two places that come to mind are the remove_memory() code and
__change_page_attr().
The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
that do flush_tlb_all(). Now that I'm looking at it, there look to be
some races between freeing page tables pages and flushing the TLB. But,
basically, if you stick to the sites in there that do flush_tlb_all()
after free_pagetable(), you should be good.
As for the __change_page_attr() code, I think the only spot you need to
hit is cpa_collapse_large_pages() and maybe the one in
__split_large_page() as well.
This is all disturbingly ad-hoc, though. The remove_memory() code needs
fixing and I'll probably go try to bring some order to the chaos in the
process of fixing it up. But that's a separate problem from this IOMMU fun.
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 13:54 ` Peter Zijlstra
@ 2025-07-10 15:53 ` Peter Zijlstra
2025-07-11 3:09 ` Baolu Lu
2025-07-16 11:57 ` David Laight
2025-07-11 3:00 ` Baolu Lu
1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2025-07-10 15:53 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On Thu, Jul 10, 2025 at 03:54:32PM +0200, Peter Zijlstra wrote:
> > @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> > if (ret)
> > goto out_free_domain;
> > domain->users = 1;
> > - list_add(&domain->next, &mm->iommu_mm->sva_domains);
> >
> > + if (list_empty(&iommu_mm->sva_domains)) {
> > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > + if (list_empty(&iommu_sva_mms))
> > + static_branch_enable(&iommu_sva_present);
> > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> > + }
> > + }
> > + list_add(&domain->next, &iommu_mm->sva_domains);
> > out:
> > refcount_set(&handle->users, 1);
> > mutex_unlock(&iommu_sva_lock);
> > @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> > list_del(&domain->next);
> > iommu_domain_free(domain);
> > }
> > +
> > + if (list_empty(&iommu_mm->sva_domains)) {
> > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > + list_del(&iommu_mm->mm_list_elm);
> > + if (list_empty(&iommu_sva_mms))
> > + static_branch_disable(&iommu_sva_present);
> > + }
> > + }
> > +
> > mutex_unlock(&iommu_sva_lock);
> > kfree(handle);
> > }
>
> This seems an odd coding style choice; why the extra unneeded
> indentation? That is, what's wrong with:
>
> if (list_empty()) {
> guard(spinlock_irqsave)(&iommu_mms_lock);
> list_del();
> if (list_empty()
> static_branch_disable();
> }
Well, for one, you can't do static_branch_{en,dis}able() from atomic
context...
Was this ever tested?
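As an illustration only (not something proposed in this thread), the usual
way around that is to decide under the spinlock and flip the key only after
dropping it, while still holding the sleepable iommu_sva_lock mutex:

	bool disable_key = false;

	scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
		list_del(&iommu_mm->mm_list_elm);
		if (list_empty(&iommu_sva_mms))
			disable_key = true;
	}
	/* Safe here: only the mutex is held, not the spinlock. */
	if (disable_key)
		static_branch_disable(&iommu_sva_present);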
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:26 ` Dave Hansen
@ 2025-07-11 2:46 ` Tian, Kevin
2025-07-11 2:54 ` Tian, Kevin
` (2 subsequent siblings)
3 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2025-07-11 2:46 UTC (permalink / raw)
To: Hansen, Dave, Jason Gunthorpe
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Hansen, Dave <dave.hansen@intel.com>
> Sent: Thursday, July 10, 2025 11:26 PM
>
> On 7/10/25 06:22, Jason Gunthorpe wrote:
> >> Why does this matter? We flush the CPU TLB in a bunch of different ways,
> >> _especially_ when it's being done for kernel mappings. For example,
> >> __flush_tlb_all() is a non-ranged kernel flush which has a completely
> >> parallel implementation with flush_tlb_kernel_range(). Call sites that
> >> use _it_ are unaffected by the patch here.
> >>
> >> Basically, if we're only worried about vmalloc/vfree freeing page
> >> tables, then this patch is OK. If the problem is bigger than that, then
> >> we need a more comprehensive patch.
> > I think we are worried about any place that frees page tables.
>
> The two places that come to mind are the remove_memory() code and
> __change_page_attr().
>
> The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
> that do flush_tlb_all(). Now that I'm looking at it, there look to be
> some races between freeing page tables pages and flushing the TLB. But,
> basically, if you stick to the sites in there that do flush_tlb_all()
> after free_pagetable(), you should be good.
>
Isn't doing flush after free_pagetable() leaving a small window for attack?
the page table is freed and may have been repurposed while the stale
paging structure cache still treats it as a page table page...
Looks like it's not unusual to see a similar pattern in other mm code:
vmemmap_split_pmd()
{
if (likely(pmd_leaf(*pmd))) {
...
} else {
pte_free_kernel(&init_mm, pgtable);
}
...
}
then the tlb flush is postponed to vmemmap_remap_range():
walk_page_range_novma();
if (walk->remap_pte && !(walk->flags & VMEMMAP_REMAP_NO_TLB_FLUSH))
flush_tlb_kernel_range(start, end);
or postponed even later if NO_TLB_FLUSH is set.
Those call sites might have been scrutinized to be safe with regard to
CPU execution flows, but I'm not sure the conditions that make them safe
also apply to the said attack via device SVA.
Somehow we may really need to re-look at the other two options
(kPTI or a shadow pgd), which solve this problem from another angle...
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 13:22 ` Jason Gunthorpe
2025-07-10 15:26 ` Dave Hansen
@ 2025-07-11 2:49 ` Tian, Kevin
1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2025-07-11 2:49 UTC (permalink / raw)
To: Jason Gunthorpe, Hansen, Dave
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, July 10, 2025 9:23 PM
>
> On Thu, Jul 10, 2025 at 05:53:16AM -0700, Dave Hansen wrote:
>
> > * I'm not sure if the IOMMU will even create an IOTLB entry for
> > a supervisor-permission mapping while doing a user-permission walk.
Early VT-d platforms may cache faulted translations, but they are
pretty old and predate the introduction of SVA in the spec.
>
> It doesn't matter if it does or doesn't, it doesn't change the
> argument that we don't have to invalidate on PTEs being changed.
>
but yes, that fact doesn't matter.
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:26 ` Dave Hansen
2025-07-11 2:46 ` Tian, Kevin
@ 2025-07-11 2:54 ` Tian, Kevin
2025-07-11 8:17 ` Yu Zhang
2025-07-24 3:06 ` Baolu Lu
3 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2025-07-11 2:54 UTC (permalink / raw)
To: Hansen, Dave, Jason Gunthorpe
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Tian, Kevin
> Sent: Friday, July 11, 2025 10:46 AM
>
> > From: Hansen, Dave <dave.hansen@intel.com>
> > Sent: Thursday, July 10, 2025 11:26 PM
> >
> > On 7/10/25 06:22, Jason Gunthorpe wrote:
> > >> Why does this matter? We flush the CPU TLB in a bunch of different
> ways,
> > >> _especially_ when it's being done for kernel mappings. For example,
> > >> __flush_tlb_all() is a non-ranged kernel flush which has a completely
> > >> parallel implementation with flush_tlb_kernel_range(). Call sites that
> > >> use _it_ are unaffected by the patch here.
> > >>
> > >> Basically, if we're only worried about vmalloc/vfree freeing page
> > >> tables, then this patch is OK. If the problem is bigger than that, then
> > >> we need a more comprehensive patch.
> > > I think we are worried about any place that frees page tables.
> >
> > The two places that come to mind are the remove_memory() code and
> > __change_page_attr().
> >
> > The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
> > that do flush_tlb_all(). Now that I'm looking at it, there look to be
> > some races between freeing page tables pages and flushing the TLB. But,
> > basically, if you stick to the sites in there that do flush_tlb_all()
> > after free_pagetable(), you should be good.
> >
>
> Isn't doing flush after free_pagetable() leaving a small window for attack?
> the page table is freed and may have been repurposed while the stale
> paging structure cache still treats it as a page table page...
>
> looks it's not unusual to see similar pattern in other mm code:
>
> vmemmap_split_pmd()
> {
> if (likely(pmd_leaf(*pmd))) {
> ...
> } else {
> pte_free_kernel(&init_mm, pgtable);
> }
> ...
> }
Oh, please ignore this comment. It's not about freeing an existing
page table - that was an incautious read on my part. 😊
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 13:54 ` Peter Zijlstra
2025-07-10 15:53 ` Peter Zijlstra
@ 2025-07-11 3:00 ` Baolu Lu
2025-07-11 4:01 ` Tian, Kevin
` (2 more replies)
1 sibling, 3 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-11 3:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
Hi Peter Z,
On 7/10/25 21:54, Peter Zijlstra wrote:
> On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
>> The vmalloc() and vfree() functions manage virtually contiguous, but not
>> necessarily physically contiguous, kernel memory regions. When vfree()
>> unmaps such a region, it tears down the associated kernel page table
>> entries and frees the physical pages.
>>
>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>> shares and walks the CPU's page tables. Architectures like x86 share
>> static kernel address mappings across all user page tables, allowing the
>> IOMMU to access the kernel portion of these tables.
>>
>> Modern IOMMUs often cache page table entries to optimize walk performance,
>> even for intermediate page table levels. If kernel page table mappings are
>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>> entries, Use-After-Free (UAF) vulnerability condition arises. If these
>> freed page table pages are reallocated for a different purpose, potentially
>> by an attacker, the IOMMU could misinterpret the new data as valid page
>> table entries. This allows the IOMMU to walk into attacker-controlled
>> memory, leading to arbitrary physical memory DMA access or privilege
>> escalation.
>>
>> To mitigate this, introduce a new iommu interface to flush IOMMU caches
>> and fence pending page table walks when kernel page mappings are updated.
>> This interface should be invoked from architecture-specific code that
>> manages combined user and kernel page tables.
>
> I must say I liked the kPTI based idea better. Having to iterate and
> invalidate an unspecified number of IOMMUs from non-preemptible context
> seems 'unfortunate'.
The cache invalidation path in IOMMU drivers is already a critical path
that operates in non-preemptible context. This approach has, in fact,
been used for user-space page table updates since the beginning
of SVA support.
>
> Why was this approach chosen over the kPTI one, where we keep a
> page-table root that simply does not include the kernel bits, and
> therefore the IOMMU will never see them (change) and we'll never have to
> invalidate?
The IOMMU subsystem started supporting the SVA feature in 2019, and it
has been broadly adopted in various production kernels. The issue
described here is fundamentally a software bug related to not
maintaining IOMMU cache coherence. Therefore, we need a quick and simple
fix to address it, and this patch can be easily backported to production
kernels.
While a kPTI-based approach might appear more attractive, I believe some
extra work is still required to properly integrate it into the IOMMU
subsystem. For instance, kPTI is currently an optional mitigation,
enabled via CONFIG_MITIGATION_PAGE_TABLE_ISOLATION and bypassable with
the "nopti" kernel parameter. This optionality is not suitable for the
IOMMU subsystem, as software must always guarantee IOMMU cache coherence
for functional correctness and security.
So, in the short term, let's proceed with a straightforward solution to
resolve this issue and ensure the SVA feature functions correctly. For
the long term, we can explore optimizations and deeper integration
aligned with features like kPTI.
>
>> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>> if (ret)
>> goto out_free_domain;
>> domain->users = 1;
>> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>>
>> + if (list_empty(&iommu_mm->sva_domains)) {
>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>> + if (list_empty(&iommu_sva_mms))
>> + static_branch_enable(&iommu_sva_present);
>> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
>> + }
>> + }
>> + list_add(&domain->next, &iommu_mm->sva_domains);
>> out:
>> refcount_set(&handle->users, 1);
>> mutex_unlock(&iommu_sva_lock);
>> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>> list_del(&domain->next);
>> iommu_domain_free(domain);
>> }
>> +
>> + if (list_empty(&iommu_mm->sva_domains)) {
>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>> + list_del(&iommu_mm->mm_list_elm);
>> + if (list_empty(&iommu_sva_mms))
>> + static_branch_disable(&iommu_sva_present);
>> + }
>> + }
>> +
>> mutex_unlock(&iommu_sva_lock);
>> kfree(handle);
>> }
>
> This seems an odd coding style choice; why the extra unneeded
> indentation? That is, what's wrong with:
>
> if (list_empty()) {
> guard(spinlock_irqsave)(&iommu_mms_lock);
> list_del();
> if (list_empty()
> static_branch_disable();
> }
Perhaps I overlooked or misunderstood something, but my understanding
is,
The lock order in this function is:
mutex_lock(&iommu_sva_lock);
spin_lock(&iommu_mms_lock);
spin_unlock(&iommu_mms_lock);
mutex_unlock(&iommu_sva_lock);
With above change, it is changed to:
mutex_lock(&iommu_sva_lock);
spin_lock(&iommu_mms_lock);
mutex_unlock(&iommu_sva_lock);
spin_unlock(&iommu_mms_lock);
>
>> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>>
>> return domain;
>> }
>> +
>> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
>> +{
>> + struct iommu_mm_data *iommu_mm;
>> +
>> + if (!static_branch_unlikely(&iommu_sva_present))
>> + return;
>> +
>> + guard(spinlock_irqsave)(&iommu_mms_lock);
>> + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
>> + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
>> +}
>
> This is absolutely the wrong way to use static_branch. You want them in
> inline functions guarding the function call, not inside the function
> call.
I don't think a static branch is desirable here, as we have no idea how
often the condition will switch in real-world scenarios. I will remove
it in the next version if there are no objections.
Thanks,
baolu
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:53 ` Peter Zijlstra
@ 2025-07-11 3:09 ` Baolu Lu
2025-07-11 8:27 ` Peter Zijlstra
2025-07-16 11:57 ` David Laight
1 sibling, 1 reply; 38+ messages in thread
From: Baolu Lu @ 2025-07-11 3:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On 7/10/25 23:53, Peter Zijlstra wrote:
> On Thu, Jul 10, 2025 at 03:54:32PM +0200, Peter Zijlstra wrote:
>
>>> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>>> if (ret)
>>> goto out_free_domain;
>>> domain->users = 1;
>>> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>>>
>>> + if (list_empty(&iommu_mm->sva_domains)) {
>>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>>> + if (list_empty(&iommu_sva_mms))
>>> + static_branch_enable(&iommu_sva_present);
>>> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
>>> + }
>>> + }
>>> + list_add(&domain->next, &iommu_mm->sva_domains);
>>> out:
>>> refcount_set(&handle->users, 1);
>>> mutex_unlock(&iommu_sva_lock);
>>> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>>> list_del(&domain->next);
>>> iommu_domain_free(domain);
>>> }
>>> +
>>> + if (list_empty(&iommu_mm->sva_domains)) {
>>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>>> + list_del(&iommu_mm->mm_list_elm);
>>> + if (list_empty(&iommu_sva_mms))
>>> + static_branch_disable(&iommu_sva_present);
>>> + }
>>> + }
>>> +
>>> mutex_unlock(&iommu_sva_lock);
>>> kfree(handle);
>>> }
>>
>> This seems an odd coding style choice; why the extra unneeded
>> indentation? That is, what's wrong with:
>>
>> if (list_empty()) {
>> guard(spinlock_irqsave)(&iommu_mms_lock);
>> list_del();
>> if (list_empty()
>> static_branch_disable();
>> }
>
> Well, for one, you can't do static_branch_{en,dis}able() from atomic
> context...
>
> Was this ever tested?
I conducted unit tests for vmalloc()/vfree() scenarios, and Yi performed
fuzzing tests. We have not observed any warning messages. Perhaps
static_branch_disable() is not triggered in the test cases?
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 3:00 ` Baolu Lu
@ 2025-07-11 4:01 ` Tian, Kevin
2025-07-11 8:32 ` Peter Zijlstra
2025-07-11 11:54 ` Jason Gunthorpe
2 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2025-07-11 4:01 UTC (permalink / raw)
To: Baolu Lu, Peter Zijlstra
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jason Gunthorpe,
Jann Horn, Vasant Hegde, Hansen, Dave, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Lai, Yi1, iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Friday, July 11, 2025 11:00 AM
>
> On 7/10/25 21:54, Peter Zijlstra wrote:
> > On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> >> +
> >> + if (list_empty(&iommu_mm->sva_domains)) {
> >> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> >> + list_del(&iommu_mm->mm_list_elm);
> >> + if (list_empty(&iommu_sva_mms))
> >> + static_branch_disable(&iommu_sva_present);
> >> + }
> >> + }
> >> +
> >> mutex_unlock(&iommu_sva_lock);
> >> kfree(handle);
> >> }
> >
> > This seems an odd coding style choice; why the extra unneeded
> > indentation? That is, what's wrong with:
> >
> > if (list_empty()) {
> > guard(spinlock_irqsave)(&iommu_mms_lock);
> > list_del();
> > if (list_empty()
> > static_branch_disable();
> > }
>
> Perhaps I have overlooked or misunderstood something, but my understanding
> is as follows.
>
> The lock order in this function is:
>
> mutex_lock(&iommu_sva_lock);
> spin_lock(&iommu_mms_lock);
> spin_unlock(&iommu_mms_lock);
> mutex_unlock(&iommu_sva_lock);
>
> With the above change, it becomes:
>
> mutex_lock(&iommu_sva_lock);
> spin_lock(&iommu_mms_lock);
> mutex_unlock(&iommu_sva_lock);
> spin_unlock(&iommu_mms_lock);
>
guard() follows the scope of the variable declaration. Actually, the above
is a perfect match for the example in cleanup.h:
* The lifetime of the lock obtained by the guard() helper follows the
* scope of automatic variable declaration. Take the following example::
*
* func(...)
* {
* if (...) {
* ...
* guard(pci_dev)(dev); // pci_dev_lock() invoked here
* ...
* } // <- implied pci_dev_unlock() triggered here
* }
*
* Observe the lock is held for the remainder of the "if ()" block not
* the remainder of "func()".
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:26 ` Dave Hansen
2025-07-11 2:46 ` Tian, Kevin
2025-07-11 2:54 ` Tian, Kevin
@ 2025-07-11 8:17 ` Yu Zhang
2025-07-24 3:01 ` Baolu Lu
2025-07-24 3:06 ` Baolu Lu
3 siblings, 1 reply; 38+ messages in thread
From: Yu Zhang @ 2025-07-11 8:17 UTC (permalink / raw)
To: Dave Hansen
Cc: Jason Gunthorpe, Baolu Lu, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai, iommu,
security, linux-kernel, stable
On Thu, Jul 10, 2025 at 08:26:06AM -0700, Dave Hansen wrote:
> On 7/10/25 06:22, Jason Gunthorpe wrote:
> >> Why does this matter? We flush the CPU TLB in a bunch of different ways,
> >> _especially_ when it's being done for kernel mappings. For example,
> >> __flush_tlb_all() is a non-ranged kernel flush which has a completely
> >> parallel implementation with flush_tlb_kernel_range(). Call sites that
> >> use _it_ are unaffected by the patch here.
> >>
> >> Basically, if we're only worried about vmalloc/vfree freeing page
> >> tables, then this patch is OK. If the problem is bigger than that, then
> >> we need a more comprehensive patch.
> > I think we are worried about any place that frees page tables.
>
> The two places that come to mind are the remove_memory() code and
> __change_page_attr().
>
> The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
> that do flush_tlb_all(). Now that I'm looking at it, there look to be
> some races between freeing page tables pages and flushing the TLB. But,
> basically, if you stick to the sites in there that do flush_tlb_all()
> after free_pagetable(), you should be good.
>
> As for the __change_page_attr() code, I think the only spot you need to
> hit is cpa_collapse_large_pages() and maybe the one in
> __split_large_page() as well.
>
> This is all disturbingly ad-hoc, though. The remove_memory() code needs
> fixing and I'll probably go try to bring some order to the chaos in the
> process of fixing it up. But that's a separate problem than this IOMMU fun.
>
Could we consider splitting flush_tlb_kernel_range() into 2 different
versions:
- one which only flushes the CPU TLB
- one which flushes the CPU paging structure cache and then notifies the
IOMMU to do the same (e.g., in pud_free_pmd_page()/pmd_free_pte_page())?
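A rough sketch of what such a split might look like; the helper names below
are hypothetical and assume the IOMMU notification is pulled out of
flush_tlb_kernel_range() itself:

/* Hypothetical sketch only: neither helper exists in the tree today. */

/* For unmaps that only drop leaf PTEs: a CPU TLB flush is sufficient. */
static inline void flush_tlb_kernel_range_cpu_only(unsigned long start,
                                                   unsigned long end)
{
        flush_tlb_kernel_range(start, end);
}

/*
 * For paths that free page-table pages, e.g. pud_free_pmd_page() and
 * pmd_free_pte_page(): also invalidate the IOMMU paging-structure caches.
 */
static inline void flush_tlb_kernel_pgtable_range(unsigned long start,
                                                  unsigned long end)
{
        flush_tlb_kernel_range(start, end);
        iommu_sva_invalidate_kva_range(start, end);
}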
B.R.
Yu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 3:09 ` Baolu Lu
@ 2025-07-11 8:27 ` Peter Zijlstra
0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2025-07-11 8:27 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On Fri, Jul 11, 2025 at 11:09:00AM +0800, Baolu Lu wrote:
> On 7/10/25 23:53, Peter Zijlstra wrote:
> > On Thu, Jul 10, 2025 at 03:54:32PM +0200, Peter Zijlstra wrote:
> >
> > > > @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> > > > if (ret)
> > > > goto out_free_domain;
> > > > domain->users = 1;
> > > > - list_add(&domain->next, &mm->iommu_mm->sva_domains);
> > > > + if (list_empty(&iommu_mm->sva_domains)) {
> > > > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > > > + if (list_empty(&iommu_sva_mms))
> > > > + static_branch_enable(&iommu_sva_present);
> > > > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> > > > + }
> > > > + }
> > > > + list_add(&domain->next, &iommu_mm->sva_domains);
> > > > out:
> > > > refcount_set(&handle->users, 1);
> > > > mutex_unlock(&iommu_sva_lock);
> > > > @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> > > > list_del(&domain->next);
> > > > iommu_domain_free(domain);
> > > > }
> > > > +
> > > > + if (list_empty(&iommu_mm->sva_domains)) {
> > > > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > > > + list_del(&iommu_mm->mm_list_elm);
> > > > + if (list_empty(&iommu_sva_mms))
> > > > + static_branch_disable(&iommu_sva_present);
> > > > + }
> > > > + }
> > > > +
> > > > mutex_unlock(&iommu_sva_lock);
> > > > kfree(handle);
> > > > }
> > >
> > > This seems an odd coding style choice; why the extra unneeded
> > > indentation? That is, what's wrong with:
> > >
> > > if (list_empty()) {
> > > guard(spinlock_irqsave)(&iommu_mms_lock);
> > > list_del();
> > > if (list_empty()
> > > static_branch_disable();
> > > }
> >
> > Well, for one, you can't do static_branch_{en,dis}able() from atomic
> > context...
> >
> > Was this ever tested?
>
> I conducted unit tests for vmalloc()/vfree() scenarios, and Yi performed
> fuzzing tests. We have not observed any warning messages. Perhaps
> static_branch_disable() is not triggered in the test cases?
Same with static_branch_enable(). These functions start with
cpus_read_lock(), which is percpu_down_read(), which has might_sleep().
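For illustration, a minimal reproduction of the problem being pointed out
(the lock and key names here are hypothetical):

static DEFINE_STATIC_KEY_FALSE(example_key);
static DEFINE_SPINLOCK(example_lock);

static void example_disable(void)
{
        unsigned long flags;

        spin_lock_irqsave(&example_lock, flags);        /* atomic context */
        /*
         * static_branch_disable() -> cpus_read_lock() -> percpu_down_read()
         * -> might_sleep(): a "sleeping function called from invalid
         * context" splat when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
         */
        static_branch_disable(&example_key);
        spin_unlock_irqrestore(&example_lock, flags);
}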
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 3:00 ` Baolu Lu
2025-07-11 4:01 ` Tian, Kevin
@ 2025-07-11 8:32 ` Peter Zijlstra
2025-07-11 11:58 ` Jason Gunthorpe
2025-07-15 5:55 ` Baolu Lu
2025-07-11 11:54 ` Jason Gunthorpe
2 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2025-07-11 8:32 UTC (permalink / raw)
To: Baolu Lu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On Fri, Jul 11, 2025 at 11:00:06AM +0800, Baolu Lu wrote:
> Hi Peter Z,
>
> On 7/10/25 21:54, Peter Zijlstra wrote:
> > On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
> > > The vmalloc() and vfree() functions manage virtually contiguous, but not
> > > necessarily physically contiguous, kernel memory regions. When vfree()
> > > unmaps such a region, it tears down the associated kernel page table
> > > entries and frees the physical pages.
> > >
> > > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> > > shares and walks the CPU's page tables. Architectures like x86 share
> > > static kernel address mappings across all user page tables, allowing the
> > > IOMMU to access the kernel portion of these tables.
> > >
> > > Modern IOMMUs often cache page table entries to optimize walk performance,
> > > even for intermediate page table levels. If kernel page table mappings are
> > > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> > > entries, Use-After-Free (UAF) vulnerability condition arises. If these
> > > freed page table pages are reallocated for a different purpose, potentially
> > > by an attacker, the IOMMU could misinterpret the new data as valid page
> > > table entries. This allows the IOMMU to walk into attacker-controlled
> > > memory, leading to arbitrary physical memory DMA access or privilege
> > > escalation.
> > >
> > > To mitigate this, introduce a new iommu interface to flush IOMMU caches
> > > and fence pending page table walks when kernel page mappings are updated.
> > > This interface should be invoked from architecture-specific code that
> > > manages combined user and kernel page tables.
> >
> > I must say I liked the kPTI based idea better. Having to iterate and
> > invalidate an unspecified number of IOMMUs from non-preemptible context
> > seems 'unfortunate'.
>
> The cache invalidation path in IOMMU drivers is already critical and
> operates within a non-preemptible context. This approach is, in fact,
> already utilized for user-space page table updates since the beginning
> of SVA support.
OK, fair enough I suppose. What kind of delays are we talking about
here? The fact that you basically have an unbounded list of IOMMUs
(although in practice I suppose it is limited by the amount of GPUs and
other fancy stuff you can stick in your machine) does slightly worry me.
At some point the low latency folks are going to come hunting you down.
Do you have a plan on how to deal with this; or are we throwing up our
hands and saying, the hardware sucks, deal with it?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 3:00 ` Baolu Lu
2025-07-11 4:01 ` Tian, Kevin
2025-07-11 8:32 ` Peter Zijlstra
@ 2025-07-11 11:54 ` Jason Gunthorpe
2 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2025-07-11 11:54 UTC (permalink / raw)
To: Baolu Lu
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On Fri, Jul 11, 2025 at 11:00:06AM +0800, Baolu Lu wrote:
> > > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> > > +{
> > > + struct iommu_mm_data *iommu_mm;
> > > +
> > > + if (!static_branch_unlikely(&iommu_sva_present))
> > > + return;
> > > +
> > > + guard(spinlock_irqsave)(&iommu_mms_lock);
> > > + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> > > + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
> > > +}
> >
> > This is absolutely the wrong way to use static_branch. You want them in
> > inline functions guarding the function call, not inside the function
> > call.
>
> I don't think a static branch is desirable here, as we have no idea how
> often the condition will switch in real-world scenarios. I will remove
> it in the next version if there are no objections.
The point of the static branch was to make the 99% normal case where
SVA has never been used have no-cost in the core MM. I think we should
keep that idea
Jason
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 8:32 ` Peter Zijlstra
@ 2025-07-11 11:58 ` Jason Gunthorpe
2025-07-15 5:55 ` Baolu Lu
1 sibling, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2025-07-11 11:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On Fri, Jul 11, 2025 at 10:32:52AM +0200, Peter Zijlstra wrote:
> At some point the low latency folks are going to come hunting you down.
> Do you have a plan on how to deal with this; or are we throwing up our
> hands and saying, the hardware sucks, deal with it?
If you turn on SVA in a process then all this same invalidation work
happens under someone's spinlock even today.. If it becomes a latency
problem for someone then it seems to be much bigger than just this bit.
Most likely the answer is that SVA and realtime are not compatible
ideas. I know I've had this conversation with some of our realtime
people and they have said they don't want to touch SVA because it
makes the device latencies indeterminate with possible faults.
Jason
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 8:32 ` Peter Zijlstra
2025-07-11 11:58 ` Jason Gunthorpe
@ 2025-07-15 5:55 ` Baolu Lu
2025-07-15 12:25 ` Jason Gunthorpe
1 sibling, 1 reply; 38+ messages in thread
From: Baolu Lu @ 2025-07-15 5:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On 7/11/25 16:32, Peter Zijlstra wrote:
> On Fri, Jul 11, 2025 at 11:00:06AM +0800, Baolu Lu wrote:
>> Hi Peter Z,
>>
>> On 7/10/25 21:54, Peter Zijlstra wrote:
>>> On Wed, Jul 09, 2025 at 02:28:00PM +0800, Lu Baolu wrote:
>>>> The vmalloc() and vfree() functions manage virtually contiguous, but not
>>>> necessarily physically contiguous, kernel memory regions. When vfree()
>>>> unmaps such a region, it tears down the associated kernel page table
>>>> entries and frees the physical pages.
>>>>
>>>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>>>> shares and walks the CPU's page tables. Architectures like x86 share
>>>> static kernel address mappings across all user page tables, allowing the
>>>> IOMMU to access the kernel portion of these tables.
>>>>
>>>> Modern IOMMUs often cache page table entries to optimize walk performance,
>>>> even for intermediate page table levels. If kernel page table mappings are
>>>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>>>> entries, Use-After-Free (UAF) vulnerability condition arises. If these
>>>> freed page table pages are reallocated for a different purpose, potentially
>>>> by an attacker, the IOMMU could misinterpret the new data as valid page
>>>> table entries. This allows the IOMMU to walk into attacker-controlled
>>>> memory, leading to arbitrary physical memory DMA access or privilege
>>>> escalation.
>>>>
>>>> To mitigate this, introduce a new iommu interface to flush IOMMU caches
>>>> and fence pending page table walks when kernel page mappings are updated.
>>>> This interface should be invoked from architecture-specific code that
>>>> manages combined user and kernel page tables.
>>>
>>> I must say I liked the kPTI based idea better. Having to iterate and
>>> invalidate an unspecified number of IOMMUs from non-preemptible context
>>> seems 'unfortunate'.
>>
>> The cache invalidation path in IOMMU drivers is already critical and
>> operates within a non-preemptible context. This approach is, in fact,
>> already utilized for user-space page table updates since the beginning
>> of SVA support.
>
> OK, fair enough I suppose. What kind of delays are we talking about
> here? The fact that you basically have an unbounded list of IOMMUs
> (although in practice I suppose it is limited by the amount of GPUs and
> other fancy stuff you can stick in your machine) does slightly worry me.
Yes, the mm (struct mm of processes that are bound to devices) list is
an unbounded list and can theoretically grow indefinitely. This results
in an unpredictable critical region.
I am considering whether this could be relaxed a bit if we manage the
IOMMU device list that is used for SVA. The number of IOMMU hardware
units in a system is limited and bounded, which might make the critical
region deterministic.
If that's reasonable, we can do it like this (compiled but not tested):
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..9ed3be2ffaeb 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -9,6 +9,14 @@
#include "iommu-priv.h"
+struct sva_iommu_device_item {
+ struct iommu_device *iommu;
+ unsigned int users;
+ struct list_head node;
+};
+
+static LIST_HEAD(sva_iommu_device_list);
+static DEFINE_SPINLOCK(sva_iommu_device_lock);
static DEFINE_MUTEX(iommu_sva_lock);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -52,6 +60,71 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return iommu_mm;
}
+static int iommu_sva_add_iommu_device(struct device *dev)
+{
+ struct iommu_device *iommu = dev->iommu->iommu_dev;
+ struct sva_iommu_device_item *iter;
+
+ struct sva_iommu_device_item *new __free(kfree) =
+ kzalloc(sizeof(*new), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ new->iommu = iommu;
+ new->users = 1;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry(iter, &sva_iommu_device_list, node) {
+ if (iter->iommu == iommu) {
+ iter->users++;
+ return 0;
+ }
+ }
+ list_add(&no_free_ptr(new)->node, &sva_iommu_device_list);
+
+ return 0;
+}
+
+static void iommu_sva_remove_iommu_device(struct device *dev)
+{
+ struct iommu_device *iommu = dev->iommu->iommu_dev;
+ struct sva_iommu_device_item *iter, *tmp;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry_safe(iter, tmp, &sva_iommu_device_list, node) {
+ if (iter->iommu != iommu)
+ continue;
+
+ if (--iter->users == 0) {
+ list_del(&iter->node);
+ kfree(iter);
+ }
+ break;
+ }
+}
+
+static int iommu_sva_attach_device(struct iommu_domain *domain, struct device *dev,
+ ioasid_t pasid, struct iommu_attach_handle *handle)
+{
+ int ret;
+
+ ret = iommu_sva_add_iommu_device(dev);
+ if (ret)
+ return ret;
+
+ ret = iommu_attach_device_pasid(domain, dev, pasid, handle);
+ if (ret)
+ iommu_sva_remove_iommu_device(dev);
+
+ return ret;
+}
+
+static void iommu_sva_detach_device(struct iommu_domain *domain,
+ struct device *dev, ioasid_t pasid)
+{
+ iommu_detach_device_pasid(domain, dev, pasid);
+ iommu_sva_remove_iommu_device(dev);
+}
+
/**
* iommu_sva_bind_device() - Bind a process address space to a device
* @dev: the device
@@ -112,8 +185,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
/* Search for an existing domain. */
list_for_each_entry(domain, &mm->iommu_mm->sva_domains, next) {
- ret = iommu_attach_device_pasid(domain, dev, iommu_mm->pasid,
- &handle->handle);
+ ret = iommu_sva_attach_device(domain, dev, iommu_mm->pasid, &handle->handle);
if (!ret) {
domain->users++;
goto out;
@@ -127,8 +199,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
goto out_free_handle;
}
- ret = iommu_attach_device_pasid(domain, dev, iommu_mm->pasid,
- &handle->handle);
+ ret = iommu_sva_attach_device(domain, dev, iommu_mm->pasid, &handle->handle);
if (ret)
goto out_free_domain;
domain->users = 1;
@@ -170,7 +241,7 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
return;
}
- iommu_detach_device_pasid(domain, dev, iommu_mm->pasid);
+ iommu_sva_detach_device(domain, dev, iommu_mm->pasid);
if (--domain->users == 0) {
list_del(&domain->next);
iommu_domain_free(domain);
@@ -312,3 +383,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct sva_iommu_device_item *item;
+
+ guard(spinlock_irqsave)(&sva_iommu_device_lock);
+ list_for_each_entry(item, &sva_iommu_device_list, node) {
+ if (!item->iommu->ops->paging_cache_invalidate)
+ continue;
+ item->iommu->ops->paging_cache_invalidate(item->iommu, start, end);
+ }
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 156732807994..f3716200cc09 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -595,6 +595,8 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
* - IOMMU_DOMAIN_IDENTITY: must use an identity domain
* - IOMMU_DOMAIN_DMA: must use a dma domain
* - 0: use the default setting
+ * @paging_cache_invalidate: Invalidate paging structure caches that store
+ * intermediate levels of the page table.
* @default_domain_ops: the default ops for domains
* @viommu_alloc: Allocate an iommufd_viommu on a physical IOMMU instance behind
* the @dev, as the set of virtualization resources shared/passed
@@ -654,6 +656,9 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
+ void (*paging_cache_invalidate)(struct iommu_device *dev,
+ unsigned long start, unsigned long end);
+
struct iommufd_viommu *(*viommu_alloc)(
struct device *dev, struct iommu_domain *parent_domain,
struct iommufd_ctx *ictx, unsigned int viommu_type);
@@ -1571,6 +1576,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1595,6 +1601,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
--
2.43.0
> At some point the low latency folks are going to come hunting you down.
> Do you have a plan on how to deal with this; or are we throwing up our
> hands and saying, the hardware sucks, deal with it?
>
Thanks,
baolu
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-15 5:55 ` Baolu Lu
@ 2025-07-15 12:25 ` Jason Gunthorpe
2025-07-16 6:34 ` Baolu Lu
0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2025-07-15 12:25 UTC (permalink / raw)
To: Baolu Lu
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On Tue, Jul 15, 2025 at 01:55:01PM +0800, Baolu Lu wrote:
> Yes, the mm (struct mm of processes that are bound to devices) list is
> an unbounded list and can theoretically grow indefinitely. This results
> in an unpredictable critical region.
Every MM has a unique PASID so I don't see how you can avoid this.
> @@ -654,6 +656,9 @@ struct iommu_ops {
>
> int (*def_domain_type)(struct device *dev);
>
> + void (*paging_cache_invalidate)(struct iommu_device *dev,
> + unsigned long start, unsigned long end);
How would you even implement this in a driver?
You either flush the whole iommu, in which case who needs a range, or
the driver has to iterate over the PASID list, in which case it
doesn't really improve the situation.
If this is a concern I think the better answer is to do a deferred free
like the mm can sometimes do where we thread the page tables onto a
linked list, flush the CPU cache and push it all into a work which
will do the iommu flush before actually freeing the memory.
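A minimal sketch of that deferred-free idea (all names here are
hypothetical; it reuses iommu_sva_invalidate_kva_range() from this patch as
the flush primitive):

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/iommu.h>

struct kva_deferred_free {
        struct work_struct work;
        struct list_head pages;         /* freed page-table pages, via page->lru */
        unsigned long start, end;
};

static void kva_deferred_free_fn(struct work_struct *work)
{
        struct kva_deferred_free *df =
                container_of(work, struct kva_deferred_free, work);
        struct page *page, *next;

        /* Sleepable context: flush the IOMMU paging-structure caches first. */
        iommu_sva_invalidate_kva_range(df->start, df->end);

        /* Only now is it safe to hand the page-table pages back. */
        list_for_each_entry_safe(page, next, &df->pages, lru) {
                list_del(&page->lru);
                __free_page(page);
        }
        kfree(df);
}

/* Called from the unmap path after the CPU TLB flush, instead of freeing. */
static void kva_defer_pagetable_free(struct page *pt_page,
                                     unsigned long start, unsigned long end)
{
        struct kva_deferred_free *df = kzalloc(sizeof(*df), GFP_ATOMIC);

        if (!df)
                return;         /* a synchronous fallback path is not shown */

        INIT_LIST_HEAD(&df->pages);
        list_add(&pt_page->lru, &df->pages);
        df->start = start;
        df->end = end;
        INIT_WORK(&df->work, kva_deferred_free_fn);
        schedule_work(&df->work);
}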
One of the KPTI options might be easier at that point..
Jason
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-15 12:25 ` Jason Gunthorpe
@ 2025-07-16 6:34 ` Baolu Lu
2025-07-16 12:08 ` Jason Gunthorpe
0 siblings, 1 reply; 38+ messages in thread
From: Baolu Lu @ 2025-07-16 6:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On 7/15/25 20:25, Jason Gunthorpe wrote:
> On Tue, Jul 15, 2025 at 01:55:01PM +0800, Baolu Lu wrote:
>> Yes, the mm (struct mm of processes that are bound to devices) list is
>> an unbounded list and can theoretically grow indefinitely. This results
>> in an unpredictable critical region.
>
> Every MM has a unique PASID so I don't see how you can avoid this.
>
>> @@ -654,6 +656,9 @@ struct iommu_ops {
>>
>> int (*def_domain_type)(struct device *dev);
>>
>> + void (*paging_cache_invalidate)(struct iommu_device *dev,
>> + unsigned long start, unsigned long end);
>
> How would you even implement this in a driver?
>
> You either flush the whole iommu, in which case who needs a range, or
> the driver has to iterate over the PASID list, in which case it
> doesn't really improve the situation.
The Intel iommu driver supports flushing all SVA PASIDs with a single
request in the invalidation queue. I am not sure if other IOMMU
implementations also support this, so you are right, it doesn't
generally improve the situation.
>
> > If this is a concern I think the better answer is to do a deferred free
> like the mm can sometimes do where we thread the page tables onto a
> linked list, flush the CPU cache and push it all into a work which
> will do the iommu flush before actually freeing the memory.
Is it a workable solution to use schedule_work() to queue the KVA cache
invalidation as a work item in the system workqueue? By doing so, we
wouldn't need the spinlock to protect the list anymore.
We might also need another interface, perhaps named
iommu_sva_flush_kva_inv_wq(), to guarantee that all flush work has
completed before the pages are actually freed.
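A minimal sketch of that barrier, assuming the invalidation work is queued
on a dedicated workqueue rather than the system one (names hypothetical):

static struct workqueue_struct *kva_inv_wq;     /* alloc_workqueue() at init time */

void iommu_sva_flush_kva_inv_wq(void)
{
        /* Wait for all queued KVA invalidation work items to finish. */
        flush_workqueue(kva_inv_wq);
}

Page-table-freeing paths would call this before returning the freed pages
to the allocator.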
> One of the KPTI options might be easier at that point..
>
> Jason
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-09 6:28 [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu
` (2 preceding siblings ...)
2025-07-10 13:54 ` Peter Zijlstra
@ 2025-07-16 10:54 ` Yi Liu
2025-07-17 1:51 ` Baolu Lu
3 siblings, 1 reply; 38+ messages in thread
From: Yi Liu @ 2025-07-16 10:54 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, stable
On 2025/7/9 14:28, Lu Baolu wrote:
> The vmalloc() and vfree() functions manage virtually contiguous, but not
> necessarily physically contiguous, kernel memory regions. When vfree()
> unmaps such a region, it tears down the associated kernel page table
> entries and frees the physical pages.
>
> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
> shares and walks the CPU's page tables. Architectures like x86 share
> static kernel address mappings across all user page tables, allowing the
> IOMMU to access the kernel portion of these tables.
I remember Jason once clarified that there is no support for kernel SVA. I
don't think Linux has such support so far. If so, maybe just drop the
static mapping wording. This can be an attack surface mainly because the
page table may include both user and kernel mappings.
> Modern IOMMUs often cache page table entries to optimize walk performance,
> even for intermediate page table levels. If kernel page table mappings are
> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
> entries, Use-After-Free (UAF) vulnerability condition arises. If these
> freed page table pages are reallocated for a different purpose, potentially
> by an attacker, the IOMMU could misinterpret the new data as valid page
> table entries. This allows the IOMMU to walk into attacker-controlled
> memory, leading to arbitrary physical memory DMA access or privilege
> escalation.
Does this fix cover the page compaction and de-compaction as well? It is
valuable to call out that the mm subsystem does not notify the iommu of page table
modifications except for those related to user VA, hence SVA is
at risk of using stale intermediate caches due to this.
> To mitigate this, introduce a new iommu interface to flush IOMMU caches
> and fence pending page table walks when kernel page mappings are updated.
> This interface should be invoked from architecture-specific code that
> manages combined user and kernel page tables.
aha, this is what I'm trying to find. Using page tables with both kernel
and user mappings is the prerequisite for this bug. :)
> Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
> Cc: stable@vger.kernel.org
> Reported-by: Jann Horn <jannh@google.com>
> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> ---
> arch/x86/mm/tlb.c | 2 ++
> drivers/iommu/iommu-sva.c | 34 +++++++++++++++++++++++++++++++++-
> include/linux/iommu.h | 4 ++++
> 3 files changed, 39 insertions(+), 1 deletion(-)
>
> Change log:
> v2:
> - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
> - Replace the mutex with a spinlock to make the interface usable in the
> critical regions.
>
> v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..a41499dfdc3f 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -12,6 +12,7 @@
> #include <linux/task_work.h>
> #include <linux/mmu_notifier.h>
> #include <linux/mmu_context.h>
> +#include <linux/iommu.h>
>
> #include <asm/tlbflush.h>
> #include <asm/mmu_context.h>
> @@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> kernel_tlb_flush_range(info);
>
> put_flush_tlb_info();
> + iommu_sva_invalidate_kva_range(start, end);
> }
>
> /*
> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
> index 1a51cfd82808..fd76aefa5a88 100644
> --- a/drivers/iommu/iommu-sva.c
> +++ b/drivers/iommu/iommu-sva.c
> @@ -10,6 +10,9 @@
> #include "iommu-priv.h"
>
> static DEFINE_MUTEX(iommu_sva_lock);
> +static DEFINE_STATIC_KEY_FALSE(iommu_sva_present);
> +static LIST_HEAD(iommu_sva_mms);
> +static DEFINE_SPINLOCK(iommu_mms_lock);
> static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
> struct mm_struct *mm);
>
> @@ -42,6 +45,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
> return ERR_PTR(-ENOSPC);
> }
> iommu_mm->pasid = pasid;
> + iommu_mm->mm = mm;
> INIT_LIST_HEAD(&iommu_mm->sva_domains);
> /*
> * Make sure the write to mm->iommu_mm is not reordered in front of
> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> if (ret)
> goto out_free_domain;
> domain->users = 1;
> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + if (list_empty(&iommu_sva_mms))
> + static_branch_enable(&iommu_sva_present);
> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> + }
> + }
> + list_add(&domain->next, &iommu_mm->sva_domains);
> out:
> refcount_set(&handle->users, 1);
> mutex_unlock(&iommu_sva_lock);
> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> list_del(&domain->next);
> iommu_domain_free(domain);
> }
> +
> + if (list_empty(&iommu_mm->sva_domains)) {
> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> + list_del(&iommu_mm->mm_list_elm);
> + if (list_empty(&iommu_sva_mms))
> + static_branch_disable(&iommu_sva_present);
> + }
> + }
> +
> mutex_unlock(&iommu_sva_lock);
> kfree(handle);
> }
> @@ -312,3 +332,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>
> return domain;
> }
> +
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
> +{
> + struct iommu_mm_data *iommu_mm;
> +
> + if (!static_branch_unlikely(&iommu_sva_present))
> + return;
> +
> + guard(spinlock_irqsave)(&iommu_mms_lock);
> + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
> + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
is it possible the TLB flush side calls this API per mm?
> +}
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 156732807994..31330c12b8ee 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -1090,7 +1090,9 @@ struct iommu_sva {
>
> struct iommu_mm_data {
> u32 pasid;
> + struct mm_struct *mm;
> struct list_head sva_domains;
> + struct list_head mm_list_elm;
> };
>
> int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
> @@ -1571,6 +1573,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
> struct mm_struct *mm);
> void iommu_sva_unbind_device(struct iommu_sva *handle);
> u32 iommu_sva_get_pasid(struct iommu_sva *handle);
> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
> #else
> static inline struct iommu_sva *
> iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
> @@ -1595,6 +1598,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
> }
>
> static inline void mm_pasid_drop(struct mm_struct *mm) {}
> +static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
> #endif /* CONFIG_IOMMU_SVA */
>
> #ifdef CONFIG_IOMMU_IOPF
--
Regards,
Yi Liu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:53 ` Peter Zijlstra
2025-07-11 3:09 ` Baolu Lu
@ 2025-07-16 11:57 ` David Laight
2025-07-17 1:47 ` Baolu Lu
1 sibling, 1 reply; 38+ messages in thread
From: David Laight @ 2025-07-16 11:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On Thu, 10 Jul 2025 17:53:19 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jul 10, 2025 at 03:54:32PM +0200, Peter Zijlstra wrote:
>
> > > @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
> > > if (ret)
> > > goto out_free_domain;
> > > domain->users = 1;
> > > - list_add(&domain->next, &mm->iommu_mm->sva_domains);
> > >
> > > + if (list_empty(&iommu_mm->sva_domains)) {
> > > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > > + if (list_empty(&iommu_sva_mms))
> > > + static_branch_enable(&iommu_sva_present);
> > > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
> > > + }
> > > + }
> > > + list_add(&domain->next, &iommu_mm->sva_domains);
> > > out:
> > > refcount_set(&handle->users, 1);
> > > mutex_unlock(&iommu_sva_lock);
> > > @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
> > > list_del(&domain->next);
> > > iommu_domain_free(domain);
> > > }
> > > +
> > > + if (list_empty(&iommu_mm->sva_domains)) {
> > > + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
> > > + list_del(&iommu_mm->mm_list_elm);
> > > + if (list_empty(&iommu_sva_mms))
> > > + static_branch_disable(&iommu_sva_present);
> > > + }
> > > + }
> > > +
> > > mutex_unlock(&iommu_sva_lock);
> > > kfree(handle);
> > > }
> >
> > This seems an odd coding style choice; why the extra unneeded
> > indentation? That is, what's wrong with:
> >
> > if (list_empty()) {
> > guard(spinlock_irqsave)(&iommu_mms_lock);
> > list_del();
> > if (list_empty()
> > static_branch_disable();
> > }
>
> Well, for one, you can't do static_branch_{en,dis}able() from atomic
> context...
Aren't they also somewhat expensive - so you really want to use them
for configuration options which pretty much don't change.
David
>
> Was this ever tested?
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-16 6:34 ` Baolu Lu
@ 2025-07-16 12:08 ` Jason Gunthorpe
2025-07-17 1:43 ` Baolu Lu
0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 12:08 UTC (permalink / raw)
To: Baolu Lu
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On Wed, Jul 16, 2025 at 02:34:04PM +0800, Baolu Lu wrote:
> > > @@ -654,6 +656,9 @@ struct iommu_ops {
> > >
> > > int (*def_domain_type)(struct device *dev);
> > >
> > > + void (*paging_cache_invalidate)(struct iommu_device *dev,
> > > + unsigned long start, unsigned long end);
> >
> > How would you even implement this in a driver?
> >
> > You either flush the whole iommu, in which case who needs a range, or
> > the driver has to iterate over the PASID list, in which case it
> > doesn't really improve the situation.
>
> The Intel iommu driver supports flushing all SVA PASIDs with a single
> request in the invalidation queue.
How? All PASID !=0 ? The HW has no notion about a SVA PASID vs no-SVA
else. This is just flushing almost everything.
> > If this is a concern I think the better answer is to do a deferred free
> > like the mm can sometimes do where we thread the page tables onto a
> > linked list, flush the CPU cache and push it all into a work which
> > will do the iommu flush before actually freeing the memory.
>
> Is it a workable solution to use schedule_work() to queue the KVA cache
> invalidation as a work item in the system workqueue? By doing so, we
> wouldn't need the spinlock to protect the list anymore.
Maybe.
MM is also more careful to pull the invalidation out of some of the
locks, I don't know what the KVA side is like..
Jason
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-16 12:08 ` Jason Gunthorpe
@ 2025-07-17 1:43 ` Baolu Lu
2025-07-17 11:50 ` Vasant Hegde
0 siblings, 1 reply; 38+ messages in thread
From: Baolu Lu @ 2025-07-17 1:43 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
On 7/16/25 20:08, Jason Gunthorpe wrote:
> On Wed, Jul 16, 2025 at 02:34:04PM +0800, Baolu Lu wrote:
>>>> @@ -654,6 +656,9 @@ struct iommu_ops {
>>>>
>>>> int (*def_domain_type)(struct device *dev);
>>>>
>>>> + void (*paging_cache_invalidate)(struct iommu_device *dev,
>>>> + unsigned long start, unsigned long end);
>>>
>>> How would you even implement this in a driver?
>>>
>>> You either flush the whole iommu, in which case who needs a range, or
>>> the driver has to iterate over the PASID list, in which case it
>>> doesn't really improve the situation.
>>
>> The Intel iommu driver supports flushing all SVA PASIDs with a single
>> request in the invalidation queue.
>
> How? All PASID !=0 ? The HW has no notion about a SVA PASID vs no-SVA
> else. This is just flushing almost everything.
The intel iommu driver allocates a dedicated domain id for all sva
domains. It can flush all cache entries with that domain id tagged.
>
>>> If this is a concern I think the better answer is to do a deferred free
>>> like the mm can sometimes do where we thread the page tables onto a
>>> linked list, flush the CPU cache and push it all into a work which
>>> will do the iommu flush before actually freeing the memory.
>>
>> Is it a workable solution to use schedule_work() to queue the KVA cache
>> invalidation as a work item in the system workqueue? By doing so, we
>> wouldn't need the spinlock to protect the list anymore.
>
> Maybe.
>
> MM is also more careful to pull the invalidation out of some of the
> locks, I don't know what the KVA side is like..
How about something like the following? It's compiled but not tested.
struct kva_invalidation_work_data {
        struct work_struct work;
        unsigned long start;
        unsigned long end;
        bool free_on_completion;
};

static void invalidate_kva_func(struct work_struct *work)
{
        struct kva_invalidation_work_data *data =
                container_of(work, struct kva_invalidation_work_data, work);
        struct iommu_mm_data *iommu_mm;

        guard(mutex)(&iommu_sva_lock);
        list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
                mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
                                                            data->start, data->end);

        if (data->free_on_completion)
                kfree(data);
}

void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
{
        struct kva_invalidation_work_data stack_data;

        if (!static_branch_unlikely(&iommu_sva_present))
                return;

        /*
         * Since iommu_sva_mms is an unbound list, iterating it in an atomic
         * context could introduce significant latency issues.
         */
        if (in_atomic()) {
                struct kva_invalidation_work_data *data =
                        kzalloc(sizeof(*data), GFP_ATOMIC);

                if (!data)
                        return;

                data->start = start;
                data->end = end;
                INIT_WORK(&data->work, invalidate_kva_func);
                data->free_on_completion = true;
                schedule_work(&data->work);
                return;
        }

        stack_data.start = start;
        stack_data.end = end;
        /* Synchronous path: the data lives on the stack, so never kfree() it. */
        stack_data.free_on_completion = false;
        invalidate_kva_func(&stack_data.work);
}
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-16 11:57 ` David Laight
@ 2025-07-17 1:47 ` Baolu Lu
0 siblings, 0 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-17 1:47 UTC (permalink / raw)
To: David Laight, Peter Zijlstra
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On 7/16/25 19:57, David Laight wrote:
> On Thu, 10 Jul 2025 17:53:19 +0200
> Peter Zijlstra<peterz@infradead.org> wrote:
>
>> On Thu, Jul 10, 2025 at 03:54:32PM +0200, Peter Zijlstra wrote:
>>
>>>> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
>>>> if (ret)
>>>> goto out_free_domain;
>>>> domain->users = 1;
>>>> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>>>>
>>>> + if (list_empty(&iommu_mm->sva_domains)) {
>>>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>>>> + if (list_empty(&iommu_sva_mms))
>>>> + static_branch_enable(&iommu_sva_present);
>>>> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
>>>> + }
>>>> + }
>>>> + list_add(&domain->next, &iommu_mm->sva_domains);
>>>> out:
>>>> refcount_set(&handle->users, 1);
>>>> mutex_unlock(&iommu_sva_lock);
>>>> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
>>>> list_del(&domain->next);
>>>> iommu_domain_free(domain);
>>>> }
>>>> +
>>>> + if (list_empty(&iommu_mm->sva_domains)) {
>>>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>>>> + list_del(&iommu_mm->mm_list_elm);
>>>> + if (list_empty(&iommu_sva_mms))
>>>> + static_branch_disable(&iommu_sva_present);
>>>> + }
>>>> + }
>>>> +
>>>> mutex_unlock(&iommu_sva_lock);
>>>> kfree(handle);
>>>> }
>>> This seems an odd coding style choice; why the extra unneeded
>>> indentation? That is, what's wrong with:
>>>
>>> if (list_empty()) {
>>> guard(spinlock_irqsave)(&iommu_mms_lock);
>>> list_del();
>>> if (list_empty()
>>> static_branch_disable();
>>> }
>> Well, for one, you can't do static_branch_{en,dis}able() from atomic
>> context...
> Aren't they also somewhat expensive - so you really want to use them
> for configuration options which pretty much don't change.
Yeah! Fair enough.
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-16 10:54 ` Yi Liu
@ 2025-07-17 1:51 ` Baolu Lu
0 siblings, 0 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-17 1:51 UTC (permalink / raw)
To: Yi Liu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai
Cc: iommu, security, linux-kernel, stable
On 7/16/25 18:54, Yi Liu wrote:
> On 2025/7/9 14:28, Lu Baolu wrote:
>> The vmalloc() and vfree() functions manage virtually contiguous, but not
>> necessarily physically contiguous, kernel memory regions. When vfree()
>> unmaps such a region, it tears down the associated kernel page table
>> entries and frees the physical pages.
>>
>> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
>> shares and walks the CPU's page tables. Architectures like x86 share
>> static kernel address mappings across all user page tables, allowing the
>> IOMMU to access the kernel portion of these tables.
>
> I remember Jason once clarified that there is no support for kernel SVA. I
> don't think Linux has such support so far. If so, maybe just drop the
> static mapping wording. This can be an attack surface mainly because the
> page table may include both user and kernel mappings.
Yes. Kernel SVA has already been removed from the tree.
>> Modern IOMMUs often cache page table entries to optimize walk
>> performance,
>> even for intermediate page table levels. If kernel page table mappings
>> are
>> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale
>> entries, Use-After-Free (UAF) vulnerability condition arises. If these
>> freed page table pages are reallocated for a different purpose,
>> potentially
>> by an attacker, the IOMMU could misinterpret the new data as valid page
>> table entries. This allows the IOMMU to walk into attacker-controlled
>> memory, leading to arbitrary physical memory DMA access or privilege
>> escalation.
>
> Does this fix cover the page compaction and de-compaction as well? It is
It should.
> valuable to call out that the mm subsystem does not notify the iommu of page table
> modifications except for those related to user VA, hence SVA is
> at risk of using stale intermediate caches due to this.
>
>> To mitigate this, introduce a new iommu interface to flush IOMMU caches
>> and fence pending page table walks when kernel page mappings are updated.
>> This interface should be invoked from architecture-specific code that
>> manages combined user and kernel page tables.
>
> aha, this is what I'm trying to find. Using page tables with both kernel
> and user mappings is the prerequisite for this bug. :)
Yes.
>
>> Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
>> Cc: stable@vger.kernel.org
>> Reported-by: Jann Horn <jannh@google.com>
>> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
>> Tested-by: Yi Lai <yi1.lai@intel.com>
>> ---
>> arch/x86/mm/tlb.c | 2 ++
>> drivers/iommu/iommu-sva.c | 34 +++++++++++++++++++++++++++++++++-
>> include/linux/iommu.h | 4 ++++
>> 3 files changed, 39 insertions(+), 1 deletion(-)
>>
>> Change log:
>> v2:
>> - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
>> - Replace the mutex with a spinlock to make the interface usable in the
>> critical regions.
>>
>> v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-
>> baolu.lu@linux.intel.com/
>>
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 39f80111e6f1..a41499dfdc3f 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -12,6 +12,7 @@
>> #include <linux/task_work.h>
>> #include <linux/mmu_notifier.h>
>> #include <linux/mmu_context.h>
>> +#include <linux/iommu.h>
>> #include <asm/tlbflush.h>
>> #include <asm/mmu_context.h>
>> @@ -1540,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start,
>> unsigned long end)
>> kernel_tlb_flush_range(info);
>> put_flush_tlb_info();
>> + iommu_sva_invalidate_kva_range(start, end);
>> }
>> /*
>> diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
>> index 1a51cfd82808..fd76aefa5a88 100644
>> --- a/drivers/iommu/iommu-sva.c
>> +++ b/drivers/iommu/iommu-sva.c
>> @@ -10,6 +10,9 @@
>> #include "iommu-priv.h"
>> static DEFINE_MUTEX(iommu_sva_lock);
>> +static DEFINE_STATIC_KEY_FALSE(iommu_sva_present);
>> +static LIST_HEAD(iommu_sva_mms);
>> +static DEFINE_SPINLOCK(iommu_mms_lock);
>> static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
>> struct mm_struct *mm);
>> @@ -42,6 +45,7 @@ static struct iommu_mm_data
>> *iommu_alloc_mm_data(struct mm_struct *mm, struct de
>> return ERR_PTR(-ENOSPC);
>> }
>> iommu_mm->pasid = pasid;
>> + iommu_mm->mm = mm;
>> INIT_LIST_HEAD(&iommu_mm->sva_domains);
>> /*
>> * Make sure the write to mm->iommu_mm is not reordered in front of
>> @@ -132,8 +136,15 @@ struct iommu_sva *iommu_sva_bind_device(struct
>> device *dev, struct mm_struct *mm
>> if (ret)
>> goto out_free_domain;
>> domain->users = 1;
>> - list_add(&domain->next, &mm->iommu_mm->sva_domains);
>> + if (list_empty(&iommu_mm->sva_domains)) {
>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>> + if (list_empty(&iommu_sva_mms))
>> + static_branch_enable(&iommu_sva_present);
>> + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
>> + }
>> + }
>> + list_add(&domain->next, &iommu_mm->sva_domains);
>> out:
>> refcount_set(&handle->users, 1);
>> mutex_unlock(&iommu_sva_lock);
>> @@ -175,6 +186,15 @@ void iommu_sva_unbind_device(struct iommu_sva
>> *handle)
>> list_del(&domain->next);
>> iommu_domain_free(domain);
>> }
>> +
>> + if (list_empty(&iommu_mm->sva_domains)) {
>> + scoped_guard(spinlock_irqsave, &iommu_mms_lock) {
>> + list_del(&iommu_mm->mm_list_elm);
>> + if (list_empty(&iommu_sva_mms))
>> + static_branch_disable(&iommu_sva_present);
>> + }
>> + }
>> +
>> mutex_unlock(&iommu_sva_lock);
>> kfree(handle);
>> }
>> @@ -312,3 +332,15 @@ static struct iommu_domain
>> *iommu_sva_domain_alloc(struct device *dev,
>> return domain;
>> }
>> +
>> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned
>> long end)
>> +{
>> + struct iommu_mm_data *iommu_mm;
>> +
>> + if (!static_branch_unlikely(&iommu_sva_present))
>> + return;
>> +
>> + guard(spinlock_irqsave)(&iommu_mms_lock);
>> + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
>> + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
>> start, end);
>
> is it possible the TLB flush side calls this API per mm?
Nope.
>
>> +}
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 156732807994..31330c12b8ee 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -1090,7 +1090,9 @@ struct iommu_sva {
>> struct iommu_mm_data {
>> u32 pasid;
>> + struct mm_struct *mm;
>> struct list_head sva_domains;
>> + struct list_head mm_list_elm;
>> };
>> int iommu_fwspec_init(struct device *dev, struct fwnode_handle
>> *iommu_fwnode);
>> @@ -1571,6 +1573,7 @@ struct iommu_sva *iommu_sva_bind_device(struct
>> device *dev,
>> struct mm_struct *mm);
>> void iommu_sva_unbind_device(struct iommu_sva *handle);
>> u32 iommu_sva_get_pasid(struct iommu_sva *handle);
>> +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned
>> long end);
>> #else
>> static inline struct iommu_sva *
>> iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
>> @@ -1595,6 +1598,7 @@ static inline u32 mm_get_enqcmd_pasid(struct
>> mm_struct *mm)
>> }
>> static inline void mm_pasid_drop(struct mm_struct *mm) {}
>> +static inline void iommu_sva_invalidate_kva_range(unsigned long
>> start, unsigned long end) {}
>> #endif /* CONFIG_IOMMU_SVA */
>> #ifdef CONFIG_IOMMU_IOPF
>
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-17 1:43 ` Baolu Lu
@ 2025-07-17 11:50 ` Vasant Hegde
0 siblings, 0 replies; 38+ messages in thread
From: Vasant Hegde @ 2025-07-17 11:50 UTC (permalink / raw)
To: Baolu Lu, Jason Gunthorpe
Cc: Peter Zijlstra, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Dave Hansen, Alistair Popple,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Tested-by : Yi Lai, iommu, security, linux-kernel, stable
Hi Lu, Jason,
On 7/17/2025 7:13 AM, Baolu Lu wrote:
> On 7/16/25 20:08, Jason Gunthorpe wrote:
>> On Wed, Jul 16, 2025 at 02:34:04PM +0800, Baolu Lu wrote:
>>>>> @@ -654,6 +656,9 @@ struct iommu_ops {
>>>>>
>>>>> int (*def_domain_type)(struct device *dev);
>>>>>
>>>>> + void (*paging_cache_invalidate)(struct iommu_device *dev,
>>>>> + unsigned long start, unsigned long end);
>>>>
>>>> How would you even implement this in a driver?
>>>>
>>>> You either flush the whole iommu, in which case who needs a range, or
>>>> the driver has to iterate over the PASID list, in which case it
>>>> doesn't really improve the situation.
>>>
>>> The Intel iommu driver supports flushing all SVA PASIDs with a single
>>> request in the invalidation queue.
>>
>> How? All PASID !=0 ? The HW has no notion about a SVA PASID vs no-SVA
>> else. This is just flushing almost everything.
>
> The intel iommu driver allocates a dedicated domain id for all sva
> domains. It can flush all cache entries with that domain id tagged.
AMD IOMMU has INVALIDATE_IOMMU_ALL, which flushes everything in the IOMMU TLB.
This is a heavy hammer, but it should be OK as a short-term solution?
I don't think this command is supported inside the guest (I will double check).
But we don't have HW-vIOMMU support yet, so PASID inside the guest is not yet supported.
-Vasant
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-11 8:17 ` Yu Zhang
@ 2025-07-24 3:01 ` Baolu Lu
2025-07-28 17:36 ` Yu Zhang
0 siblings, 1 reply; 38+ messages in thread
From: Baolu Lu @ 2025-07-24 3:01 UTC (permalink / raw)
To: Yu Zhang, Dave Hansen
Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Tested-by : Yi Lai, iommu, security,
linux-kernel, stable
On 7/11/25 16:17, Yu Zhang wrote:
> On Thu, Jul 10, 2025 at 08:26:06AM -0700, Dave Hansen wrote:
>> On 7/10/25 06:22, Jason Gunthorpe wrote:
>>>> Why does this matter? We flush the CPU TLB in a bunch of different ways,
>>>> _especially_ when it's being done for kernel mappings. For example,
>>>> __flush_tlb_all() is a non-ranged kernel flush which has a completely
>>>> parallel implementation with flush_tlb_kernel_range(). Call sites that
>>>> use_it_ are unaffected by the patch here.
>>>>
>>>> Basically, if we're only worried about vmalloc/vfree freeing page
>>>> tables, then this patch is OK. If the problem is bigger than that, then
>>>> we need a more comprehensive patch.
>>> I think we are worried about any place that frees page tables.
>> The two places that come to mind are the remove_memory() code and
>> __change_page_attr().
>>
>> The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
>> that do flush_tlb_all(). Now that I'm looking at it, there look to be
>> some races between freeing page tables pages and flushing the TLB. But,
>> basically, if you stick to the sites in there that do flush_tlb_all()
>> after free_pagetable(), you should be good.
>>
>> As for the __change_page_attr() code, I think the only spot you need to
>> hit is cpa_collapse_large_pages() and maybe the one in
>> __split_large_page() as well.
>>
>> This is all disturbingly ad-hoc, though. The remove_memory() code needs
>> fixing and I'll probably go try to bring some order to the chaos in the
>> process of fixing it up. But that's a separate problem than this IOMMU fun.
>>
> Could we consider splitting flush_tlb_kernel_range() into 2 different
> versions:
> - one which only flushes the CPU TLB
> - one which flushes the CPU paging structure cache and then notifies the
> IOMMU to do the same (e.g., in pud_free_pmd_page()/pmd_free_pte_page())?
From the perspective of an IOMMU, there is no need to split. IOMMU SVA
only allows the device to access user-space memory with user
permission. Access to kernel address space with privileged permission
is not allowed. Therefore, the IOMMU subsystem only needs a callback to
invalidate the paging structure cache.
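Put as code, the consequence is just this shape, assuming a helper that
only drops cached intermediate entries (the helper name is hypothetical):

/*
 * Sketch only. SVA devices may only use user-permission leaf entries,
 * so a freed kernel page-table page never backs a translation the
 * device is allowed to use; dropping the cached intermediate
 * (paging-structure) entries is sufficient.
 */
void example_invalidate_paging_structure_cache(unsigned long start,
					       unsigned long end);

static void example_kva_pgtables_freed(unsigned long start,
				       unsigned long end)
{
	/* No device-usable leaf mapping exists in [start, end), so no */
	/* IOTLB (leaf) flush is needed, only the structure caches:    */
	example_invalidate_paging_structure_cache(start, end);
}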
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-10 15:26 ` Dave Hansen
` (2 preceding siblings ...)
2025-07-11 8:17 ` Yu Zhang
@ 2025-07-24 3:06 ` Baolu Lu
3 siblings, 0 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-24 3:06 UTC (permalink / raw)
To: Dave Hansen, Jason Gunthorpe
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai, iommu,
security, linux-kernel, stable
On 7/10/25 23:26, Dave Hansen wrote:
> On 7/10/25 06:22, Jason Gunthorpe wrote:
>>> Why does this matter? We flush the CPU TLB in a bunch of different ways,
>>> _especially_ when it's being done for kernel mappings. For example,
>>> __flush_tlb_all() is a non-ranged kernel flush which has a completely
>>> parallel implementation with flush_tlb_kernel_range(). Call sites that
>>> use _it_ are unaffected by the patch here.
>>>
>>> Basically, if we're only worried about vmalloc/vfree freeing page
>>> tables, then this patch is OK. If the problem is bigger than that, then
>>> we need a more comprehensive patch.
>> I think we are worried about any place that frees page tables.
>
> The two places that come to mind are the remove_memory() code and
> __change_page_attr().
>
> The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
> that do flush_tlb_all(). Now that I'm looking at it, there look to be
> some races between freeing page tables pages and flushing the TLB. But,
> basically, if you stick to the sites in there that do flush_tlb_all()
> after free_pagetable(), you should be good.
>
> As for the __change_page_attr() code, I think the only spot you need to
> hit is cpa_collapse_large_pages() and maybe the one in
> __split_large_page() as well.
Thank you for the guidance. It appears that all the paths mentioned above
eventually call flush_tlb_all() after changing the page tables. So can I
simply put a call site in flush_tlb_all()? Something like this:
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a41499dfdc3f..3b85e7d3ba44 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1479,6 +1479,8 @@ void flush_tlb_all(void)
 	else
 		/* Fall back to the IPI-based invalidation. */
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
+
+	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
 }
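(A note on that call: on x86, TLB_FLUSH_ALL is the all-ones sentinel
(-1UL), so the (0, TLB_FLUSH_ALL) range above already spans the whole
address space. If an explicitly named wrapper were preferred, a sketch
could look like this; the wrapper name is invented here:)

#include <linux/iommu.h>
#include <linux/limits.h>

/* Hypothetical convenience wrapper; not an existing kernel helper. */
static inline void example_iommu_sva_invalidate_all_kva(void)
{
	/* Equivalent to the (0, TLB_FLUSH_ALL) call in the hunk above. */
	iommu_sva_invalidate_kva_range(0, ULONG_MAX);
}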
>
> This is all disturbingly ad-hoc, though. The remove_memory() code needs
> fixing and I'll probably go try to bring some order to the chaos in the
> process of fixing it up. But that's a separate problem than this IOMMU fun.
Yes. Please.
Thanks,
baolu
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-24 3:01 ` Baolu Lu
@ 2025-07-28 17:36 ` Yu Zhang
2025-07-29 2:08 ` Baolu Lu
0 siblings, 1 reply; 38+ messages in thread
From: Yu Zhang @ 2025-07-28 17:36 UTC (permalink / raw)
To: Baolu Lu
Cc: Dave Hansen, Jason Gunthorpe, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai, iommu,
security, linux-kernel, stable
On Thu, Jul 24, 2025 at 11:01:12AM +0800, Baolu Lu wrote:
> On 7/11/25 16:17, Yu Zhang wrote:
> > On Thu, Jul 10, 2025 at 08:26:06AM -0700, Dave Hansen wrote:
> > > On 7/10/25 06:22, Jason Gunthorpe wrote:
> > > > > Why does this matter? We flush the CPU TLB in a bunch of different ways,
> > > > > _especially_ when it's being done for kernel mappings. For example,
> > > > > __flush_tlb_all() is a non-ranged kernel flush which has a completely
> > > > > parallel implementation with flush_tlb_kernel_range(). Call sites that
> > > > > use _it_ are unaffected by the patch here.
> > > > >
> > > > > Basically, if we're only worried about vmalloc/vfree freeing page
> > > > > tables, then this patch is OK. If the problem is bigger than that, then
> > > > > we need a more comprehensive patch.
> > > > I think we are worried about any place that frees page tables.
> > > The two places that come to mind are the remove_memory() code and
> > > __change_page_attr().
> > >
> > > The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
> > > that do flush_tlb_all(). Now that I'm looking at it, there look to be
> > > some races between freeing page tables pages and flushing the TLB. But,
> > > basically, if you stick to the sites in there that do flush_tlb_all()
> > > after free_pagetable(), you should be good.
> > >
> > > As for the __change_page_attr() code, I think the only spot you need to
> > > hit is cpa_collapse_large_pages() and maybe the one in
> > > __split_large_page() as well.
> > >
> > > This is all disturbingly ad-hoc, though. The remove_memory() code needs
> > > fixing and I'll probably go try to bring some order to the chaos in the
> > > process of fixing it up. But that's a separate problem than this IOMMU fun.
> > >
> > Could we consider splitting flush_tlb_kernel_range() into 2 different
> > versions:
> > - one which only flushes the CPU TLB
> > - one which flushes the CPU paging structure cache and then notifies the
> > IOMMU to do the same (e.g., in pud_free_pmd_page()/pmd_free_pte_page())?
>
> From the perspective of an IOMMU, there is no need to split. IOMMU SVA
> only allows the device to access user-space memory with user
> permission. Access to kernel address space with privileged permission
> is not allowed. Therefore, the IOMMU subsystem only needs a callback to
> invalidate the paging structure cache.
Thanks Baolu.
Indeed. That's why I was wondering if we could split flush_tlb_kernel_range()
into 2 versions - one used only after a kernel virtual address range is
unmapped, and another one used after a kernel paging structure is freed.
Only the 2nd one needs to notify the IOMMU subsystem.
B.R.
Yu
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush
2025-07-28 17:36 ` Yu Zhang
@ 2025-07-29 2:08 ` Baolu Lu
0 siblings, 0 replies; 38+ messages in thread
From: Baolu Lu @ 2025-07-29 2:08 UTC (permalink / raw)
To: Yu Zhang
Cc: Dave Hansen, Jason Gunthorpe, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Tested-by : Yi Lai, iommu,
security, linux-kernel, stable
On 7/29/25 01:36, Yu Zhang wrote:
> On Thu, Jul 24, 2025 at 11:01:12AM +0800, Baolu Lu wrote:
>> On 7/11/25 16:17, Yu Zhang wrote:
>>> On Thu, Jul 10, 2025 at 08:26:06AM -0700, Dave Hansen wrote:
>>>> On 7/10/25 06:22, Jason Gunthorpe wrote:
>>>>>> Why does this matter? We flush the CPU TLB in a bunch of different ways,
>>>>>> _especially_ when it's being done for kernel mappings. For example,
>>>>>> __flush_tlb_all() is a non-ranged kernel flush which has a completely
>>>>>> parallel implementation with flush_tlb_kernel_range(). Call sites that
>>>>>> use _it_ are unaffected by the patch here.
>>>>>>
>>>>>> Basically, if we're only worried about vmalloc/vfree freeing page
>>>>>> tables, then this patch is OK. If the problem is bigger than that, then
>>>>>> we need a more comprehensive patch.
>>>>> I think we are worried about any place that frees page tables.
>>>> The two places that come to mind are the remove_memory() code and
>>>> __change_page_attr().
>>>>
>>>> The remove_memory() gunk is in arch/x86/mm/init_64.c. It has a few sites
>>>> that do flush_tlb_all(). Now that I'm looking at it, there look to be
>>>> some races between freeing page tables pages and flushing the TLB. But,
>>>> basically, if you stick to the sites in there that do flush_tlb_all()
>>>> after free_pagetable(), you should be good.
>>>>
>>>> As for the __change_page_attr() code, I think the only spot you need to
>>>> hit is cpa_collapse_large_pages() and maybe the one in
>>>> __split_large_page() as well.
>>>>
>>>> This is all disturbingly ad-hoc, though. The remove_memory() code needs
>>>> fixing and I'll probably go try to bring some order to the chaos in the
>>>> process of fixing it up. But that's a separate problem than this IOMMU fun.
>>>>
>>> Could we consider splitting flush_tlb_kernel_range() into 2 different
>>> versions:
>>> - one which only flushes the CPU TLB
>>> - one which flushes the CPU paging structure cache and then notifies the
>>> IOMMU to do the same (e.g., in pud_free_pmd_page()/pmd_free_pte_page())?
>> From the perspective of an IOMMU, there is no need to split. IOMMU SVA
>> only allows the device to access user-space memory with user
>> permission. Access to kernel address space with privileged permission
>> is not allowed. Therefore, the IOMMU subsystem only needs a callback to
>> invalidate the paging structure cache.
> Thanks Baolu.
>
> Indeed. That's why I was wondering if we could split flush_tlb_kernel_range()
> into 2 versions - one used only after a kernel virtual address range is
> unmapped, and another one used after a kernel paging structure is freed.
> Only the 2nd one needs to notify the IOMMU subsystem.
Yeah! That sounds better.
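For illustration, the split might end up with roughly this shape.
flush_tlb_kernel_range() is the existing interface and
iommu_sva_invalidate_kva_range() is the one added by this patch; the
second helper's name is invented for this sketch:

#include <linux/iommu.h>
#include <asm/tlbflush.h>

/*
 * Variant 1 (exists today): leaf kernel PTEs changed, no page-table
 * page freed. Only the CPU TLB needs flushing:
 *
 *	flush_tlb_kernel_range(start, end);
 *
 * Variant 2 (hypothetical name): a kernel page-table page was freed,
 * e.g. via pud_free_pmd_page()/pmd_free_pte_page(). Flush the CPU
 * side and also fence the IOMMU's cached intermediate entries.
 */
static inline void example_flush_tlb_kernel_pgtable_range(unsigned long start,
							   unsigned long end)
{
	flush_tlb_kernel_range(start, end);
	iommu_sva_invalidate_kva_range(start, end);
}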
Thanks,
baolu
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2025-07-29 2:11 UTC | newest]
Thread overview: 38+ messages
2025-07-09 6:28 [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu
2025-07-09 15:29 ` Dave Hansen
2025-07-10 2:14 ` Baolu Lu
2025-07-10 2:55 ` Tian, Kevin
2025-07-10 12:53 ` Dave Hansen
2025-07-10 13:22 ` Jason Gunthorpe
2025-07-10 15:26 ` Dave Hansen
2025-07-11 2:46 ` Tian, Kevin
2025-07-11 2:54 ` Tian, Kevin
2025-07-11 8:17 ` Yu Zhang
2025-07-24 3:01 ` Baolu Lu
2025-07-28 17:36 ` Yu Zhang
2025-07-29 2:08 ` Baolu Lu
2025-07-24 3:06 ` Baolu Lu
2025-07-11 2:49 ` Tian, Kevin
2025-07-10 3:02 ` Tian, Kevin
2025-07-10 8:11 ` Yu Zhang
2025-07-10 8:15 ` Tian, Kevin
2025-07-10 9:37 ` Yu Zhang
2025-07-10 13:54 ` Peter Zijlstra
2025-07-10 15:53 ` Peter Zijlstra
2025-07-11 3:09 ` Baolu Lu
2025-07-11 8:27 ` Peter Zijlstra
2025-07-16 11:57 ` David Laight
2025-07-17 1:47 ` Baolu Lu
2025-07-11 3:00 ` Baolu Lu
2025-07-11 4:01 ` Tian, Kevin
2025-07-11 8:32 ` Peter Zijlstra
2025-07-11 11:58 ` Jason Gunthorpe
2025-07-15 5:55 ` Baolu Lu
2025-07-15 12:25 ` Jason Gunthorpe
2025-07-16 6:34 ` Baolu Lu
2025-07-16 12:08 ` Jason Gunthorpe
2025-07-17 1:43 ` Baolu Lu
2025-07-17 11:50 ` Vasant Hegde
2025-07-11 11:54 ` Jason Gunthorpe
2025-07-16 10:54 ` Yi Liu
2025-07-17 1:51 ` Baolu Lu