* [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush @ 2025-08-06 5:25 Lu Baolu 2025-08-06 15:03 ` Dave Hansen 2025-08-14 4:48 ` Ethan Zhao 0 siblings, 2 replies; 51+ messages in thread From: Lu Baolu @ 2025-08-06 5:25 UTC (permalink / raw) To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, Lu Baolu, stable In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence. Modern IOMMUs often cache page table entries of the intermediate-level page table as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the iommu driver is notified only for changes of user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()), but the IOMMU's internal caches retain stale entries, Use-After-Free (UAF) vulnerability condition arises. If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker- controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed. Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") Cc: stable@vger.kernel.org Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> --- arch/x86/mm/tlb.c | 4 +++ drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- include/linux/iommu.h | 4 +++ 3 files changed, 67 insertions(+), 1 deletion(-) Change log: v3: - iommu_sva_mms is an unbound list; iterating it in an atomic context could introduce significant latency issues. Schedule it in a kernel thread and replace the spinlock with a mutex. - Replace the static key with a normal bool; it can be brought back if data shows the benefit. - Invalidate KVA range in the flush_tlb_all() paths. - All previous reviewed-bys are preserved. Please let me know if there are any objections. v2: - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/ - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); - Replace the mutex with a spinlock to make the interface usable in the critical regions. 
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 39f80111e6f1..3b85e7d3ba44 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -12,6 +12,7 @@ #include <linux/task_work.h> #include <linux/mmu_notifier.h> #include <linux/mmu_context.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) else /* Fall back to the IPI-based invalidation. */ on_each_cpu(do_flush_tlb_all, NULL, 1); + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); } /* Flush an arbitrarily large range of memory with INVLPGB. */ @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) kernel_tlb_flush_range(info); put_flush_tlb_info(); + iommu_sva_invalidate_kva_range(start, end); } /* diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c index 1a51cfd82808..d0da2b3fd64b 100644 --- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -10,6 +10,8 @@ #include "iommu-priv.h" static DEFINE_MUTEX(iommu_sva_lock); +static bool iommu_sva_present; +static LIST_HEAD(iommu_sva_mms); static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm); @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de return ERR_PTR(-ENOSPC); } iommu_mm->pasid = pasid; + iommu_mm->mm = mm; INIT_LIST_HEAD(&iommu_mm->sva_domains); /* * Make sure the write to mm->iommu_mm is not reordered in front of @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm if (ret) goto out_free_domain; domain->users = 1; - list_add(&domain->next, &mm->iommu_mm->sva_domains); + if (list_empty(&iommu_mm->sva_domains)) { + if (list_empty(&iommu_sva_mms)) + WRITE_ONCE(iommu_sva_present, true); + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); + } + list_add(&domain->next, &iommu_mm->sva_domains); out: refcount_set(&handle->users, 1); mutex_unlock(&iommu_sva_lock); @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) list_del(&domain->next); iommu_domain_free(domain); } + + if (list_empty(&iommu_mm->sva_domains)) { + list_del(&iommu_mm->mm_list_elm); + if (list_empty(&iommu_sva_mms)) + WRITE_ONCE(iommu_sva_present, false); + } + mutex_unlock(&iommu_sva_lock); kfree(handle); } @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, return domain; } + +struct kva_invalidation_work_data { + struct work_struct work; + unsigned long start; + unsigned long end; +}; + +static void invalidate_kva_func(struct work_struct *work) +{ + struct kva_invalidation_work_data *data = + container_of(work, struct kva_invalidation_work_data, work); + struct iommu_mm_data *iommu_mm; + + guard(mutex)(&iommu_sva_lock); + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, + data->start, data->end); + + kfree(data); +} + +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) +{ + struct kva_invalidation_work_data *data; + + if (likely(!READ_ONCE(iommu_sva_present))) + return; + + /* will be freed in the task function */ + data = kzalloc(sizeof(*data), GFP_ATOMIC); + if (!data) + return; + + data->start = start; + data->end = end; + INIT_WORK(&data->work, invalidate_kva_func); + + /* + * Since iommu_sva_mms is an unbound list, iterating it in an atomic + * context could introduce significant latency issues. 
+ */ + schedule_work(&data->work); +} diff --git a/include/linux/iommu.h b/include/linux/iommu.h index c30d12e16473..66e4abb2df0d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -1134,7 +1134,9 @@ struct iommu_sva { struct iommu_mm_data { u32 pasid; + struct mm_struct *mm; struct list_head sva_domains; + struct list_head mm_list_elm; }; int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode); @@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm); void iommu_sva_unbind_device(struct iommu_sva *handle); u32 iommu_sva_get_pasid(struct iommu_sva *handle); +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end); #else static inline struct iommu_sva * iommu_sva_bind_device(struct device *dev, struct mm_struct *mm) @@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm) } static inline void mm_pasid_drop(struct mm_struct *mm) {} +static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {} #endif /* CONFIG_IOMMU_SVA */ #ifdef CONFIG_IOMMU_IOPF -- 2.43.0 ^ permalink raw reply related [flat|nested] 51+ messages in thread
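For context (not part of the patch): mmu_notifier_arch_invalidate_secondary_tlbs(), which invalidate_kva_func() above calls for every mm on the SVA list, invokes the arch_invalidate_secondary_tlbs() callback of each mmu_notifier registered against that mm. A sketch of the driver-side receiver is below; the names are illustrative and not taken from any in-tree IOMMU driver:

#include <linux/mmu_notifier.h>

static void example_sva_invalidate_secondary_tlbs(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	/*
	 * Flush the IOTLB and paging-structure caches for the PASID(s)
	 * bound to this mm, covering [start, end). With this patch the
	 * range may now also be a kernel VA range (or 0 .. TLB_FLUSH_ALL).
	 */
}

static const struct mmu_notifier_ops example_sva_mmu_notifier_ops = {
	.arch_invalidate_secondary_tlbs = example_sva_invalidate_secondary_tlbs,
};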
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 5:25 [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu @ 2025-08-06 15:03 ` Dave Hansen 2025-08-06 15:52 ` Jason Gunthorpe 2025-08-07 6:53 ` Baolu Lu 2025-08-14 4:48 ` Ethan Zhao 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-06 15:03 UTC (permalink / raw) To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/5/25 22:25, Lu Baolu wrote: > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware > shares and walks the CPU's page tables. The Linux x86 architecture maps > the kernel address space into the upper portion of every process’s page > table. Consequently, in an SVA context, the IOMMU hardware can walk and > cache kernel space mappings. However, the Linux kernel currently lacks > a notification mechanism for kernel space mapping changes. This means > the IOMMU driver is not aware of such changes, leading to a break in > IOMMU cache coherence. FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries". I also think this over states the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused. > Modern IOMMUs often cache page table entries of the intermediate-level > page table as long as the entry is valid, no matter the permissions, to > optimize walk performance. Currently the iommu driver is notified only > for changes of user VA mappings, so the IOMMU's internal caches may > retain stale entries for kernel VA. When kernel page table mappings are > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale > entries, Use-After-Free (UAF) vulnerability condition arises. > > If these freed page table pages are reallocated for a different purpose, > potentially by an attacker, the IOMMU could misinterpret the new data as > valid page table entries. This allows the IOMMU to walk into attacker- > controlled memory, leading to arbitrary physical memory DMA access or > privilege escalation. Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too. > To mitigate this, introduce a new iommu interface to flush IOMMU caches. > This interface should be invoked from architecture-specific code that > manages combined user and kernel page tables, whenever a kernel page table > update is done and the CPU TLB needs to be flushed. There's one tidbit missing from this: Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, IOMMU still walk kernel-only page tables all the way down to the leaf where they realize that the entry is a kernel mapping and error out. 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 39f80111e6f1..3b85e7d3ba44 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -12,6 +12,7 @@ > #include <linux/task_work.h> > #include <linux/mmu_notifier.h> > #include <linux/mmu_context.h> > +#include <linux/iommu.h> > > #include <asm/tlbflush.h> > #include <asm/mmu_context.h> > @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) > else > /* Fall back to the IPI-based invalidation. */ > on_each_cpu(do_flush_tlb_all, NULL, 1); > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > } > > /* Flush an arbitrarily large range of memory with INVLPGB. */ > @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) > kernel_tlb_flush_range(info); > > put_flush_tlb_info(); > + iommu_sva_invalidate_kva_range(start, end); > } These desperately need to be commented. They especially need comments that they are solely for flushing the IOMMU mid-level paging structures after freeing a page table page. > /* > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > index 1a51cfd82808..d0da2b3fd64b 100644 > --- a/drivers/iommu/iommu-sva.c > +++ b/drivers/iommu/iommu-sva.c > @@ -10,6 +10,8 @@ > #include "iommu-priv.h" > > static DEFINE_MUTEX(iommu_sva_lock); > +static bool iommu_sva_present; > +static LIST_HEAD(iommu_sva_mms); > static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > struct mm_struct *mm); > > @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de > return ERR_PTR(-ENOSPC); > } > iommu_mm->pasid = pasid; > + iommu_mm->mm = mm; > INIT_LIST_HEAD(&iommu_mm->sva_domains); > /* > * Make sure the write to mm->iommu_mm is not reordered in front of > @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm > if (ret) > goto out_free_domain; > domain->users = 1; > - list_add(&domain->next, &mm->iommu_mm->sva_domains); > > + if (list_empty(&iommu_mm->sva_domains)) { > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, true); > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); > + } > + list_add(&domain->next, &iommu_mm->sva_domains); > out: > refcount_set(&handle->users, 1); > mutex_unlock(&iommu_sva_lock); > @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) > list_del(&domain->next); > iommu_domain_free(domain); > } > + > + if (list_empty(&iommu_mm->sva_domains)) { > + list_del(&iommu_mm->mm_list_elm); > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, false); > + } > + > mutex_unlock(&iommu_sva_lock); > kfree(handle); > } > @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > > return domain; > } > + > +struct kva_invalidation_work_data { > + struct work_struct work; > + unsigned long start; > + unsigned long end; > +}; > + > +static void invalidate_kva_func(struct work_struct *work) > +{ > + struct kva_invalidation_work_data *data = > + container_of(work, struct kva_invalidation_work_data, work); > + struct iommu_mm_data *iommu_mm; > + > + guard(mutex)(&iommu_sva_lock); > + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) > + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, > + data->start, data->end); > + > + kfree(data); > +} > + > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) > +{ > + struct kva_invalidation_work_data *data; > + > + if (likely(!READ_ONCE(iommu_sva_present))) > + return; It would be nice to hear a few more 
words about why this is safe without a lock. > + /* will be freed in the task function */ > + data = kzalloc(sizeof(*data), GFP_ATOMIC); > + if (!data) > + return; > + > + data->start = start; > + data->end = end; > + INIT_WORK(&data->work, invalidate_kva_func); > + > + /* > + * Since iommu_sva_mms is an unbound list, iterating it in an atomic > + * context could introduce significant latency issues. > + */ > + schedule_work(&data->work); > +} Hold on a sec, though. the problematic caller of this looks something like this (logically): pmd_free_pte_page() { pte = pmd_page_vaddr(*pmd); pmd_clear(pmd); flush_tlb_kernel_range(...); // does schedule_work() pte_free_kernel(pte); } It _immediately_ frees the PTE page. The schedule_work() work will probably happen sometime after the page is freed. Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free. ^ permalink raw reply [flat|nested] 51+ messages in thread
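To make the window concrete, the same flow can be annotated as follows (illustrative only; simplified from the real x86 pmd_free_pte_page()):

pmd_free_pte_page()
{
	pte = pmd_page_vaddr(*pmd);
	pmd_clear(pmd);                 /* CPU view of the PTE page is gone */
	flush_tlb_kernel_range(...);    /* with this patch: only *queues* the
	                                   IOTLB flush via schedule_work() */
	pte_free_kernel(pte);           /* page freed immediately; the IOMMU may
	                                   still hold a paging-structure cache
	                                   entry pointing at it until the queued
	                                   work actually runs */
}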
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:03 ` Dave Hansen @ 2025-08-06 15:52 ` Jason Gunthorpe 2025-08-06 16:04 ` Dave Hansen 2025-08-07 6:53 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 15:52 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 08:03:42AM -0700, Dave Hansen wrote: > Hold on a sec, though. the problematic caller of this looks something > like this (logically): > > pmd_free_pte_page() > { > pte = pmd_page_vaddr(*pmd); > pmd_clear(pmd); > flush_tlb_kernel_range(...); // does schedule_work() > pte_free_kernel(pte); > } > > It _immediately_ frees the PTE page. The schedule_work() work will > probably happen sometime after the page is freed. > > Isn't that still a use-after-free? It's for some arbitrary amount of > time and better than before but it's still a use-after-free. Yes it is. You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
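For context, the "normal mm user flow" referenced here is the mmu_gather batching used when user page tables are torn down: page-table pages are queued on the gather structure and only freed after the TLB flush (and the secondary-TLB notification) has gone out. Very roughly, and heavily simplified:

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	...
	pte_free_tlb(&tlb, pte_page, addr);	/* queue the page-table page,
						   do not free it yet */
	...
	tlb_finish_mmu(&tlb);			/* flush TLBs first, then free
						   everything that was queued */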
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:52 ` Jason Gunthorpe @ 2025-08-06 16:04 ` Dave Hansen 2025-08-06 16:09 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-06 16:04 UTC (permalink / raw) To: Jason Gunthorpe Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/6/25 08:52, Jason Gunthorpe wrote: >> Isn't that still a use-after-free? It's for some arbitrary amount of >> time and better than before but it's still a use-after-free. > Yes it is. > > You can't do this approach without also pushing the pages to freed on > a list and defering the free till the work. This is broadly what the > normal mm user flow is doing.. FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page. Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:04 ` Dave Hansen @ 2025-08-06 16:09 ` Jason Gunthorpe 2025-08-06 16:34 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 16:09 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 09:04:29AM -0700, Dave Hansen wrote: > On 8/6/25 08:52, Jason Gunthorpe wrote: > >> Isn't that still a use-after-free? It's for some arbitrary amount of > >> time and better than before but it's still a use-after-free. > > Yes it is. > > > > You can't do this approach without also pushing the pages to freed on > > a list and defering the free till the work. This is broadly what the > > normal mm user flow is doing.. > > FWIW, I think the simplest way to do this is to plop an unconditional > schedule_work() in pte_free_kernel(). The work function will invalidate > the IOTLBs and then free the page. > > Keep the schedule_work() unconditional to keep it simple. The > schedule_work() is way cheaper than all the system-wide TLB invalidation > IPIs that have to get sent as well. No need to add complexity to > optimize out something that's in the noise already. That works also, but now you have to allocate memory or you are dead.. Is it OK these days, and safe in this code which seems a little bit linked to memory management? The MM side avoided this by putting the list and the rcu_head in the struct page. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:09 ` Jason Gunthorpe @ 2025-08-06 16:34 ` Dave Hansen 2025-08-06 16:42 ` Jason Gunthorpe 2025-08-07 14:40 ` Baolu Lu 0 siblings, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-06 16:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/6/25 09:09, Jason Gunthorpe wrote: >>> >>> You can't do this approach without also pushing the pages to freed on >>> a list and defering the free till the work. This is broadly what the >>> normal mm user flow is doing.. >> FWIW, I think the simplest way to do this is to plop an unconditional >> schedule_work() in pte_free_kernel(). The work function will invalidate >> the IOTLBs and then free the page. >> >> Keep the schedule_work() unconditional to keep it simple. The >> schedule_work() is way cheaper than all the system-wide TLB invalidation >> IPIs that have to get sent as well. No need to add complexity to >> optimize out something that's in the noise already. > That works also, but now you have to allocate memory or you are > dead.. Is it OK these days, and safe in this code which seems a little > bit linked to memory management? > > The MM side avoided this by putting the list and the rcu_head in the > struct page. I don't think you need to allocate memory. A little static structure that uses the page->list and has a lock should do. Logically something like this: struct kernel_pgtable_work { struct list_head list; spinlock_t lock; struct work_struct work; } kernel_pte_work; pte_free_kernel() { struct page *page = ptdesc_magic(); guard(spinlock)(&kernel_pte_work.lock); list_add(&page->list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); } work_func() { iommu_sva_invalidate_kva(); guard(spinlock)(&kernel_pte_work.lock); list_for_each_safe() { page = container_of(...); free_whatever(page); } } The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86. So, either go over to the 'struct page' earlier (maybe by open-coding pagetable_dtor_free()?), or just use the ptdesc. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:34 ` Dave Hansen @ 2025-08-06 16:42 ` Jason Gunthorpe 2025-08-07 14:40 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 16:42 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 09:34:12AM -0700, Dave Hansen wrote: > struct kernel_pgtable_work > { > struct list_head list; > spinlock_t lock; > struct work_struct work; > } kernel_pte_work; > > pte_free_kernel() > { > struct page *page = ptdesc_magic(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_add(&page->list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } Oh, OK, yeah this can work > The only wrinkle is that pte_free_kernel() itself still has a pte and > 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which > should be unused at this point, especially for non-pgd pages on x86. It should all be ptdesc, so lets avoid a struct page here and use the pt_list instead.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:34 ` Dave Hansen 2025-08-06 16:42 ` Jason Gunthorpe @ 2025-08-07 14:40 ` Baolu Lu 2025-08-07 15:31 ` Dave Hansen 2025-08-07 19:51 ` Jason Gunthorpe 1 sibling, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-07 14:40 UTC (permalink / raw) To: Dave Hansen, Jason Gunthorpe Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/2025 12:34 AM, Dave Hansen wrote: > On 8/6/25 09:09, Jason Gunthorpe wrote: >>>> >>>> You can't do this approach without also pushing the pages to freed on >>>> a list and defering the free till the work. This is broadly what the >>>> normal mm user flow is doing.. >>> FWIW, I think the simplest way to do this is to plop an unconditional >>> schedule_work() in pte_free_kernel(). The work function will invalidate >>> the IOTLBs and then free the page. >>> >>> Keep the schedule_work() unconditional to keep it simple. The >>> schedule_work() is way cheaper than all the system-wide TLB invalidation >>> IPIs that have to get sent as well. No need to add complexity to >>> optimize out something that's in the noise already. >> That works also, but now you have to allocate memory or you are >> dead.. Is it OK these days, and safe in this code which seems a little >> bit linked to memory management? >> >> The MM side avoided this by putting the list and the rcu_head in the >> struct page. > > I don't think you need to allocate memory. A little static structure > that uses the page->list and has a lock should do. Logically something > like this: > > struct kernel_pgtable_work > { > struct list_head list; > spinlock_t lock; > struct work_struct work; > } kernel_pte_work; > > pte_free_kernel() > { > struct page *page = ptdesc_magic(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_add(&page->list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } > > work_func() > { > iommu_sva_invalidate_kva(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_for_each_safe() { > page = container_of(...); > free_whatever(page); > } > } > > The only wrinkle is that pte_free_kernel() itself still has a pte and > 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which > should be unused at this point, especially for non-pgd pages on x86. > > So, either go over to the 'struct page' earlier (maybe by open-coding > pagetable_dtor_free()?), or just use the ptdesc. I refactored the code above as follows. It compiles but hasn't been tested yet. Does it look good to you? 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index c88691b15f3c..d9307dd09f67 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -10,9 +10,11 @@ #define __HAVE_ARCH_PTE_ALLOC_ONE #define __HAVE_ARCH_PGD_FREE +#define __HAVE_ARCH_PTE_FREE_KERNEL #include <asm-generic/pgalloc.h> static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; } +void pte_free_kernel(struct mm_struct *mm, pte_t *pte); #ifdef CONFIG_PARAVIRT_XXL #include <asm/paravirt.h> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..f9f6738dd3cc 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -2,6 +2,7 @@ #include <linux/mm.h> #include <linux/gfp.h> #include <linux/hugetlb.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h> #include <asm/fixmap.h> @@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud) /* See note in arch_check_zapped_pte() */ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); } + +static void kernel_pte_work_func(struct work_struct *work); + +static struct { + struct list_head list; + spinlock_t lock; + struct work_struct work; +} kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +}; + +static void kernel_pte_work_func(struct work_struct *work) +{ + struct page *page, *next; + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); + + guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { + list_del_init(&page->lru); + pagetable_dtor_free(page_ptdesc(page)); + } +} + +/** + * pte_free_kernel - free PTE-level kernel page table memory + * @mm: the mm_struct of the current context + * @pte: pointer to the memory containing the page table + */ +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte); + + guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif /** * __pte_alloc_one - allocate memory for a PTE-level user page table -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 14:40 ` Baolu Lu @ 2025-08-07 15:31 ` Dave Hansen 2025-08-08 5:15 ` Baolu Lu 2025-08-07 19:51 ` Jason Gunthorpe 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-07 15:31 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/25 07:40, Baolu Lu wrote: ... > I refactored the code above as follows. It compiles but hasn't been > tested yet. Does it look good to you? As in, it takes the non-compiling gunk I spewed into my email client and makes it compile, yeah. Sure. ;) > diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/ > pgalloc.h > index c88691b15f3c..d9307dd09f67 100644 > --- a/arch/x86/include/asm/pgalloc.h > +++ b/arch/x86/include/asm/pgalloc.h > @@ -10,9 +10,11 @@ > > #define __HAVE_ARCH_PTE_ALLOC_ONE > #define __HAVE_ARCH_PGD_FREE > +#define __HAVE_ARCH_PTE_FREE_KERNEL But I think it really muddies the waters down here. This kinda reads like "x86 has its own per-arch pte_free_kernel() that it always needs". Which is far from accurate. > @@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct > *vma, pud_t pud) > /* See note in arch_check_zapped_pte() */ > VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); > } > + > +static void kernel_pte_work_func(struct work_struct *work); > + > +static struct { > + struct list_head list; > + spinlock_t lock; > + struct work_struct work; > +} kernel_pte_work = { > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > kernel_pte_work_func), > +}; > + > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct page *page, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > + list_del_init(&page->lru); > + pagetable_dtor_free(page_ptdesc(page)); > + } > +} > + > +/** > + * pte_free_kernel - free PTE-level kernel page table memory > + * @mm: the mm_struct of the current context > + * @pte: pointer to the memory containing the page table > + */ The kerneldoc here is just wasted bytes, IMNHO. Why not use those bytes to actually explain what the heck is going on here? > +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > +{ > + struct page *page = virt_to_page(pte); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_add(&page->lru, &kernel_pte_work.list); > + schedule_work(&kernel_pte_work.work); > +} > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 3c8ec3bfea44..716ebab67636 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -46,6 +46,7 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > #define pte_alloc_one_kernel(...) 
> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > #endif > > +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL > /** > * pte_free_kernel - free PTE-level kernel page table memory > * @mm: the mm_struct of the current context > @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct > *mm, pte_t *pte) > { > pagetable_dtor_free(virt_to_ptdesc(pte)); > } > +#endif > > /** > * __pte_alloc_one - allocate memory for a PTE-level user page table I'd much rather the arch-generic code looked like this: #ifdef CONFIG_ASYNC_PGTABLE_FREE // code and struct here, or dump them over in some // other file and do this in a header #else static void pte_free_kernel_async(struct page *page) {} #endif void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct page *page = virt_to_page(pte); if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { pte_free_kernel_async(page); else pagetable_dtor_free(page_ptdesc(page)); } Then in Kconfig, you end up with something like: config ASYNC_PGTABLE_FREE def_bool y depends on INTEL_IOMMU_WHATEVER That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It _also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one. Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 15:31 ` Dave Hansen @ 2025-08-08 5:15 ` Baolu Lu 2025-08-10 7:19 ` Ethan Zhao 2025-08-11 12:57 ` Jason Gunthorpe 0 siblings, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-08 5:15 UTC (permalink / raw) To: Dave Hansen, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/25 23:31, Dave Hansen wrote: >> +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >> +{ >> + struct page *page = virt_to_page(pte); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_add(&page->lru, &kernel_pte_work.list); >> + schedule_work(&kernel_pte_work.work); >> +} >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index 3c8ec3bfea44..716ebab67636 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -46,6 +46,7 @@ static inline pte_t >> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >> #define pte_alloc_one_kernel(...) >> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) >> #endif >> >> +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL >> /** >> * pte_free_kernel - free PTE-level kernel page table memory >> * @mm: the mm_struct of the current context >> @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct >> *mm, pte_t *pte) >> { >> pagetable_dtor_free(virt_to_ptdesc(pte)); >> } >> +#endif >> >> /** >> * __pte_alloc_one - allocate memory for a PTE-level user page table > I'd much rather the arch-generic code looked like this: > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > // code and struct here, or dump them over in some > // other file and do this in a header > #else > static void pte_free_kernel_async(struct page *page) {} > #endif > > void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > struct page *page = virt_to_page(pte); > > if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { > pte_free_kernel_async(page); > else > pagetable_dtor_free(page_ptdesc(page)); > } > > Then in Kconfig, you end up with something like: > > config ASYNC_PGTABLE_FREE > def_bool y > depends on INTEL_IOMMU_WHATEVER > > That very much tells much more of the whole story in code. It also gives > the x86 folks that compile out the IOMMU the exact same code as the > arch-generic folks. It_also_ makes it dirt simple and obvious for the > x86 folks to optimize out the async behavior if they don't like it in > the future by replacing the compile-time IOMMU check with a runtime one. > > Also, if another crazy IOMMU implementation comes along that happens to > do what the x86 IOMMUs do, then they have a single Kconfig switch to > flip. If they follow what this patch tries to do, they'll start by > copying and pasting the x86 implementation. I'll do it like this. Does that look good to you? 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 70d29b14d851..6f1113e024fa 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -160,6 +160,7 @@ config IOMMU_DMA # Shared Virtual Addressing config IOMMU_SVA select IOMMU_MM_DATA + select ASYNC_PGTABLE_FREE if X86 bool config IOMMU_IOPF diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..dbddacdca2ce 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,19 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +struct pgtable_free_work { + struct list_head list; + spinlock_t lock; + struct work_struct work; +}; +extern struct pgtable_free_work kernel_pte_work; + +void pte_free_kernel_async(struct ptdesc *ptdesc); +#else +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +66,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + pte_free_kernel_async(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..528550cfa7fe 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool "Asynchronous kernel page table freeing" + help + Perform kernel page table freeing asynchronously. This is required + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB + paging structure caches. + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..6639ee6641d4 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -13,6 +13,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/mm_inline.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h> @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pte_work_func(struct work_struct *work); +struct pgtable_free_work kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +}; + +static void kernel_pte_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next; + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); + + guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { + list_del_init(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } +} + +void pte_free_kernel_async(struct ptdesc *ptdesc) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_add(&ptdesc->pt_list, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} +#endif -- 2.43.0 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 5:15 ` Baolu Lu @ 2025-08-10 7:19 ` Ethan Zhao 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:57 ` Jason Gunthorpe 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-10 7:19 UTC (permalink / raw) To: Baolu Lu, Dave Hansen, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/8/2025 1:15 PM, Baolu Lu wrote: > On 8/7/25 23:31, Dave Hansen wrote: >>> +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >>> +{ >>> + struct page *page = virt_to_page(pte); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_add(&page->lru, &kernel_pte_work.list); >>> + schedule_work(&kernel_pte_work.work); >>> +} >>> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/ >>> pgalloc.h >>> index 3c8ec3bfea44..716ebab67636 100644 >>> --- a/include/asm-generic/pgalloc.h >>> +++ b/include/asm-generic/pgalloc.h >>> @@ -46,6 +46,7 @@ static inline pte_t >>> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >>> #define pte_alloc_one_kernel(...) >>> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) >>> #endif >>> >>> +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL >>> /** >>> * pte_free_kernel - free PTE-level kernel page table memory >>> * @mm: the mm_struct of the current context >>> @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct >>> *mm, pte_t *pte) >>> { >>> pagetable_dtor_free(virt_to_ptdesc(pte)); >>> } >>> +#endif >>> >>> /** >>> * __pte_alloc_one - allocate memory for a PTE-level user page table >> I'd much rather the arch-generic code looked like this: >> >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> // code and struct here, or dump them over in some >> // other file and do this in a header >> #else >> static void pte_free_kernel_async(struct page *page) {} >> #endif >> >> void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >> { >> struct page *page = virt_to_page(pte); >> >> if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { >> pte_free_kernel_async(page); >> else >> pagetable_dtor_free(page_ptdesc(page)); >> } >> >> Then in Kconfig, you end up with something like: >> >> config ASYNC_PGTABLE_FREE >> def_bool y >> depends on INTEL_IOMMU_WHATEVER >> >> That very much tells much more of the whole story in code. It also gives >> the x86 folks that compile out the IOMMU the exact same code as the >> arch-generic folks. It_also_ makes it dirt simple and obvious for the >> x86 folks to optimize out the async behavior if they don't like it in >> the future by replacing the compile-time IOMMU check with a runtime one. >> >> Also, if another crazy IOMMU implementation comes along that happens to >> do what the x86 IOMMUs do, then they have a single Kconfig switch to >> flip. If they follow what this patch tries to do, they'll start by >> copying and pasting the x86 implementation. > > I'll do it like this. Does that look good to you? 
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index 70d29b14d851..6f1113e024fa 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -160,6 +160,7 @@ config IOMMU_DMA > # Shared Virtual Addressing > config IOMMU_SVA > select IOMMU_MM_DATA > + select ASYNC_PGTABLE_FREE if X86 > bool > > config IOMMU_IOPF > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 3c8ec3bfea44..dbddacdca2ce 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -46,6 +46,19 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > #define pte_alloc_one_kernel(...) > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > #endif > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > +struct pgtable_free_work { > + struct list_head list; > + spinlock_t lock; > + struct work_struct work; > +}; > +extern struct pgtable_free_work kernel_pte_work; > + > +void pte_free_kernel_async(struct ptdesc *ptdesc); > +#else > +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > +#endif > + > /** > * pte_free_kernel - free PTE-level kernel page table memory > * @mm: the mm_struct of the current context > @@ -53,7 +66,12 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > */ > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > - pagetable_dtor_free(virt_to_ptdesc(pte)); > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > + > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + pte_free_kernel_async(ptdesc); > + else > + pagetable_dtor_free(ptdesc); > } > > /** > diff --git a/mm/Kconfig b/mm/Kconfig > index e443fe8cd6cf..528550cfa7fe 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA > config IOMMU_MM_DATA > bool > > +config ASYNC_PGTABLE_FREE > + bool "Asynchronous kernel page table freeing" > + help > + Perform kernel page table freeing asynchronously. This is required > + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB > + paging structure caches. 
> + > config EXECMEM > bool > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index 567e2d084071..6639ee6641d4 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -13,6 +13,7 @@ > #include <linux/swap.h> > #include <linux/swapops.h> > #include <linux/mm_inline.h> > +#include <linux/iommu.h> > #include <asm/pgalloc.h> > #include <asm/tlb.h> > > @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, > pmd_t *pmd, > pte_unmap_unlock(pte, ptl); > goto again; > } > + > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > +static void kernel_pte_work_func(struct work_struct *work); > +struct pgtable_free_work kernel_pte_work = { > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > kernel_pte_work_func), > +}; > + > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, > pt_list) { > + list_del_init(&ptdesc->pt_list); > + pagetable_dtor_free(ptdesc); > + } > +} > + > +void pte_free_kernel_async(struct ptdesc *ptdesc) > +{ > + guard(spinlock)(&kernel_pte_work.lock); > + list_add(&ptdesc->pt_list, &kernel_pte_work.list); > + schedule_work(&kernel_pte_work.work); > +} kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately :) Thanks, Ethan > +#endif ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-10 7:19 ` Ethan Zhao @ 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:55 ` Jason Gunthorpe 2025-08-11 13:55 ` Dave Hansen 0 siblings, 2 replies; 51+ messages in thread From: Uladzislau Rezki @ 2025-08-11 9:15 UTC (permalink / raw) To: Ethan Zhao, Baolu Lu Cc: Baolu Lu, Dave Hansen, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Sun, Aug 10, 2025 at 03:19:58PM +0800, Ethan Zhao wrote: > > > On 8/8/2025 1:15 PM, Baolu Lu wrote: > > On 8/7/25 23:31, Dave Hansen wrote: > > > > +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > > +{ > > > > + struct page *page = virt_to_page(pte); > > > > + > > > > + guard(spinlock)(&kernel_pte_work.lock); > > > > + list_add(&page->lru, &kernel_pte_work.list); > > > > + schedule_work(&kernel_pte_work.work); > > > > +} > > > > diff --git a/include/asm-generic/pgalloc.h > > > > b/include/asm-generic/ pgalloc.h > > > > index 3c8ec3bfea44..716ebab67636 100644 > > > > --- a/include/asm-generic/pgalloc.h > > > > +++ b/include/asm-generic/pgalloc.h > > > > @@ -46,6 +46,7 @@ static inline pte_t > > > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > > > #define pte_alloc_one_kernel(...) > > > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > > > > #endif > > > > > > > > +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL > > > > /** > > > > * pte_free_kernel - free PTE-level kernel page table memory > > > > * @mm: the mm_struct of the current context > > > > @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct > > > > *mm, pte_t *pte) > > > > { > > > > pagetable_dtor_free(virt_to_ptdesc(pte)); > > > > } > > > > +#endif > > > > > > > > /** > > > > * __pte_alloc_one - allocate memory for a PTE-level user page table > > > I'd much rather the arch-generic code looked like this: > > > > > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > > > // code and struct here, or dump them over in some > > > // other file and do this in a header > > > #else > > > static void pte_free_kernel_async(struct page *page) {} > > > #endif > > > > > > void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > { > > > struct page *page = virt_to_page(pte); > > > > > > if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { > > > pte_free_kernel_async(page); > > > else > > > pagetable_dtor_free(page_ptdesc(page)); > > > } > > > > > > Then in Kconfig, you end up with something like: > > > > > > config ASYNC_PGTABLE_FREE > > > def_bool y > > > depends on INTEL_IOMMU_WHATEVER > > > > > > That very much tells much more of the whole story in code. It also gives > > > the x86 folks that compile out the IOMMU the exact same code as the > > > arch-generic folks. It_also_ makes it dirt simple and obvious for the > > > x86 folks to optimize out the async behavior if they don't like it in > > > the future by replacing the compile-time IOMMU check with a runtime one. > > > > > > Also, if another crazy IOMMU implementation comes along that happens to > > > do what the x86 IOMMUs do, then they have a single Kconfig switch to > > > flip. If they follow what this patch tries to do, they'll start by > > > copying and pasting the x86 implementation. > > > > I'll do it like this. Does that look good to you? 
> > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > > index 70d29b14d851..6f1113e024fa 100644 > > --- a/drivers/iommu/Kconfig > > +++ b/drivers/iommu/Kconfig > > @@ -160,6 +160,7 @@ config IOMMU_DMA > > # Shared Virtual Addressing > > config IOMMU_SVA > > select IOMMU_MM_DATA > > + select ASYNC_PGTABLE_FREE if X86 > > bool > > > > config IOMMU_IOPF > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > > index 3c8ec3bfea44..dbddacdca2ce 100644 > > --- a/include/asm-generic/pgalloc.h > > +++ b/include/asm-generic/pgalloc.h > > @@ -46,6 +46,19 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > #define pte_alloc_one_kernel(...) > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > > #endif > > > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +struct pgtable_free_work { > > + struct list_head list; > > + spinlock_t lock; > > + struct work_struct work; > > +}; > > +extern struct pgtable_free_work kernel_pte_work; > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc); > > +#else > > +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > > +#endif > > + > > /** > > * pte_free_kernel - free PTE-level kernel page table memory > > * @mm: the mm_struct of the current context > > @@ -53,7 +66,12 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > */ > > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > { > > - pagetable_dtor_free(virt_to_ptdesc(pte)); > > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > > + > > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > > + pte_free_kernel_async(ptdesc); > > + else > > + pagetable_dtor_free(ptdesc); > > } > > > > /** > > diff --git a/mm/Kconfig b/mm/Kconfig > > index e443fe8cd6cf..528550cfa7fe 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA > > config IOMMU_MM_DATA > > bool > > > > +config ASYNC_PGTABLE_FREE > > + bool "Asynchronous kernel page table freeing" > > + help > > + Perform kernel page table freeing asynchronously. This is required > > + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB > > + paging structure caches. 
> > + > > config EXECMEM > > bool > > > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > index 567e2d084071..6639ee6641d4 100644 > > --- a/mm/pgtable-generic.c > > +++ b/mm/pgtable-generic.c > > @@ -13,6 +13,7 @@ > > #include <linux/swap.h> > > #include <linux/swapops.h> > > #include <linux/mm_inline.h> > > +#include <linux/iommu.h> > > #include <asm/pgalloc.h> > > #include <asm/tlb.h> > > > > @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, > > pmd_t *pmd, > > pte_unmap_unlock(pte, ptl); > > goto again; > > } > > + > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +static void kernel_pte_work_func(struct work_struct *work); > > +struct pgtable_free_work kernel_pte_work = { > > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > > kernel_pte_work_func), > > +}; > > + > > +static void kernel_pte_work_func(struct work_struct *work) > > +{ > > + struct ptdesc *ptdesc, *next; > > + > > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > + > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, > > pt_list) { > > + list_del_init(&ptdesc->pt_list); > > + pagetable_dtor_free(ptdesc); > > + } > > +} > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc) > > +{ > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_add(&ptdesc->pt_list, &kernel_pte_work.list); > > + schedule_work(&kernel_pte_work.work); > > +} > kernel_pte_work.list is global shared var, it would make the producer > pte_free_kernel() and the consumer kernel_pte_work_func() to operate in > serialized timing. In a large system, I don't think you design this > deliberately :) > Sorry for jumping. Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks. As for reference you can have a look at how vfree_atomic() handles deferred freeing. Thanks! -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 9:15 ` Uladzislau Rezki @ 2025-08-11 12:55 ` Jason Gunthorpe 2025-08-15 9:23 ` Baolu Lu 2025-08-11 13:55 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-11 12:55 UTC (permalink / raw) To: Uladzislau Rezki Cc: Ethan Zhao, Baolu Lu, Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote: > Agree, unless it is never considered as a hot path or something that can > be really contented. It looks like you can use just a per-cpu llist to drain > thinks. I think we already largely agreed this was not a true fast path, I don't think we need a per-cpu here. But I would not make it extern unless there is a user? Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:55 ` Jason Gunthorpe @ 2025-08-15 9:23 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 9:23 UTC (permalink / raw) To: Jason Gunthorpe, Uladzislau Rezki Cc: baolu.lu, Ethan Zhao, Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 8:55 PM, Jason Gunthorpe wrote: > On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote: > >> Agree, unless it is never considered as a hot path or something that can >> be really contented. It looks like you can use just a per-cpu llist to drain >> thinks. > I think we already largely agreed this was not a true fast path, I > don't think we need a per-cpu here. > > But I would not make it extern unless there is a user? Fixed. Thanks! Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:55 ` Jason Gunthorpe @ 2025-08-11 13:55 ` Dave Hansen 2025-08-11 14:56 ` Uladzislau Rezki 2025-08-12 1:17 ` Ethan Zhao 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-11 13:55 UTC (permalink / raw) To: Uladzislau Rezki, Ethan Zhao, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 02:15, Uladzislau Rezki wrote: >> kernel_pte_work.list is global shared var, it would make the producer >> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in >> serialized timing. In a large system, I don't think you design this >> deliberately 🙂 >> > Sorry for jumping. > > Agree, unless it is never considered as a hot path or something that can > be really contented. It looks like you can use just a per-cpu llist to drain > thinks. Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush. If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU. So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it. But this is _not_ the place to add complexity to get scalability. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 13:55 ` Dave Hansen @ 2025-08-11 14:56 ` Uladzislau Rezki 2025-08-12 1:17 ` Ethan Zhao 1 sibling, 0 replies; 51+ messages in thread From: Uladzislau Rezki @ 2025-08-11 14:56 UTC (permalink / raw) To: Dave Hansen Cc: Uladzislau Rezki, Ethan Zhao, Baolu Lu, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Mon, Aug 11, 2025 at 06:55:52AM -0700, Dave Hansen wrote: > On 8/11/25 02:15, Uladzislau Rezki wrote: > >> kernel_pte_work.list is global shared var, it would make the producer > >> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in > >> serialized timing. In a large system, I don't think you design this > >> deliberately 🙂 > >> > > Sorry for jumping. > > > > Agree, unless it is never considered as a hot path or something that can > > be really contented. It looks like you can use just a per-cpu llist to drain > > thinks. > > Remember, the code that has to run just before all this sent an IPI to > every single CPU on the system to have them do a (on x86 at least) > pretty expensive TLB flush. > > If this is a hot path, we have bigger problems on our hands: the full > TLB flush on every CPU. > > So, sure, there are a million ways to make this deferred freeing more > scalable. But the code that's here is dirt simple and self contained. If > someone has some ideas for something that's simpler and more scalable, > then I'm totally open to it. > You could also have a look toward removing the &kernel_pte_work.lock. Replace it by llist_add() on adding side and llist_for_each_safe(n, t, llist_del_all(&list)) on removing side. So you do not need guard(spinlock) stuff. If i do not miss anything. > > But this is _not_ the place to add complexity to get scalability. > OK. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 51+ messages in thread
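For illustration, a minimal sketch of the llist variant suggested above might look like the following (untested, and only a sketch: the kernel_pte_work structure from the patch is replaced by a bare llist_head plus work_struct, and the just-freed table page itself carries the llist node, the same trick vfree_atomic() uses for the vmalloc area being freed; if scribbling the node into the dead table before the IOMMU flush is not acceptable, the node would have to live somewhere in struct ptdesc instead):

/* needs <linux/llist.h> */
static LLIST_HEAD(kernel_pte_free_list);

static void kernel_pte_work_func(struct work_struct *work)
{
	struct llist_node *node, *next;

	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

	/* Atomically take over the whole pending list; no lock needed. */
	llist_for_each_safe(node, next, llist_del_all(&kernel_pte_free_list))
		pagetable_dtor_free(virt_to_ptdesc(node));
}
static DECLARE_WORK(kernel_pte_work, kernel_pte_work_func);

void pte_free_kernel_async(struct ptdesc *ptdesc)
{
	/* The table is already disconnected; reuse its memory for the node. */
	struct llist_node *node = ptdesc_address(ptdesc);

	/* llist_add() returns true when the list was previously empty. */
	if (llist_add(node, &kernel_pte_free_list))
		schedule_work(&kernel_pte_work);
}

A per-CPU llist plus per-CPU work item, as vfree_atomic() does, would be the next step if the single list ever showed contention; given the flush_tlb_all() cost pointed out earlier in the thread, that is probably not needed.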
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 13:55 ` Dave Hansen 2025-08-11 14:56 ` Uladzislau Rezki @ 2025-08-12 1:17 ` Ethan Zhao 2025-08-15 14:35 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-12 1:17 UTC (permalink / raw) To: Dave Hansen, Uladzislau Rezki, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 9:55 PM, Dave Hansen wrote: > On 8/11/25 02:15, Uladzislau Rezki wrote: >>> kernel_pte_work.list is global shared var, it would make the producer >>> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in >>> serialized timing. In a large system, I don't think you design this >>> deliberately 🙂 >>> >> Sorry for jumping. >> >> Agree, unless it is never considered as a hot path or something that can >> be really contented. It looks like you can use just a per-cpu llist to drain >> thinks. > > Remember, the code that has to run just before all this sent an IPI to > every single CPU on the system to have them do a (on x86 at least) > pretty expensive TLB flush. > It can be easily identified as a bottleneck by multi-CPU stress testing programs involving frequent process creation and destruction, similar to the operation of a heavily loaded multi-process Apache web server. Hot/cold path ? > If this is a hot path, we have bigger problems on our hands: the full > TLB flush on every CPU. Perhaps not "WE", IPI driven TLB flush seems not the shared mechanism of all CPUs, at least not for ARM as far as I know. > > So, sure, there are a million ways to make this deferred freeing more > scalable. But the code that's here is dirt simple and self contained. If > someone has some ideas for something that's simpler and more scalable, > then I'm totally open to it. > > But this is _not_ the place to add complexity to get scalability. At least, please dont add bottleneck, how complex to do that ? Thanks, Ethan ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-12 1:17 ` Ethan Zhao @ 2025-08-15 14:35 ` Dave Hansen 0 siblings, 0 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-15 14:35 UTC (permalink / raw) To: Ethan Zhao, Uladzislau Rezki, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 18:17, Ethan Zhao wrote: >> But this is _not_ the place to add complexity to get scalability. > At least, please dont add bottleneck, how complex to do that ? Very good question! If you're really interested and concerned about this, I'd encourage you to demonstrate where the contention becomes a problem in practice, then post a patch to fix it. If it's simple, I'll happily ack it! ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 5:15 ` Baolu Lu 2025-08-10 7:19 ` Ethan Zhao @ 2025-08-11 12:57 ` Jason Gunthorpe 2025-08-13 3:17 ` Ethan Zhao 2025-08-18 1:34 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-11 12:57 UTC (permalink / raw) To: Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { > + list_del_init(&ptdesc->pt_list); > + pagetable_dtor_free(ptdesc); > + } Do a list_move from kernel_pte_work.list to an on-stack list head and then immediately release the lock. No reason to hold the spinock while doing frees, also no reason to do list_del_init, that memory probably gets zerod in pagetable_dtor_free() Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
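As a sketch, the shape being suggested here, with the lock held only for the handover to an on-stack list head, could look like this (names taken from the patch quoted above; list_splice_tail_init() is used for the whole-list handover that "list_move to an on-stack list head" describes, and the later revisions posted in this thread follow essentially this pattern):

static void kernel_pte_work_func(struct work_struct *work)
{
	struct ptdesc *ptdesc, *next;
	LIST_HEAD(free_list);

	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

	/* Detach everything that is pending, then drop the lock. */
	spin_lock(&kernel_pte_work.lock);
	list_splice_tail_init(&kernel_pte_work.list, &free_list);
	spin_unlock(&kernel_pte_work.lock);

	/*
	 * Free outside the lock. No list_del needed: the entries are not
	 * looked at again once the ptdesc has been freed.
	 */
	list_for_each_entry_safe(ptdesc, next, &free_list, pt_list)
		pagetable_dtor_free(ptdesc);
}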
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:57 ` Jason Gunthorpe @ 2025-08-13 3:17 ` Ethan Zhao 0 siblings, 0 replies; 51+ messages in thread From: Ethan Zhao @ 2025-08-13 3:17 UTC (permalink / raw) To: Jason Gunthorpe, Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 8:57 PM, Jason Gunthorpe wrote: > On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct ptdesc *ptdesc, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { >> + list_del_init(&ptdesc->pt_list); >> + pagetable_dtor_free(ptdesc); >> + } > > Do a list_move from kernel_pte_work.list to an on-stack list head and > then immediately release the lock. No reason to hold the spinock while > doing frees, also no reason to do list_del_init, that memory probably > gets zerod in pagetable_dtor_free() Yep, using guard(spinlock)() for scope-bound lock management sacrifices fine-grained control over the protected area. It offers convenience at the cost of precision. Out of my bias, calling it sluggard(spinlock)() might be proper. Thanks, Ethan > > Jason > ^ permalink raw reply [flat|nested] 51+ messages in thread
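If the scope-based style is still wanted, the handover in the sketch above could equally be written with scoped_guard() from <linux/cleanup.h>, which keeps the automatic unlock but limits the critical section to a single statement:

	scoped_guard(spinlock, &kernel_pte_work.lock)
		list_splice_tail_init(&kernel_pte_work.list, &free_list);

so fine-grained control and guard-style cleanup are not mutually exclusive.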
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:57 ` Jason Gunthorpe 2025-08-13 3:17 ` Ethan Zhao @ 2025-08-18 1:34 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 1:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 20:57, Jason Gunthorpe wrote: > On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct ptdesc *ptdesc, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { >> + list_del_init(&ptdesc->pt_list); >> + pagetable_dtor_free(ptdesc); >> + } > Do a list_move from kernel_pte_work.list to an on-stack list head and > then immediately release the lock. No reason to hold the spinock while > doing frees, also no reason to do list_del_init, that memory probably > gets zerod in pagetable_dtor_free() Done. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 14:40 ` Baolu Lu 2025-08-07 15:31 ` Dave Hansen @ 2025-08-07 19:51 ` Jason Gunthorpe 2025-08-08 2:57 ` Tian, Kevin 2025-08-08 5:08 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-07 19:51 UTC (permalink / raw) To: Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct page *page, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > + list_del_init(&page->lru); Please don't add new usages of lru, we are trying to get rid of this. :( I think the memory should be struct ptdesc, use that.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 19:51 ` Jason Gunthorpe @ 2025-08-08 2:57 ` Tian, Kevin 2025-08-15 9:16 ` Baolu Lu 2025-08-18 6:21 ` Baolu Lu 2025-08-08 5:08 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Tian, Kevin @ 2025-08-08 2:57 UTC (permalink / raw) To: Jason Gunthorpe, Baolu Lu Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, August 8, 2025 3:52 AM > > On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > > +static void kernel_pte_work_func(struct work_struct *work) > > +{ > > + struct page *page, *next; > > + > > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > + > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > > + list_del_init(&page->lru); > > Please don't add new usages of lru, we are trying to get rid of this. :( > > I think the memory should be struct ptdesc, use that.. > btw with this change we should also defer free of the pmd page: pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } } free_page((unsigned long)pmd_sv); Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled. another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there? ^ permalink raw reply [flat|nested] 51+ messages in thread
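Concretely, the pud_free_pmd_page() part of this could presumably be routed through the same deferred path as the pte tables, something along these lines (a sketch against the draft above, replacing the synchronous pmd_free() call; the pmd_sv scratch page is not a page table and can keep being freed directly):

	/* in pud_free_pmd_page(), instead of pmd_free(&init_mm, pmd): */
	pte_free_kernel_async(virt_to_ptdesc(pmd));

since the generic pmd_free() is little more than pagetable_dtor_free() on the same ptdesc, which is what the deferred worker ends up calling anyway.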
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 2:57 ` Tian, Kevin @ 2025-08-15 9:16 ` Baolu Lu 2025-08-15 9:46 ` Tian, Kevin 2025-08-15 14:31 ` Dave Hansen 2025-08-18 6:21 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 9:16 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/8/2025 10:57 AM, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, August 8, 2025 3:52 AM >> >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>> +static void kernel_pte_work_func(struct work_struct *work) >>> +{ >>> + struct page *page, *next; >>> + >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>> + list_del_init(&page->lru); >> >> Please don't add new usages of lru, we are trying to get rid of this. :( >> >> I think the memory should be struct ptdesc, use that.. >> > > btw with this change we should also defer free of the pmd page: > > pud_free_pmd_page() > ... > for (i = 0; i < PTRS_PER_PMD; i++) { > if (!pmd_none(pmd_sv[i])) { > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > pte_free_kernel(&init_mm, pte); > } > } > > free_page((unsigned long)pmd_sv); > > Otherwise the risk still exists if the pmd page is repurposed before the > pte work is scheduled. You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window. If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause. > > another observation - pte_free_kernel is not used in remove_pagetable () > and __change_page_attr(). Is it straightforward to put it in those paths > or do we need duplicate some deferring logic there? The remove_pagetable() function is called in the path where memory is hot-removed from the system, right? If so, there should be no issue, as the threat model here is a page table page being freed and repurposed while it's still cached in the IOTLB. In the hot-remove case, the memory is removed and will not be reused, so that's fine as far as I can see. The same to __change_page_attr(), which only changes the attributes of a page table entry while the underlying page remains in use. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:16 ` Baolu Lu @ 2025-08-15 9:46 ` Tian, Kevin 2025-08-18 5:58 ` Baolu Lu 2025-08-15 14:31 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-15 9:46 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Friday, August 15, 2025 5:17 PM > > On 8/8/2025 10:57 AM, Tian, Kevin wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, August 8, 2025 3:52 AM > >> > >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > >>> +static void kernel_pte_work_func(struct work_struct *work) > >>> +{ > >>> + struct page *page, *next; > >>> + > >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > >>> + > >>> + guard(spinlock)(&kernel_pte_work.lock); > >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > >>> + list_del_init(&page->lru); > >> > >> Please don't add new usages of lru, we are trying to get rid of this. :( > >> > >> I think the memory should be struct ptdesc, use that.. > >> > > > > btw with this change we should also defer free of the pmd page: > > > > pud_free_pmd_page() > > ... > > for (i = 0; i < PTRS_PER_PMD; i++) { > > if (!pmd_none(pmd_sv[i])) { > > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > > pte_free_kernel(&init_mm, pte); > > } > > } > > > > free_page((unsigned long)pmd_sv); > > > > Otherwise the risk still exists if the pmd page is repurposed before the > > pte work is scheduled. > > You're right that freeing high-level page table pages also requires an > IOTLB flush before the pages are freed. But I question the practical > risk of the race given the extremely small time window. If this is a It's already extremely difficult to conduct a real attack even w/o this fix. I'm not sure the criteria how small we consider acceptable in this specific case. but leaving an incomplete fix in code doesn't sound clean... > real concern, a potential mitigation would be to clear the U/S bits in > all page table entries for kernel address space? But I am not confident > in making that change at this time as I am unsure of the side effects it > might cause. I think there was already consensus that clearing U/S bits in all entries doesn't prevent the IOMMU caching them and setting A/D bits on the freed pagetable. > > > > > another observation - pte_free_kernel is not used in remove_pagetable () > > and __change_page_attr(). Is it straightforward to put it in those paths > > or do we need duplicate some deferring logic there? > > The remove_pagetable() function is called in the path where memory is > hot-removed from the system, right? If so, there should be no issue, as > the threat model here is a page table page being freed and repurposed > while it's still cached in the IOTLB. In the hot-remove case, the memory > is removed and will not be reused, so that's fine as far as I can see. what about the page is hot-added back while the stale entry pointing to it is still valid in the IOMMU, theoretically? 😊 > > The same to __change_page_attr(), which only changes the attributes of a > page table entry while the underlying page remains in use. 
> it may lead to cpa_collapse_large_pages() if changing attribute leads to all adjacent 4k pages in 2M range are with same attribute. Then page table might be freed: cpa_collapse_large_pages(): list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); __free_page(ptdesc_page(ptdesc)); } ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:46 ` Tian, Kevin @ 2025-08-18 5:58 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 5:58 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 17:46, Tian, Kevin wrote: >> From: Baolu Lu <baolu.lu@linux.intel.com> >> Sent: Friday, August 15, 2025 5:17 PM >> >> On 8/8/2025 10:57 AM, Tian, Kevin wrote: >>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>> Sent: Friday, August 8, 2025 3:52 AM >>>> >>>> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>>>> +static void kernel_pte_work_func(struct work_struct *work) >>>>> +{ >>>>> + struct page *page, *next; >>>>> + >>>>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>>>> + >>>>> + guard(spinlock)(&kernel_pte_work.lock); >>>>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>>>> + list_del_init(&page->lru); >>>> >>>> Please don't add new usages of lru, we are trying to get rid of this. :( >>>> >>>> I think the memory should be struct ptdesc, use that.. >>>> >>> >>> btw with this change we should also defer free of the pmd page: >>> >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> You're right that freeing high-level page table pages also requires an >> IOTLB flush before the pages are freed. But I question the practical >> risk of the race given the extremely small time window. If this is a > > It's already extremely difficult to conduct a real attack even w/o this > fix. I'm not sure the criteria how small we consider acceptable in this > specific case. but leaving an incomplete fix in code doesn't sound clean... > >> real concern, a potential mitigation would be to clear the U/S bits in >> all page table entries for kernel address space? But I am not confident >> in making that change at this time as I am unsure of the side effects it >> might cause. > > I think there was already consensus that clearing U/S bits in all entries > doesn't prevent the IOMMU caching them and setting A/D bits on > the freed pagetable. > >> >>> >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >> >> The remove_pagetable() function is called in the path where memory is >> hot-removed from the system, right? If so, there should be no issue, as >> the threat model here is a page table page being freed and repurposed >> while it's still cached in the IOTLB. In the hot-remove case, the memory >> is removed and will not be reused, so that's fine as far as I can see. > > what about the page is hot-added back while the stale entry pointing to > it is still valid in the IOMMU, theoretically? 😊 > >> >> The same to __change_page_attr(), which only changes the attributes of a >> page table entry while the underlying page remains in use. 
>> > > it may lead to cpa_collapse_large_pages() if changing attribute leads to > all adjacent 4k pages in 2M range are with same attribute. Then page > table might be freed: > > cpa_collapse_large_pages(): > list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > list_del(&ptdesc->pt_list); > __free_page(ptdesc_page(ptdesc)); > } All look fair enough to me. I will handle all the cases and make it complete. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:16 ` Baolu Lu 2025-08-15 9:46 ` Tian, Kevin @ 2025-08-15 14:31 ` Dave Hansen 2025-08-18 6:08 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-15 14:31 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 02:16, Baolu Lu wrote: > On 8/8/2025 10:57 AM, Tian, Kevin wrote: >> pud_free_pmd_page() >> ... >> for (i = 0; i < PTRS_PER_PMD; i++) { >> if (!pmd_none(pmd_sv[i])) { >> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >> pte_free_kernel(&init_mm, pte); >> } >> } >> >> free_page((unsigned long)pmd_sv); >> >> Otherwise the risk still exists if the pmd page is repurposed before the >> pte work is scheduled. > > You're right that freeing high-level page table pages also requires an > IOTLB flush before the pages are freed. But I question the practical > risk of the race given the extremely small time window. I hear that Linux is gaining popularity these days. There might even be dozens of users! Given that large scale of dozens (or even hundreds??) of users, I would suggest exercising some care. The race might be small but it only needs to happen once to cause chaos. Seriously, though... A race is a race. Preemption or interrupts or SMIs or VMExits or a million other things can cause a "small time window" to become a big time window. Even perceived small races need to be fixed. > If this is a real concern, a potential mitigation would be to clear > the U/S bits in all page table entries for kernel address space? But > I am not confident in making that change at this time as I am unsure > of the side effects it might cause. That doesn't do any good. I even went as far as double-checking months ago with the IOMMU hardware folks to confirm the actual implementation. I'm really surprised this is being brought up again. >> another observation - pte_free_kernel is not used in remove_pagetable () >> and __change_page_attr(). Is it straightforward to put it in those paths >> or do we need duplicate some deferring logic there? > > The remove_pagetable() function is called in the path where memory is > hot-removed from the system, right? No. Not right. This is in the vmalloc() code: the side of things that _creates_ mappings for new allocations, not tears them down. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 14:31 ` Dave Hansen @ 2025-08-18 6:08 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 6:08 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 22:31, Dave Hansen wrote: > On 8/15/25 02:16, Baolu Lu wrote: >> On 8/8/2025 10:57 AM, Tian, Kevin wrote: >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> You're right that freeing high-level page table pages also requires an >> IOTLB flush before the pages are freed. But I question the practical >> risk of the race given the extremely small time window. > > I hear that Linux is gaining popularity these days. There might even be > dozens of users! Given that large scale of dozens (or even hundreds??) > of users, I would suggest exercising some care. The race might be small > but it only needs to happen once to cause chaos. > > Seriously, though... A race is a race. Preemption or interrupts or SMIs > or VMExits or a million other things can cause a "small time window" to > become a big time window. > > Even perceived small races need to be fixed. Yes, agreed. > >> If this is a real concern, a potential mitigation would be to clear >> the U/S bits in all page table entries for kernel address space? But >> I am not confident in making that change at this time as I am unsure >> of the side effects it might cause. > > That doesn't do any good. I even went as far as double-checking months > ago with the IOMMU hardware folks to confirm the actual implementation. > I'm really surprised this is being brought up again. It should not be brought up again. Sorry about it. I was thinking about deferring pte page table page free plus clearing U/S bit. That's also not desirable anyway, so let's ignore it. > >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >> >> The remove_pagetable() function is called in the path where memory is >> hot-removed from the system, right? > > No. Not right. > > This is in the vmalloc() code: the side of things that _creates_ > mappings for new allocations, not tears them down. Yeah, let me look into it. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 2:57 ` Tian, Kevin 2025-08-15 9:16 ` Baolu Lu @ 2025-08-18 6:21 ` Baolu Lu 2025-08-21 7:05 ` Tian, Kevin 1 sibling, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-18 6:21 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/8/25 10:57, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, August 8, 2025 3:52 AM >> >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>> +static void kernel_pte_work_func(struct work_struct *work) >>> +{ >>> + struct page *page, *next; >>> + >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>> + list_del_init(&page->lru); >> >> Please don't add new usages of lru, we are trying to get rid of this. :( >> >> I think the memory should be struct ptdesc, use that.. >> > > btw with this change we should also defer free of the pmd page: > > pud_free_pmd_page() > ... > for (i = 0; i < PTRS_PER_PMD; i++) { > if (!pmd_none(pmd_sv[i])) { > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > pte_free_kernel(&init_mm, pte); > } > } > > free_page((unsigned long)pmd_sv); > > Otherwise the risk still exists if the pmd page is repurposed before the > pte work is scheduled. My understanding is that you were talking about pmd_free(), correct? It appears that both pmd_free() and pte_free_kernel() call pagetable_dtor_free(). If so, how about the following change? diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index dbddacdca2ce..04f6b5bc212c 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad #ifndef __HAVE_ARCH_PMD_FREE static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) { - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); - BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); - pagetable_dtor_free(ptdesc); + pte_free_kernel(mm, (pte_t *)pmd); } > > another observation - pte_free_kernel is not used in remove_pagetable () > and __change_page_attr(). Is it straightforward to put it in those paths > or do we need duplicate some deferring logic there? > In both of these cases, pages in an array or list require deferred freeing. I don't believe the previous approach, which calls pagetable_dtor_free(), will still work here. Perhaps a separate list is needed? What about something like the following? 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..2e887463c165 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_free_pages_async(page, order); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..7e567bdfddce 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) flush_tlb_all(); - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { - list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + kernel_pgtable_free_list_async(&pgtables); + } else { + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 4938eff5b482..13bbe1588872 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -48,8 +48,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #ifdef CONFIG_ASYNC_PGTABLE_FREE void pte_free_kernel_async(struct ptdesc *ptdesc); +void kernel_pgtable_free_list_async(struct list_head *list); +void kernel_pgtable_free_pages_async(struct page *pages, int order); #else static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +static inline void kernel_pgtable_free_list_async(struct list_head *list) {} +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} #endif /** diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index b9a785dd6b8d..d1d19132bbe8 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct *work); struct { struct list_head list; + struct list_head pages; spinlock_t lock; struct work_struct work; } kernel_pte_work = { @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct work_struct *work) { struct ptdesc *ptdesc, *next; LIST_HEAD(free_list); + LIST_HEAD(pages_list); iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); spin_lock(&kernel_pte_work.lock); list_move(&kernel_pte_work.list, &free_list); + list_move(&kernel_pte_work.pages, &pages_list); spin_unlock(&kernel_pte_work.lock); list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { list_del(&ptdesc->pt_list); pagetable_dtor_free(ptdesc); } + + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } void pte_free_kernel_async(struct ptdesc *ptdesc) @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) list_add(&ptdesc->pt_list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); } + +void kernel_pgtable_free_list_async(struct list_head *list) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_splice_tail(list, &kernel_pte_work.pages); + schedule_work(&kernel_pte_work.work); +} + +void kernel_pgtable_free_pages_async(struct page *pages, int order) +{ + unsigned long nr_pages = 1 << order; + struct ptdesc *ptdesc; + int i; + + guard(spinlock)(&kernel_pte_work.lock); + for (i = 0; i < nr_pages; i++) { + ptdesc = page_ptdesc(&pages[i]); + 
list_add(&ptdesc->pt_list, &kernel_pte_work.pages); + } + schedule_work(&kernel_pte_work.work); +} #endif Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-18 6:21 ` Baolu Lu @ 2025-08-21 7:05 ` Tian, Kevin 2025-08-23 3:26 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-21 7:05 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Monday, August 18, 2025 2:22 PM > > On 8/8/25 10:57, Tian, Kevin wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, August 8, 2025 3:52 AM > >> > >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > >>> +static void kernel_pte_work_func(struct work_struct *work) > >>> +{ > >>> + struct page *page, *next; > >>> + > >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > >>> + > >>> + guard(spinlock)(&kernel_pte_work.lock); > >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > >>> + list_del_init(&page->lru); > >> > >> Please don't add new usages of lru, we are trying to get rid of this. :( > >> > >> I think the memory should be struct ptdesc, use that.. > >> > > > > btw with this change we should also defer free of the pmd page: > > > > pud_free_pmd_page() > > ... > > for (i = 0; i < PTRS_PER_PMD; i++) { > > if (!pmd_none(pmd_sv[i])) { > > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > > pte_free_kernel(&init_mm, pte); > > } > > } > > > > free_page((unsigned long)pmd_sv); > > > > Otherwise the risk still exists if the pmd page is repurposed before the > > pte work is scheduled. > > > My understanding is that you were talking about pmd_free(), correct? It yes > appears that both pmd_free() and pte_free_kernel() call > pagetable_dtor_free(). If so, how about the following change? > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index dbddacdca2ce..04f6b5bc212c 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct > mm_struct *mm, unsigned long ad > #ifndef __HAVE_ARCH_PMD_FREE > static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) > { > - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); > - > BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); > - pagetable_dtor_free(ptdesc); > + pte_free_kernel(mm, (pte_t *)pmd); > } better to rename pte_free_kernel_async() to pagetable_free_kernel_async() and call it directly here like you did in pte_free_kernel(). otherwise it's a bit weird to call a pte helper for pmd. accordingly do we need a new helper pmd_free_kernel()? Now there is no restriction that pmd_free() can only be called on kernel entries. e.g. mop_up_one_pmd() (only in PAE and KPTI), destroy_args() if CONFIG_DEBUG_VM_PGTABLE, etc. > > > > > another observation - pte_free_kernel is not used in remove_pagetable () > > and __change_page_attr(). Is it straightforward to put it in those paths > > or do we need duplicate some deferring logic there? > > > > In both of these cases, pages in an array or list require deferred > freeing. I don't believe the previous approach, which calls > pagetable_dtor_free(), will still work here. Perhaps a separate list is this is the part which I don't quite understand. 
Is there guarantee that page tables removed in those path are not allocated with any pagetable_ctor helpers? Otherwise some state might be broken w/o proper pagetable_dtor(). Knowing little here, so likely I missed some basic context. > needed? What about something like the following? > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c > index 76e33bd7c556..2e887463c165 100644 > --- a/arch/x86/mm/init_64.c > +++ b/arch/x86/mm/init_64.c > @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page > *page, int order) > free_reserved_pages(page, nr_pages); > #endif > } else { > - free_pages((unsigned long)page_address(page), order); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_free_pages_async(page, order); > + else > + free_pages((unsigned long)page_address(page), > order); > } > } > > diff --git a/arch/x86/mm/pat/set_memory.c > b/arch/x86/mm/pat/set_memory.c > index 8834c76f91c9..7e567bdfddce 100644 > --- a/arch/x86/mm/pat/set_memory.c > +++ b/arch/x86/mm/pat/set_memory.c > @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct > cpa_data *cpa) > > flush_tlb_all(); > > - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > - list_del(&ptdesc->pt_list); > - __free_page(ptdesc_page(ptdesc)); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { > + kernel_pgtable_free_list_async(&pgtables); > + } else { > + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > + list_del(&ptdesc->pt_list); > + __free_page(ptdesc_page(ptdesc)); > + } > } > } > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 4938eff5b482..13bbe1588872 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -48,8 +48,12 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > void pte_free_kernel_async(struct ptdesc *ptdesc); > +void kernel_pgtable_free_list_async(struct list_head *list); > +void kernel_pgtable_free_pages_async(struct page *pages, int order); > #else > static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > +static inline void kernel_pgtable_free_list_async(struct list_head > *list) {} > +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} > #endif > > /** > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index b9a785dd6b8d..d1d19132bbe8 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct > *work); > > struct { > struct list_head list; > + struct list_head pages; the real difference between the two lists is about whether to use pagetable_dtor_free() or __free_page(). Then clearer to call them 'pages_dtor' and 'pages'? 
> spinlock_t lock; > struct work_struct work; > } kernel_pte_work = { > @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct > work_struct *work) > { > struct ptdesc *ptdesc, *next; > LIST_HEAD(free_list); > + LIST_HEAD(pages_list); > > iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > spin_lock(&kernel_pte_work.lock); > list_move(&kernel_pte_work.list, &free_list); > + list_move(&kernel_pte_work.pages, &pages_list); > spin_unlock(&kernel_pte_work.lock); > > list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { > list_del(&ptdesc->pt_list); > pagetable_dtor_free(ptdesc); > } > + > + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { > + list_del(&ptdesc->pt_list); > + __free_page(ptdesc_page(ptdesc)); > + } > } > > void pte_free_kernel_async(struct ptdesc *ptdesc) > @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) > list_add(&ptdesc->pt_list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } > + > +void kernel_pgtable_free_list_async(struct list_head *list) > +{ > + guard(spinlock)(&kernel_pte_work.lock); > + list_splice_tail(list, &kernel_pte_work.pages); > + schedule_work(&kernel_pte_work.work); > +} > + > +void kernel_pgtable_free_pages_async(struct page *pages, int order) > +{ > + unsigned long nr_pages = 1 << order; > + struct ptdesc *ptdesc; > + int i; > + > + guard(spinlock)(&kernel_pte_work.lock); > + for (i = 0; i < nr_pages; i++) { > + ptdesc = page_ptdesc(&pages[i]); > + list_add(&ptdesc->pt_list, &kernel_pte_work.pages); > + } there is a side-effect here. Now a contiguous range of pages (order=N) are freed one-by-one, so they are first fed back to the order=0 free list with possibility of getting split due to small order allocations before having chance to lift back to the order=N list. then the number of available huge pages is unnecessarily reduced. but look at the code probably we don't need to handle the order>0 case? now only free_hugepage_table() may free a huge page, called from remove_pmd_table() when a pmd is a *leaf* entry pointing to a vmemmap huge page. So it's not a real page table. I'm not sure why it's piggybacked in a pagetable helper, but sounds like a case we can ignore in this mitigation... > + schedule_work(&kernel_pte_work.work); > +} > #endif > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-21 7:05 ` Tian, Kevin @ 2025-08-23 3:26 ` Baolu Lu 2025-08-25 22:36 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-23 3:26 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/21/2025 3:05 PM, Tian, Kevin wrote: >> From: Baolu Lu<baolu.lu@linux.intel.com> >> Sent: Monday, August 18, 2025 2:22 PM >> >> On 8/8/25 10:57, Tian, Kevin wrote: >>>> From: Jason Gunthorpe<jgg@nvidia.com> >>>> Sent: Friday, August 8, 2025 3:52 AM >>>> >>>> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>>>> +static void kernel_pte_work_func(struct work_struct *work) >>>>> +{ >>>>> + struct page *page, *next; >>>>> + >>>>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>>>> + >>>>> + guard(spinlock)(&kernel_pte_work.lock); >>>>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>>>> + list_del_init(&page->lru); >>>> Please don't add new usages of lru, we are trying to get rid of this. 🙁 >>>> >>>> I think the memory should be struct ptdesc, use that.. >>>> >>> btw with this change we should also defer free of the pmd page: >>> >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> My understanding is that you were talking about pmd_free(), correct? It > yes > >> appears that both pmd_free() and pte_free_kernel() call >> pagetable_dtor_free(). If so, how about the following change? >> >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index dbddacdca2ce..04f6b5bc212c 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct >> mm_struct *mm, unsigned long ad >> #ifndef __HAVE_ARCH_PMD_FREE >> static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) >> { >> - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); >> - >> BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); >> - pagetable_dtor_free(ptdesc); >> + pte_free_kernel(mm, (pte_t *)pmd); >> } > better to rename pte_free_kernel_async() to pagetable_free_kernel_async() > and call it directly here like you did in pte_free_kernel(). otherwise it's > a bit weird to call a pte helper for pmd. > > accordingly do we need a new helper pmd_free_kernel()? Now there is > no restriction that pmd_free() can only be called on kernel entries. e.g. > mop_up_one_pmd() (only in PAE and KPTI), destroy_args() if > CONFIG_DEBUG_VM_PGTABLE, etc. > >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >>> >> In both of these cases, pages in an array or list require deferred >> freeing. I don't believe the previous approach, which calls >> pagetable_dtor_free(), will still work here. Perhaps a separate list is > this is the part which I don't quite understand. 
Is there guarantee that > page tables removed in those path are not allocated with any > pagetable_ctor helpers? Otherwise some state might be broken w/o > proper pagetable_dtor(). > > Knowing little here, so likely I missed some basic context. > >> needed? What about something like the following? >> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c >> index 76e33bd7c556..2e887463c165 100644 >> --- a/arch/x86/mm/init_64.c >> +++ b/arch/x86/mm/init_64.c >> @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page >> *page, int order) >> free_reserved_pages(page, nr_pages); >> #endif >> } else { >> - free_pages((unsigned long)page_address(page), order); >> + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) >> + kernel_pgtable_free_pages_async(page, order); >> + else >> + free_pages((unsigned long)page_address(page), >> order); >> } >> } >> >> diff --git a/arch/x86/mm/pat/set_memory.c >> b/arch/x86/mm/pat/set_memory.c >> index 8834c76f91c9..7e567bdfddce 100644 >> --- a/arch/x86/mm/pat/set_memory.c >> +++ b/arch/x86/mm/pat/set_memory.c >> @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct >> cpa_data *cpa) >> >> flush_tlb_all(); >> >> - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { >> - list_del(&ptdesc->pt_list); >> - __free_page(ptdesc_page(ptdesc)); >> + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { >> + kernel_pgtable_free_list_async(&pgtables); >> + } else { >> + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { >> + list_del(&ptdesc->pt_list); >> + __free_page(ptdesc_page(ptdesc)); >> + } >> } >> } >> >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index 4938eff5b482..13bbe1588872 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -48,8 +48,12 @@ static inline pte_t >> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >> >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> void pte_free_kernel_async(struct ptdesc *ptdesc); >> +void kernel_pgtable_free_list_async(struct list_head *list); >> +void kernel_pgtable_free_pages_async(struct page *pages, int order); >> #else >> static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} >> +static inline void kernel_pgtable_free_list_async(struct list_head >> *list) {} >> +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} >> #endif >> >> /** >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c >> index b9a785dd6b8d..d1d19132bbe8 100644 >> --- a/mm/pgtable-generic.c >> +++ b/mm/pgtable-generic.c >> @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct >> *work); >> >> struct { >> struct list_head list; >> + struct list_head pages; > the real difference between the two lists is about whether to use > pagetable_dtor_free() or __free_page(). Then clearer to call them > 'pages_dtor' and 'pages'? 
> >> spinlock_t lock; >> struct work_struct work; >> } kernel_pte_work = { >> @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct >> work_struct *work) >> { >> struct ptdesc *ptdesc, *next; >> LIST_HEAD(free_list); >> + LIST_HEAD(pages_list); >> >> iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> >> spin_lock(&kernel_pte_work.lock); >> list_move(&kernel_pte_work.list, &free_list); >> + list_move(&kernel_pte_work.pages, &pages_list); >> spin_unlock(&kernel_pte_work.lock); >> >> list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { >> list_del(&ptdesc->pt_list); >> pagetable_dtor_free(ptdesc); >> } >> + >> + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { >> + list_del(&ptdesc->pt_list); >> + __free_page(ptdesc_page(ptdesc)); >> + } >> } >> >> void pte_free_kernel_async(struct ptdesc *ptdesc) >> @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) >> list_add(&ptdesc->pt_list, &kernel_pte_work.list); >> schedule_work(&kernel_pte_work.work); >> } >> + >> +void kernel_pgtable_free_list_async(struct list_head *list) >> +{ >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_splice_tail(list, &kernel_pte_work.pages); >> + schedule_work(&kernel_pte_work.work); >> +} >> + >> +void kernel_pgtable_free_pages_async(struct page *pages, int order) >> +{ >> + unsigned long nr_pages = 1 << order; >> + struct ptdesc *ptdesc; >> + int i; >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + for (i = 0; i < nr_pages; i++) { >> + ptdesc = page_ptdesc(&pages[i]); >> + list_add(&ptdesc->pt_list, &kernel_pte_work.pages); >> + } > there is a side-effect here. Now a contiguous range of pages (order=N) > are freed one-by-one, so they are first fed back to the order=0 free list > with possibility of getting split due to small order allocations before > having chance to lift back to the order=N list. then the number of > available huge pages is unnecessarily reduced. > > but look at the code probably we don't need to handle the order>0 case? > > now only free_hugepage_table() may free a huge page, called from > remove_pmd_table() when a pmd is a*leaf* entry pointing to a > vmemmap huge page. So it's not a real page table. I'm not sure why > it's piggybacked in a pagetable helper, but sounds like a case we > can ignore in this mitigation... I am not sure about this context either. My understanding is that "order > 0" is used for the batch deallocation of page table pages that were created to map large, contiguous blocks of memory. Thanks for your comments. I have created a new patch to introduce the asynchronous page table page free mechanism, taking the above comments into consideration. Let me post it below for further review. mm: Conditionally defer freeing of kernel page table pages On x86 and other architectures that map the kernel's virtual address space into the upper portion of every process's page table, the IOMMU's paging structure caches can become stale. This occurs when a page used for the kernel's page tables is freed and reused without the IOMMU being notified. While the IOMMU driver is notified of changes to user virtual address mappings, there is no similar notification mechanism for kernel page table changes. This can lead to data corruption or system instability when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal caches may retain stale entries for kernel virtual addresses. This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_PGTABLE_FREE. 
When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing provides a safe context for a future patch to add an IOMMU-specific callback, which might be expensive on large-scale systems. This ensures the necessary IOMMU cache invalidation is performed before the page is finally returned to the page allocator outside of any critical, non-sleepable path. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> --- arch/x86/mm/init_64.c | 5 +- arch/x86/mm/pat/set_memory.c | 10 +++- arch/x86/mm/pgtable.c | 5 +- include/asm-generic/pgalloc.h | 17 +++++- mm/Kconfig | 3 + mm/pgtable-generic.c | 106 ++++++++++++++++++++++++++++++++++ 6 files changed, 140 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..bf06d815d214 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_pages(page, order); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..97d4f39cae7d 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) flush_tlb_all(); - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { - list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + kernel_pgtable_async_free_page_list(&pgtables); + } else { + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..45268f009cb0 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -757,7 +757,10 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr) free_page((unsigned long)pmd_sv); - pmd_free(&init_mm, pmd); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_dtor(virt_to_ptdesc(pmd)); + else + pmd_free(&init_mm, pmd); return 1; } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..baa9780cad71 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,16 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) 
alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc); +void kernel_pgtable_async_free_page_list(struct list_head *list); +void kernel_pgtable_async_free_pages(struct page *pages, int order); +#else +static inline void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) {} +static inline void kernel_pgtable_async_free_page_list(struct list_head *list) {} +static inline void kernel_pgtable_async_free_pages(struct page *pages, int order) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +63,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_dtor(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..369f3259e6da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,9 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..d2751da58c94 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -406,3 +406,109 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pgtable_work_func(struct work_struct *work); + +static struct { + /* list for pagetable_dtor_free() */ + struct list_head dtor; + /* list for __free_page() */ + struct list_head page; + /* list for free_pages() */ + struct list_head pages; + /* protect all the ptdesc lists */ + spinlock_t lock; + struct work_struct work; +} kernel_pgtable_work = { + .dtor = LIST_HEAD_INIT(kernel_pgtable_work.dtor), + .page = LIST_HEAD_INIT(kernel_pgtable_work.page), + .pages = LIST_HEAD_INIT(kernel_pgtable_work.pages), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), + .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), +}; + +struct async_free_data { + struct list_head node; + unsigned long addr; + int order; +}; + +static void kernel_pgtable_work_func(struct work_struct *work) +{ + struct async_free_data *data, *temp; + struct ptdesc *ptdesc, *next; + LIST_HEAD(pages_list); + LIST_HEAD(dtor_list); + LIST_HEAD(page_list); + + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail_init(&kernel_pgtable_work.dtor, &dtor_list); + list_splice_tail_init(&kernel_pgtable_work.page, &page_list); + list_splice_tail_init(&kernel_pgtable_work.pages, &pages_list); + spin_unlock(&kernel_pgtable_work.lock); + + list_for_each_entry_safe(ptdesc, next, &dtor_list, pt_list) { + list_del(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } + + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } + + list_for_each_entry_safe(data, temp, &pages_list, node) { + list_del(&data->node); + free_pages(data->addr, data->order); + kfree(data); + } +} + +void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) +{ + spin_lock(&kernel_pgtable_work.lock); + list_add(&ptdesc->pt_list, &kernel_pgtable_work.dtor); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} + +void 
kernel_pgtable_async_free_page_list(struct list_head *list) +{ + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail(list, &kernel_pgtable_work.page); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} + +void kernel_pgtable_async_free_pages(struct page *pages, int order) +{ + struct async_free_data *data; + + if (order == 0) { + spin_lock(&kernel_pgtable_work.lock); + list_add(&page_ptdesc(pages)->pt_list, &kernel_pgtable_work.page); + spin_unlock(&kernel_pgtable_work.lock); + + goto out; + } + + data = kzalloc(sizeof(*data), GFP_ATOMIC); + if (!data) { + free_pages((unsigned long)page_address(pages), order); + goto out; + } + + data->addr = (unsigned long)page_address(pages); + data->order = order; + + spin_lock(&kernel_pgtable_work.lock); + list_add(&data->node, &kernel_pgtable_work.pages); + spin_unlock(&kernel_pgtable_work.lock); + +out: + schedule_work(&kernel_pgtable_work.work); +} +#endif -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-23 3:26 ` Baolu Lu @ 2025-08-25 22:36 ` Dave Hansen 2025-08-26 1:25 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-25 22:36 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/22/25 20:26, Baolu Lu wrote: > +static struct { > + /* list for pagetable_dtor_free() */ > + struct list_head dtor; > + /* list for __free_page() */ > + struct list_head page; > + /* list for free_pages() */ > + struct list_head pages; > + /* protect all the ptdesc lists */ > + spinlock_t lock; > + struct work_struct work; Could you explain a bit why this now needs three separate lists? Seems like pure overkill. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-25 22:36 ` Dave Hansen @ 2025-08-26 1:25 ` Baolu Lu 2025-08-26 2:49 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-26 1:25 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/26/25 06:36, Dave Hansen wrote: > On 8/22/25 20:26, Baolu Lu wrote: >> +static struct { >> + /* list for pagetable_dtor_free() */ >> + struct list_head dtor; >> + /* list for __free_page() */ >> + struct list_head page; >> + /* list for free_pages() */ >> + struct list_head pages; >> + /* protect all the ptdesc lists */ >> + spinlock_t lock; >> + struct work_struct work; > > Could you explain a bit why this now needs three separate lists? Seems > like pure overkill. Yes, sure. The three separate lists are needed because we're handling three distinct types of page deallocation. Grouping the pages this way allows the workqueue handler to free each type using the correct function. - pagetable_dtor_free(): This is for freeing PTE pages, which require specific cleanup of a ptdesc structure. - __free_page(): This is for freeing a single page. - free_pages(): This is for freeing a contiguous block of pages that were allocated together. Using separate lists for each type ensures that every page is handled correctly without having to check the page's type at runtime. While this may seem like overkill, it was chosen to ensure functional correctness. Any better solution? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
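To make the distinction concrete, here is a minimal dispatcher-style sketch of the three release paths named above. The enum and the kpt_release() helper are hypothetical, introduced only for illustration; the three functions it calls are the existing kernel helpers from the reply.

enum kpt_free_kind { KPT_DTOR, KPT_PAGE, KPT_PAGES };

/* Sketch only: shows why each class of page needs a different release call. */
static void kpt_release(enum kpt_free_kind kind, struct page *page, int order)
{
        switch (kind) {
        case KPT_DTOR:
                /* Page table page with ptdesc metadata: run the destructor. */
                pagetable_dtor_free(page_ptdesc(page));
                break;
        case KPT_PAGE:
                /* Single bare page, no destructor to run. */
                __free_page(page);
                break;
        case KPT_PAGES:
                /* Non-compound higher-order block: the order must be carried along. */
                free_pages((unsigned long)page_address(page), order);
                break;
        }
}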
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 1:25 ` Baolu Lu @ 2025-08-26 2:49 ` Baolu Lu 2025-08-26 14:22 ` Dave Hansen 2025-08-26 16:21 ` Dave Hansen 0 siblings, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-26 2:49 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/26/25 09:25, Baolu Lu wrote: > On 8/26/25 06:36, Dave Hansen wrote: >> On 8/22/25 20:26, Baolu Lu wrote: >>> +static struct { >>> + /* list for pagetable_dtor_free() */ >>> + struct list_head dtor; >>> + /* list for __free_page() */ >>> + struct list_head page; >>> + /* list for free_pages() */ >>> + struct list_head pages; >>> + /* protect all the ptdesc lists */ >>> + spinlock_t lock; >>> + struct work_struct work; >> >> Could you explain a bit why this now needs three separate lists? Seems >> like pure overkill. > > Yes, sure. > > The three separate lists are needed because we're handling three > distinct types of page deallocation. Grouping the pages this way allows > the workqueue handler to free each type using the correct function. Please allow me to add more details. > > - pagetable_dtor_free(): This is for freeing PTE pages, which require > specific cleanup of a ptdesc structure. This is used in static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) and int pud_free_pmd_page(pud_t *pud, unsigned long addr) > > - __free_page(): This is for freeing a single page. This is used in static void cpa_collapse_large_pages(struct cpa_data *cpa) { ... ... list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); __free_page(ptdesc_page(ptdesc)); } } > > - free_pages(): This is for freeing a contiguous block of pages that > were allocated together. This is used in static void __meminit free_pagetable(struct page *page, int order) { ... ... free_pages((unsigned long)page_address(page), order); } What's strange is that order is almost always 0, except in the path of remove_pmd_table() -> free_hugepage_table(), where order can be greater than 0. However, in this context path, free_hugepage_table() isn't used to free a page table page itself. Instead, it's used to free the actual pages that a leaf PMD is pointing to. static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end, bool direct, struct vmem_altmap *altmap) { ... ... if (pmd_leaf(*pmd)) { if (IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(next, PMD_SIZE)) { if (!direct) free_hugepage_table(pmd_page(*pmd), altmap); spin_lock(&init_mm.page_table_lock); pmd_clear(pmd); spin_unlock(&init_mm.page_table_lock); pages++; } else if (vmemmap_pmd_is_unused(addr, next)) { free_hugepage_table(pmd_page(*pmd), altmap); spin_lock(&init_mm.page_table_lock); pmd_clear(pmd); spin_unlock(&init_mm.page_table_lock); } continue; ... ... } Is this a misuse of free_pagetable() or anything overlooked? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 2:49 ` Baolu Lu @ 2025-08-26 14:22 ` Dave Hansen 2025-08-26 14:33 ` Matthew Wilcox 2025-08-27 10:58 ` Baolu Lu 2025-08-26 16:21 ` Dave Hansen 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-26 14:22 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/25/25 19:49, Baolu Lu wrote: >> The three separate lists are needed because we're handling three >> distinct types of page deallocation. Grouping the pages this way allows >> the workqueue handler to free each type using the correct function. > > Please allow me to add more details. Right, I know why it got added this way: it was the quickest way to hack together a patch that fixes the IOMMU issue without refactoring anything. I agree that you have three cases: 1. A full on 'struct ptdesc' that needs its destructor run 2. An order-0 'struct page' 3. A higher-order 'struct page' Long-term, #2 and #3 probably need to get converted over to 'struct ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, any other mm folks: do you agree? Short-term, I'd just consolidate your issue down to a single list. #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in to pagetable_dtor_free() to tell it to use the deferred page table free list. Do this with a bit in the ptdesc or a new argument to pagetable_dtor_free(). #2. Just append these to the deferred page table free list. Easy. #3. The biggest hacky way to do this is to just treat the higher-order non-compound page and put the pages on the deferred page table free list one at a time. The other way to do it is to track down how this thing got allocated in the first place and make sure it's got __GFP_COMP metadata. If so, you can just use __free_pages() for everything. Yeah, it'll take a couple patches up front to refactor some things. But that refactoring will make things more consistent instead of adding adding complexity to deal with the inconsistency. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:22 ` Dave Hansen @ 2025-08-26 14:33 ` Matthew Wilcox 2025-08-26 14:57 ` Dave Hansen 2025-08-27 10:58 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Matthew Wilcox @ 2025-08-26 14:33 UTC (permalink / raw) To: Dave Hansen Cc: Baolu Lu, Tian, Kevin, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola On Tue, Aug 26, 2025 at 07:22:06AM -0700, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: > >> The three separate lists are needed because we're handling three > >> distinct types of page deallocation. Grouping the pages this way allows > >> the workqueue handler to free each type using the correct function. > > > > Please allow me to add more details. > > Right, I know why it got added this way: it was the quickest way to hack > together a patch that fixes the IOMMU issue without refactoring anything. > > I agree that you have three cases: > 1. A full on 'struct ptdesc' that needs its destructor run > 2. An order-0 'struct page' > 3. A higher-order 'struct page' > > Long-term, #2 and #3 probably need to get converted over to 'struct > ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, > any other mm folks: do you agree? Uhh. I'm still quite ignorant about iommu page tables. Let me make some general observations that may be helpful. We are attempting to shrink struct page down to 8 bytes. That's going to proceed in stages over the next two to three years. Depending how much information you need to keep around, you may be able to keep using struct page for a while, but eventually (unless you only need a few bits of information), you're going to need a memdesc for your allocations. One memdesc type already assigned is for page tables. Maybe iommu page tables are the same / similar enough to a CPU page table that we keep the same data structure. Maybe they'll need their own data structure. I lack the knowledge to make that decision. For more on memdescs, please see here: https://kernelnewbies.org/MatthewWilcox/Memdescs > Short-term, I'd just consolidate your issue down to a single list. > > #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in > to pagetable_dtor_free() to tell it to use the deferred page table > free list. Do this with a bit in the ptdesc or a new argument to > pagetable_dtor_free(). We should be able to reuse one of the flags in __page_flags or there are 24 unused bits in __page_type. > #2. Just append these to the deferred page table free list. Easy. > #3. The biggest hacky way to do this is to just treat the higher-order > non-compound page and put the pages on the deferred page table > free list one at a time. The other way to do it is to track down how > this thing got allocated in the first place and make sure it's got > __GFP_COMP metadata. If so, you can just use __free_pages() for > everything. Non-compound allocations are bad news. Is there a good reason these can't be made into compound allocations? > Yeah, it'll take a couple patches up front to refactor some things. But > that refactoring will make things more consistent instead of adding > adding complexity to deal with the inconsistency. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:33 ` Matthew Wilcox @ 2025-08-26 14:57 ` Dave Hansen 0 siblings, 0 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-26 14:57 UTC (permalink / raw) To: Matthew Wilcox Cc: Baolu Lu, Tian, Kevin, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola On 8/26/25 07:33, Matthew Wilcox wrote: ... > One memdesc type already assigned is for page tables. Maybe iommu page > tables are the same One bit of context: This is an IOMMU problem but the IOMMU is walking the normal x86 page tables in this case, specifically kernel page tables. There are separate IOMMU page tables, of course, but those aren't at issue here. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:22 ` Dave Hansen 2025-08-26 14:33 ` Matthew Wilcox @ 2025-08-27 10:58 ` Baolu Lu 2025-08-27 23:31 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-27 10:58 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/26/2025 10:22 PM, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: >>> The three separate lists are needed because we're handling three >>> distinct types of page deallocation. Grouping the pages this way allows >>> the workqueue handler to free each type using the correct function. >> >> Please allow me to add more details. > > Right, I know why it got added this way: it was the quickest way to hack > together a patch that fixes the IOMMU issue without refactoring anything. > > I agree that you have three cases: > 1. A full on 'struct ptdesc' that needs its destructor run > 2. An order-0 'struct page' > 3. A higher-order 'struct page' > > Long-term, #2 and #3 probably need to get converted over to 'struct > ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, > any other mm folks: do you agree? > > Short-term, I'd just consolidate your issue down to a single list. > > #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in > to pagetable_dtor_free() to tell it to use the deferred page table > free list. Do this with a bit in the ptdesc or a new argument to > pagetable_dtor_free(). > #2. Just append these to the deferred page table free list. Easy. > #3. The biggest hacky way to do this is to just treat the higher-order > non-compound page and put the pages on the deferred page table > free list one at a time. The other way to do it is to track down how > this thing got allocated in the first place and make sure it's got > __GFP_COMP metadata. If so, you can just use __free_pages() for > everything. > > Yeah, it'll take a couple patches up front to refactor some things. But > that refactoring will make things more consistent instead of adding > adding complexity to deal with the inconsistency. Following the insights above, I wrote the code as follows. Does it look good? 
mm/pgtable-generic.c: #ifdef CONFIG_ASYNC_PGTABLE_FREE /* a 'struct ptdesc' that needs its destructor run */ #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) static void kernel_pgtable_work_func(struct work_struct *work); static struct { struct list_head list; /* protect above ptdesc lists */ spinlock_t lock; struct work_struct work; } kernel_pgtable_work = { .list = LIST_HEAD_INIT(kernel_pgtable_work.list), .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), }; static void kernel_pgtable_work_func(struct work_struct *work) { struct ptdesc *ptdesc, *next; LIST_HEAD(page_list); spin_lock(&kernel_pgtable_work.lock); list_splice_tail_init(&kernel_pgtable_work.list, &page_list); spin_unlock(&kernel_pgtable_work.lock); iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { list_del(&ptdesc->pt_list); if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) pagetable_dtor_free(ptdesc); else __free_page(ptdesc_page(ptdesc)); } } void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) { spin_lock(&kernel_pgtable_work.lock); ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } void kernel_pgtable_async_free_page_list(struct list_head *list) { spin_lock(&kernel_pgtable_work.lock); list_splice_tail(list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } void kernel_pgtable_async_free_page(struct page *page) { spin_lock(&kernel_pgtable_work.lock); list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } #endif Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-27 10:58 ` Baolu Lu @ 2025-08-27 23:31 ` Dave Hansen 2025-08-28 5:31 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-27 23:31 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/27/25 03:58, Baolu Lu wrote: > Following the insights above, I wrote the code as follows. Does it look > good? I'd really prefer an actual diff that compiles. Because: > #ifdef CONFIG_ASYNC_PGTABLE_FREE > /* a 'struct ptdesc' that needs its destructor run */ > #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) This doesn't work, does it? Don't you need to _allocate_ a new bit? Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;) Also, I'm pretty sure you can't just go setting random bits in this because it aliases with page flags. ... > static void kernel_pgtable_work_func(struct work_struct *work) > { > struct ptdesc *ptdesc, *next; > LIST_HEAD(page_list); > > spin_lock(&kernel_pgtable_work.lock); > list_splice_tail_init(&kernel_pgtable_work.list, &page_list); > spin_unlock(&kernel_pgtable_work.lock); > > iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { > list_del(&ptdesc->pt_list); > if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) > pagetable_dtor_free(ptdesc); > else > __free_page(ptdesc_page(ptdesc)); > } > } What I had in mind was that kernel_pgtable_work_func() only does: __free_pages(page, compound_order(page)); All of the things that queue gunk to the list do the legwork of making the work_func() simple. This also guides them toward proving a single, compound page because _they_ have to do the work if they don't want something simple. > void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) > { > spin_lock(&kernel_pgtable_work.lock); > ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; > list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } > > void kernel_pgtable_async_free_page_list(struct list_head *list) > { > spin_lock(&kernel_pgtable_work.lock); > list_splice_tail(list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } > > void kernel_pgtable_async_free_page(struct page *page) > { > spin_lock(&kernel_pgtable_work.lock); > list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } I wouldn't have three of these, I'd have _one_. If you want to free a bunch of pages, then just call it a bunch of times. This is not performance critical. Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously freed". All kernel PTE pages should ideally set it at allocation so you can do this: static inline void pagetable_free(struct ptdesc *pt) { struct page *page = ptdesc_page(pt); if (ptdesc->some_field | PTDESC_KERNEL) kernel_pgtable_async_free_page(page); else __free_pages(page, compound_order(page)); } The folks that, today, call pagetable_dtor_free() don't have to do _anything_ at free time. 
They just set the PTDESC_KERNEL bit at allocation time. See how that code reads? "If you have a kernel page table page, you must asynchronously free it." That's actually meaningful. I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So I'd just do this: #define PTDESC_TYPE_KERNEL BIT(0) Then, something needs to set: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; That could happen as late as here: static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct ptdesc *ptdesc = virt_to_ptdesc(pte); // here pagetable_dtor_free(ptdesc); } Or as early as ptdesc allocation/constructor time. Then you check for PTDESC_TYPE_KERNEL in pagetable_free(). ^ permalink raw reply [flat|nested] 51+ messages in thread
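Note that, read literally, the pagetable_free() sketch above tests "ptdesc->some_field | PTDESC_KERNEL", which is always true, and it names a variable that is not the function's parameter; presumably a bitwise AND on 'pt' was intended. A corrected reading of the sketch, assuming the PTDESC_TYPE_KERNEL bit in __page_type described above and the kernel_pgtable_async_free_page() helper from the earlier draft in this thread:

#define PTDESC_TYPE_KERNEL      BIT(0)  /* lower bits of __page_type are reportedly unused */

static inline void pagetable_free(struct ptdesc *pt)
{
        struct page *page = ptdesc_page(pt);

        if (pt->__page_type & PTDESC_TYPE_KERNEL)
                kernel_pgtable_async_free_page(page);   /* defer to the worker */
        else
                __free_pages(page, compound_order(page));
}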
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-27 23:31 ` Dave Hansen @ 2025-08-28 5:31 ` Baolu Lu 2025-08-28 7:08 ` Tian, Kevin 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-28 5:31 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/28/25 07:31, Dave Hansen wrote: > On 8/27/25 03:58, Baolu Lu wrote: >> Following the insights above, I wrote the code as follows. Does it look >> good? > > I'd really prefer an actual diff that compiles. Because: > >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> /* a 'struct ptdesc' that needs its destructor run */ >> #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) > > This doesn't work, does it? Don't you need to _allocate_ a new bit? > Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;) > > Also, I'm pretty sure you can't just go setting random bits in this > because it aliases with page flags. > > ... >> static void kernel_pgtable_work_func(struct work_struct *work) >> { >> struct ptdesc *ptdesc, *next; >> LIST_HEAD(page_list); >> >> spin_lock(&kernel_pgtable_work.lock); >> list_splice_tail_init(&kernel_pgtable_work.list, &page_list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> >> list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { >> list_del(&ptdesc->pt_list); >> if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) >> pagetable_dtor_free(ptdesc); >> else >> __free_page(ptdesc_page(ptdesc)); >> } >> } > What I had in mind was that kernel_pgtable_work_func() only does: > > __free_pages(page, compound_order(page)); > > All of the things that queue gunk to the list do the legwork of making > the work_func() simple. This also guides them toward proving a single, > compound page because _they_ have to do the work if they don't want > something simple. > >> void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) >> { >> spin_lock(&kernel_pgtable_work.lock); >> ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; >> list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } >> >> void kernel_pgtable_async_free_page_list(struct list_head *list) >> { >> spin_lock(&kernel_pgtable_work.lock); >> list_splice_tail(list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } >> >> void kernel_pgtable_async_free_page(struct page *page) >> { >> spin_lock(&kernel_pgtable_work.lock); >> list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } > > I wouldn't have three of these, I'd have _one_. If you want to free a > bunch of pages, then just call it a bunch of times. This is not > performance critical. > > Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously > freed". 
All kernel PTE pages should ideally set it at allocation so you > can do this: > > static inline void pagetable_free(struct ptdesc *pt) > { > struct page *page = ptdesc_page(pt); > > if (ptdesc->some_field | PTDESC_KERNEL) > kernel_pgtable_async_free_page(page); > else > __free_pages(page, compound_order(page)); > } > > The folks that, today, call pagetable_dtor_free() don't have to do > _anything_ at free time. They just set the PTDESC_KERNEL bit at > allocation time. > > See how that code reads? "If you have a kernel page table page, you must > asynchronously free it." That's actually meaningful. > > I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So > I'd just do this: > > #define PTDESC_TYPE_KERNEL BIT(0) > > Then, something needs to set: > > ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > > That could happen as late as here: > > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > struct ptdesc *ptdesc = virt_to_ptdesc(pte); > > // here > > pagetable_dtor_free(ptdesc); > } > > Or as early as ptdesc allocation/constructor time. Then you check for > PTDESC_TYPE_KERNEL in pagetable_free(). Very appreciated for your review comments. I changed the patch accordingly like this: [PATCH 1/2] mm: Conditionally defer freeing of kernel page table pages On x86 and other architectures that map the kernel's virtual address space into the upper portion of every process's page table, the IOMMU's paging structure caches can become stale when the CPU page table is shared with IOMMU in the Shared Virtual Address (SVA) context. This occurs when a page used for the kernel's page tables is freed and reused without the IOMMU being notified. While the IOMMU driver is notified of changes to user virtual address mappings, there is no similar notification mechanism for kernel page table changes. This can lead to data corruption or system instability when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal caches may retain stale entries for kernel virtual addresses. This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing provides a safe context for a future patch to add an IOMMU-specific callback, which might be expensive on large-scale systems. This ensures the necessary IOMMU cache invalidation is performed before the page is finally returned to the page allocator outside of any critical, non-sleepable path. In the current kernel, some page table pages are allocated with an associated struct ptdesc, while others are not. Those without a ptdesc are freed using free_pages() and its variants, which bypasses the destructor that pagetable_dtor_free() would run. While the long-term plan is to convert all page table pages to use struct ptdesc, this uses a temporary flag within ptdesc to indicate whether a page needs a destructor, considering that this aims to fix a potential security issue in IOMMU SVA. The flag and its associated logic can be removed once the conversion is complete. 
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> --- arch/x86/mm/init_64.c | 9 ++++++- arch/x86/mm/pat/set_memory.c | 5 +++- arch/x86/mm/pgtable.c | 9 ++++++- include/asm-generic/pgalloc.h | 15 +++++++++++- include/linux/mm_types.h | 8 ++++++- mm/Kconfig | 3 +++ mm/pgtable-generic.c | 45 +++++++++++++++++++++++++++++++++++ 7 files changed, 89 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..8a050cfa0f8d 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,14 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + /* + * 'order == 0' means a kernel page table page, otherwise it's + * data pages pointed by a huge page leaf pte. + */ + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) && !order) + kernel_pgtable_async_free(page_ptdesc(page)); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..9a552e93d1f1 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -438,7 +438,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free(ptdesc); + else + __free_page(ptdesc_page(ptdesc)); } } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..6f400cff2ad4 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -757,7 +757,14 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr) free_page((unsigned long)pmd_sv); - pmd_free(&init_mm, pmd); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); + + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; + kernel_pgtable_async_free(ptdesc); + } else { + pmd_free(&init_mm, pmd); + } return 1; } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..d92dc5289ef4 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +void kernel_pgtable_async_free(struct ptdesc *ptdesc); +#else +static inline void kernel_pgtable_async_free(struct ptdesc *ptdesc) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +59,14 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..f74c6b167e6a 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -539,7 +539,7 @@ FOLIO_MATCH(compound_head, _head_3); * @pt_share_count: Used for HugeTLB PMD page table share count. * @_pt_pad_2: Padding to ensure proper alignment. 
* @ptl: Lock for the page table. - * @__page_type: Same as page->page_type. Unused for page tables. + * @__page_type: Same as page->page_type. * @__page_refcount: Same as page refcount. * @pt_memcg_data: Memcg data. Tracked for page tables here. * @@ -637,6 +637,12 @@ static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc) } #endif +/* + * Used by ptdesc::__page_type to indicate a page for kernel address page + * table mapping. + */ +#define PTDESC_TYPE_KERNEL BIT(0) + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..369f3259e6da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,9 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..024d42480c51 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -406,3 +406,48 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pgtable_work_func(struct work_struct *work); + +static struct { + struct list_head list; + /* protect above ptdesc lists */ + spinlock_t lock; + struct work_struct work; +} kernel_pgtable_work = { + .list = LIST_HEAD_INIT(kernel_pgtable_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), + .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), +}; + +static void kernel_pgtable_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next; + LIST_HEAD(page_list); + + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail_init(&kernel_pgtable_work.list, &page_list); + spin_unlock(&kernel_pgtable_work.lock); + + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { + list_del(&ptdesc->pt_list); + if (ptdesc->__page_type & PTDESC_TYPE_KERNEL) { + pagetable_dtor_free(ptdesc); + } else { + struct page *page = ptdesc_page(ptdesc); + + __free_pages(page, compound_order(page)); + } + } +} + +void kernel_pgtable_async_free(struct ptdesc *ptdesc) +{ + spin_lock(&kernel_pgtable_work.lock); + list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} +#endif -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 5:31 ` Baolu Lu @ 2025-08-28 7:08 ` Tian, Kevin 2025-08-28 18:56 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-28 7:08 UTC (permalink / raw) To: Baolu Lu, Hansen, Dave, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Thursday, August 28, 2025 1:31 PM > > @@ -438,7 +438,10 @@ static void cpa_collapse_large_pages(struct > cpa_data *cpa) > > list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > list_del(&ptdesc->pt_list); > - __free_page(ptdesc_page(ptdesc)); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_async_free(ptdesc); > + else > + __free_page(ptdesc_page(ptdesc)); > } Dave's suggestion is to check the new ptdesc flag and defer in pagetable_free(). both here and above could be converted to: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_free(ptdesc); > @@ -757,7 +757,14 @@ int pud_free_pmd_page(pud_t *pud, unsigned long > addr) > > free_page((unsigned long)pmd_sv); > > - pmd_free(&init_mm, pmd); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { > + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); > + > + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > + kernel_pgtable_async_free(ptdesc); > + } else { > + pmd_free(&init_mm, pmd); > + } We may add a new pmd_free_kernel() helper, which does: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_dtor_free(ptdesc); > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > - pagetable_dtor_free(virt_to_ptdesc(pte)); > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > + > + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > + > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_async_free(ptdesc); > + else > + pagetable_dtor_free(ptdesc); > } same: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_dtor_free(ptdesc); Then you have pagetable_free() to handle defer in one place (revised on Dave's draft): static inline void pagetable_free(struct ptdesc *pt) { struct page *page = ptdesc_page(pt); if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) && (ptdesc->__page_type | PTDESC_KERNEL)) kernel_pgtable_async_free_page(page); else __free_pages(page, compound_order(page)); } > +static void kernel_pgtable_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + LIST_HEAD(page_list); > + > + spin_lock(&kernel_pgtable_work.lock); > + list_splice_tail_init(&kernel_pgtable_work.list, &page_list); > + spin_unlock(&kernel_pgtable_work.lock); > + > + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { > + list_del(&ptdesc->pt_list); > + if (ptdesc->__page_type & PTDESC_TYPE_KERNEL) { > + pagetable_dtor_free(ptdesc); > + } else { > + struct page *page = ptdesc_page(ptdesc); > + > + __free_pages(page, compound_order(page)); > + } Then you only need __free_pages() here. ^ permalink raw reply [flat|nested] 51+ messages in thread
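Putting the suggestions above together (and again assuming the flag test is a bitwise AND against PTDESC_TYPE_KERNEL rather than an OR), the call-site side might look roughly like this. pmd_free_kernel() is the new helper proposed above, pagetable_free() is assumed to carry the IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) plus flag check in one place as suggested, and free_collapsed_pgtable() is just an illustrative wrapper for the cpa_collapse_large_pages() conversion.

/* Proposed helper: tag the pmd's ptdesc as a kernel page table, then free it. */
static inline void pmd_free_kernel(struct mm_struct *mm, pmd_t *pmd)
{
        struct ptdesc *ptdesc = virt_to_ptdesc(pmd);

        ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
        pagetable_dtor_free(ptdesc);    /* runs the dtor, then pagetable_free() */
}

/* Illustrative wrapper: what each list entry in cpa_collapse_large_pages() becomes. */
static void free_collapsed_pgtable(struct ptdesc *ptdesc)
{
        ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
        pagetable_free(ptdesc);
}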
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 7:08 ` Tian, Kevin @ 2025-08-28 18:56 ` Dave Hansen 2025-08-28 19:10 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-28 18:56 UTC (permalink / raw) To: Tian, Kevin, Baolu Lu, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox I got a bit sick of sending constant emails on this one and just decided to hack it together myself. Here's roughly what I came up with. Completely untested. Changelogs are bare. Barely compiles: > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=kpte I also didn't actually hook up the IOMMU flush yet. The biggest single chunk of code is defining ptdesc_*_kernel(). The rest of it is painfully simple. I'm pretty sure this hits the vmalloc() case that we all care about. It needs some more time with folks staring at it to make sure it didn't miss some of the other cases. I'll wrangle this into a real series and send it via another thread unless someone beats me to it. If anyone sees anything in there they can't live with, please scream sooner rather than later. ^ permalink raw reply [flat|nested] 51+ messages in thread
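The branch itself is not quoted in the thread, so purely as a guess at the shape of the ptdesc_*_kernel() helpers mentioned above (the follow-up messages indicate the bit is PG_referenced reused through the folio accessors):

/* Hypothetical helpers; the choice of PG_referenced is debated below. */
static inline void ptdesc_set_kernel(struct ptdesc *pt)
{
        folio_set_referenced(ptdesc_folio(pt));
}

static inline bool ptdesc_test_kernel(struct ptdesc *pt)
{
        return folio_test_referenced(ptdesc_folio(pt));
}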
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 18:56 ` Dave Hansen @ 2025-08-28 19:10 ` Jason Gunthorpe 2025-08-28 19:31 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-28 19:10 UTC (permalink / raw) To: Dave Hansen Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox On Thu, Aug 28, 2025 at 11:56:00AM -0700, Dave Hansen wrote: > I got a bit sick of sending constant emails on this one and just decided > to hack it together myself. Here's roughly what I came up with. > Completely untested. Changelogs are bare. Barely compiles: > > > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=kpte > > I also didn't actually hook up the IOMMU flush yet. > > The biggest single chunk of code is defining ptdesc_*_kernel(). The rest > of it is painfully simple. Seems not great to be casting ptdesc to folio just to use folio_set_referenced(); I'm pretty sure that is not the direction things are going in. I'd set the flag yourself, or perhaps use the memory of __page_mapping to store a flag (but zero it before freeing). Looks OK otherwise Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 19:10 ` Jason Gunthorpe @ 2025-08-28 19:31 ` Dave Hansen 2025-08-28 19:39 ` Matthew Wilcox 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-28 19:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox On 8/28/25 12:10, Jason Gunthorpe wrote: >> The biggest single chunk of code is defining ptdesc_*_kernel(). The rest >> of it is painfully simple. > Seems not great to be casting ptdesc to folio just to use > folio_set_referenced(), I'm pretty sure that is no the direction > things are going in.. Ideally, the ptdesc->__page_flags fields would have their own set of macros just like folios do, right? Alas, I was too lazy to go to the trouble of conjuring those up for this single bit. > I'd set the flag yourself or perhaps use the memory of __page_mapping > to store a flag (but zero it before freeing). That was my first inclination as well (actually I was going to use __page_type since the lower 24 bits are evidently open, but same idea). Willy pointed me in the direction of PG_referenced on IRC, so I took that path. But I'm open to doing whatever. It's just three lines of code that probably end up boiling down to bts 0x1, 0x8(%rax) versus bts 0x2, 0x16(%rax) ;) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 19:31 ` Dave Hansen @ 2025-08-28 19:39 ` Matthew Wilcox 0 siblings, 0 replies; 51+ messages in thread From: Matthew Wilcox @ 2025-08-28 19:39 UTC (permalink / raw) To: Dave Hansen Cc: Jason Gunthorpe, Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com On Thu, Aug 28, 2025 at 12:31:40PM -0700, Dave Hansen wrote: > On 8/28/25 12:10, Jason Gunthorpe wrote: > >> The biggest single chunk of code is defining ptdesc_*_kernel(). The rest > >> of it is painfully simple. > > Seems not great to be casting ptdesc to folio just to use > > folio_set_referenced(), I'm pretty sure that is no the direction > > things are going in.. > > Ideally, the ptdesc->__page_flags fields would have their own set of > macros just like folios do, right? Alas, I was too lazy to go to the > trouble of conjuring those up for this single bit. There's a bunch of "Ideally" in ptdesc, but I'm equally too lazy to do that before I've proven everything works great with slab and/or folios. It'll have to be changed no matter which way we do it, so as long as it's obvious what needs to be changed, it's fine. I deem this "good enough for now". ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 2:49 ` Baolu Lu 2025-08-26 14:22 ` Dave Hansen @ 2025-08-26 16:21 ` Dave Hansen 2025-08-27 6:34 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-26 16:21 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org On 8/25/25 19:49, Baolu Lu wrote: > What's strange is that order is almost always 0, except in the path of > remove_pmd_table() -> free_hugepage_table(), where order can be greater > than 0. However, in this context path, free_hugepage_table() isn't used > to free a page table page itself. Instead, it's used to free the actual > pages that a leaf PMD is pointing to. When I first read this, I thought you meant that remove_pmd_table() was trying to free a high-order page composed of multiple pte pages. I don't think it is doing that. Just to be clear: remove_pmd_table() has two modes: 1. The pmd_leaf() code that calls free_hugepage_table(). It is removing the memory pointed to by a PMD *entry*. It is *NOT* removing page tables themselves. 2. The !pmd_leaf() code that does remove the pointers to individual pte pages and frees them via free_pte_table(). *Neither* of these is freeing high-order page table pages. It either frees plain old kernel data pages or it frees an order-0 page table page. In other words, the pmd_leaf() (mode #1) code is irrelevant to us. Those entries are all _PAGE_KERNEL so IOMMU accesses with user privilege can't do anything with them. Or have I just horribly confused myself? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 16:21 ` Dave Hansen @ 2025-08-27 6:34 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-27 6:34 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org On 8/27/25 00:21, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: >> What's strange is that order is almost always 0, except in the path of >> remove_pmd_table() -> free_hugepage_table(), where order can be greater >> than 0. However, in this context path, free_hugepage_table() isn't used >> to free a page table page itself. Instead, it's used to free the actual >> pages that a leaf PMD is pointing to. > When I first read this, I thought you meant that remove_pmd_table() was > trying to free a high-order page composed of multiple pte pages. I don't > think it is doing that. > > Just to be clear: remove_pmd_table() has two modes: > > 1. The pmd_leaf() code that calls free_hugepage_table(). It is > removing the memory pointed to by a PMD*entry*. It is*NOT* > removing page tables themselves. > 2. The !pmd_leaf() code that does remove the pointers to > individual pte pages and frees them via free_pte_table(). > > *Neither* of these is freeing high-order page tables pages. It either > frees plain old kernel data pages or it frees an order-0 page table page. > > In other words, the pmd_leaf() (mode #1) code is irrelevant to us. Those > entries are all _PAGE_KERNEL so IOMMU accesses with user privilege can't > do anything with them. I have the same understanding, so case #3 (higher-order non-compound page table pages) doesn't exist. Page table pages always have an order equal to 0. I will head in the direction you pointed out in your previous reply and post the change later for further review. Thank you for the insights. > > Or have I just horribly confused myself? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 19:51 ` Jason Gunthorpe 2025-08-08 2:57 ` Tian, Kevin @ 2025-08-08 5:08 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-08 5:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/8/25 03:51, Jason Gunthorpe wrote: > On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct page *page, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >> + list_del_init(&page->lru); > Please don't add new usages of lru, we are trying to get rid of this. 🙁 > > I think the memory should be struct ptdesc, use that.. Yes, sure. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:03 ` Dave Hansen 2025-08-06 15:52 ` Jason Gunthorpe @ 2025-08-07 6:53 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-07 6:53 UTC (permalink / raw) To: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/6/25 23:03, Dave Hansen wrote: > On 8/5/25 22:25, Lu Baolu wrote: >> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware >> shares and walks the CPU's page tables. The Linux x86 architecture maps >> the kernel address space into the upper portion of every process’s page >> table. Consequently, in an SVA context, the IOMMU hardware can walk and >> cache kernel space mappings. However, the Linux kernel currently lacks >> a notification mechanism for kernel space mapping changes. This means >> the IOMMU driver is not aware of such changes, leading to a break in >> IOMMU cache coherence. > FWIW, I wouldn't use the term "cache coherence" in this context. I'd > probably just call them "stale IOTLB entries". > > I also think this over states the problem. There is currently no problem > with "kernel space mapping changes". The issue is solely when kernel > page table pages are freed and reused. > >> Modern IOMMUs often cache page table entries of the intermediate-level >> page table as long as the entry is valid, no matter the permissions, to >> optimize walk performance. Currently the iommu driver is notified only >> for changes of user VA mappings, so the IOMMU's internal caches may >> retain stale entries for kernel VA. When kernel page table mappings are >> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale >> entries, Use-After-Free (UAF) vulnerability condition arises. >> >> If these freed page table pages are reallocated for a different purpose, >> potentially by an attacker, the IOMMU could misinterpret the new data as >> valid page table entries. This allows the IOMMU to walk into attacker- >> controlled memory, leading to arbitrary physical memory DMA access or >> privilege escalation. > Note that it's not just use-after-free. It's literally that the IOMMU > will keep writing Accessed and Dirty bits while it thinks the page is > still a page table. The IOMMU will sit there happily setting bits. So, > it's_write_ after free too. > >> To mitigate this, introduce a new iommu interface to flush IOMMU caches. >> This interface should be invoked from architecture-specific code that >> manages combined user and kernel page tables, whenever a kernel page table >> update is done and the CPU TLB needs to be flushed. > There's one tidbit missing from this: > > Currently SVA contexts are all unprivileged. They can only > access user mappings and never kernel mappings. However, IOMMU > still walk kernel-only page tables all the way down to the leaf > where they realize that the entry is a kernel mapping and error > out. Thank you for the guidance. I will improve the commit message accordingly, as follows. iommu/sva: Invalidate stale IOTLB entries for kernel address space In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The x86 architecture maps the kernel's virtual address space into the upper portion of every process's page table. 
Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel page table entries. The Linux kernel currently lacks a notification mechanism for kernel page table changes, specifically when page table pages are freed and reused. The IOMMU driver is only notified of changes to user virtual address mappings. This can cause the IOMMU's internal caches to retain stale entries for kernel VA. A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when kernel page table pages are freed and later reallocated. The IOMMU could misinterpret the new data as valid page table entries. The IOMMU might then walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. This is also a Write-After-Free issue, as the IOMMU will potentially continue to write Accessed and Dirty bits to the freed memory while attempting to walk the stale page tables. Currently, SVA contexts are unprivileged and cannot access kernel mappings. However, the IOMMU will still walk kernel-only page tables all the way down to the leaf entries, where it realizes the mapping is for the kernel and errors out. This means the IOMMU still caches these intermediate page table entries, making the described vulnerability a real concern. To mitigate this, a new IOMMU interface is introduced to flush IOTLB entries for the kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically when a kernel page table update requires a CPU TLB flush. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 5:25 [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu 2025-08-06 15:03 ` Dave Hansen @ 2025-08-14 4:48 ` Ethan Zhao 2025-08-15 7:48 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-14 4:48 UTC (permalink / raw) To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/6/2025 1:25 PM, Lu Baolu wrote: > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware > shares and walks the CPU's page tables. The Linux x86 architecture maps > the kernel address space into the upper portion of every process’s page > table. Consequently, in an SVA context, the IOMMU hardware can walk and > cache kernel space mappings. However, the Linux kernel currently lacks > a notification mechanism for kernel space mapping changes. This means > the IOMMU driver is not aware of such changes, leading to a break in > IOMMU cache coherence. > > Modern IOMMUs often cache page table entries of the intermediate-level > page table as long as the entry is valid, no matter the permissions, to > optimize walk performance. Currently the iommu driver is notified only > for changes of user VA mappings, so the IOMMU's internal caches may > retain stale entries for kernel VA. When kernel page table mappings are > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale > entries, Use-After-Free (UAF) vulnerability condition arises. > > If these freed page table pages are reallocated for a different purpose, > potentially by an attacker, the IOMMU could misinterpret the new data as > valid page table entries. This allows the IOMMU to walk into attacker- > controlled memory, leading to arbitrary physical memory DMA access or > privilege escalation. > > To mitigate this, introduce a new iommu interface to flush IOMMU caches. > This interface should be invoked from architecture-specific code that > manages combined user and kernel page tables, whenever a kernel page table > update is done and the CPU TLB needs to be flushed. > > Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") > Cc: stable@vger.kernel.org > Suggested-by: Jann Horn <jannh@google.com> > Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > Tested-by: Yi Lai <yi1.lai@intel.com> > --- > arch/x86/mm/tlb.c | 4 +++ > drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- > include/linux/iommu.h | 4 +++ > 3 files changed, 67 insertions(+), 1 deletion(-) > > Change log: > v3: > - iommu_sva_mms is an unbound list; iterating it in an atomic context > could introduce significant latency issues. Schedule it in a kernel > thread and replace the spinlock with a mutex. > - Replace the static key with a normal bool; it can be brought back if > data shows the benefit. > - Invalidate KVA range in the flush_tlb_all() paths. > - All previous reviewed-bys are preserved. Please let me know if there > are any objections. 
> > v2: > - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/ > - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); > - Replace the mutex with a spinlock to make the interface usable in the > critical regions. > > v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/ > > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 39f80111e6f1..3b85e7d3ba44 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -12,6 +12,7 @@ > #include <linux/task_work.h> > #include <linux/mmu_notifier.h> > #include <linux/mmu_context.h> > +#include <linux/iommu.h> > > #include <asm/tlbflush.h> > #include <asm/mmu_context.h> > @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) > else > /* Fall back to the IPI-based invalidation. */ > on_each_cpu(do_flush_tlb_all, NULL, 1); > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); Establishing such a simple one-to-one connection between CPU TLB flush and IOMMU TLB flush is debatable. At the very least, not every process is attached to an IOMMU SVA domain. Currently, devices and IOMMU operating in scalable mode are not commonly applied to every process. Thanks, Ethan> } > > /* Flush an arbitrarily large range of memory with INVLPGB. */ > @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) > kernel_tlb_flush_range(info); > > put_flush_tlb_info(); > + iommu_sva_invalidate_kva_range(start, end); > } > > /* > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > index 1a51cfd82808..d0da2b3fd64b 100644 > --- a/drivers/iommu/iommu-sva.c > +++ b/drivers/iommu/iommu-sva.c > @@ -10,6 +10,8 @@ > #include "iommu-priv.h" > > static DEFINE_MUTEX(iommu_sva_lock); > +static bool iommu_sva_present; > +static LIST_HEAD(iommu_sva_mms); > static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > struct mm_struct *mm); > > @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de > return ERR_PTR(-ENOSPC); > } > iommu_mm->pasid = pasid; > + iommu_mm->mm = mm; > INIT_LIST_HEAD(&iommu_mm->sva_domains); > /* > * Make sure the write to mm->iommu_mm is not reordered in front of > @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm > if (ret) > goto out_free_domain; > domain->users = 1; > - list_add(&domain->next, &mm->iommu_mm->sva_domains); > > + if (list_empty(&iommu_mm->sva_domains)) { > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, true); > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); > + } > + list_add(&domain->next, &iommu_mm->sva_domains); > out: > refcount_set(&handle->users, 1); > mutex_unlock(&iommu_sva_lock); > @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) > list_del(&domain->next); > iommu_domain_free(domain); > } > + > + if (list_empty(&iommu_mm->sva_domains)) { > + list_del(&iommu_mm->mm_list_elm); > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, false); > + } > + > mutex_unlock(&iommu_sva_lock); > kfree(handle); > } > @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > > return domain; > } > + > +struct kva_invalidation_work_data { > + struct work_struct work; > + unsigned long start; > + unsigned long end; > +}; > + > +static void invalidate_kva_func(struct work_struct *work) > +{ > + struct kva_invalidation_work_data *data = > + container_of(work, struct kva_invalidation_work_data, work); > + struct 
iommu_mm_data *iommu_mm; > + > + guard(mutex)(&iommu_sva_lock); > + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) > + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, > + data->start, data->end); > + > + kfree(data); > +} > + > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) > +{ > + struct kva_invalidation_work_data *data; > + > + if (likely(!READ_ONCE(iommu_sva_present))) > + return; > + > + /* will be freed in the task function */ > + data = kzalloc(sizeof(*data), GFP_ATOMIC); > + if (!data) > + return; > + > + data->start = start; > + data->end = end; > + INIT_WORK(&data->work, invalidate_kva_func); > + > + /* > + * Since iommu_sva_mms is an unbound list, iterating it in an atomic > + * context could introduce significant latency issues. > + */ > + schedule_work(&data->work); > +} > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index c30d12e16473..66e4abb2df0d 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -1134,7 +1134,9 @@ struct iommu_sva { > > struct iommu_mm_data { > u32 pasid; > + struct mm_struct *mm; > struct list_head sva_domains; > + struct list_head mm_list_elm; > }; > > int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode); > @@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, > struct mm_struct *mm); > void iommu_sva_unbind_device(struct iommu_sva *handle); > u32 iommu_sva_get_pasid(struct iommu_sva *handle); > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end); > #else > static inline struct iommu_sva * > iommu_sva_bind_device(struct device *dev, struct mm_struct *mm) > @@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm) > } > > static inline void mm_pasid_drop(struct mm_struct *mm) {} > +static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {} > #endif /* CONFIG_IOMMU_SVA */ > > #ifdef CONFIG_IOMMU_IOPF ^ permalink raw reply [flat|nested] 51+ messages in thread
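[To make the coupling Ethan questions concrete: with the hunks above applied, any kernel-VA teardown that reaches flush_tlb_kernel_range(), for example the vfree() case named in the commit message, also enters the IOMMU path, whether or not any SVA domain exists. A trivial illustration follows; the wrapper function is hypothetical, while vmalloc()/vfree() are the standard interfaces.]

        #include <linux/mm.h>
        #include <linux/vmalloc.h>

        /*
         * Hypothetical illustration: tearing down a vmalloc mapping eventually
         * triggers flush_tlb_kernel_range() for the freed kernel VA range,
         * which with this patch also calls iommu_sva_invalidate_kva_range().
         * That call is an early-return no-op unless some mm is currently
         * bound to an SVA domain.
         */
        static void example_kernel_va_teardown(void)
        {
                void *buf = vmalloc(16 * PAGE_SIZE);

                if (!buf)
                        return;

                /* ... use buf ... */

                vfree(buf);     /* unmap + (possibly deferred) kernel TLB flush */
        }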
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-14 4:48 ` Ethan Zhao @ 2025-08-15 7:48 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 7:48 UTC (permalink / raw) To: Ethan Zhao, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: baolu.lu, iommu, security, linux-kernel, stable On 8/14/2025 12:48 PM, Ethan Zhao wrote: > > > On 8/6/2025 1:25 PM, Lu Baolu wrote: >> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware >> shares and walks the CPU's page tables. The Linux x86 architecture maps >> the kernel address space into the upper portion of every process’s page >> table. Consequently, in an SVA context, the IOMMU hardware can walk and >> cache kernel space mappings. However, the Linux kernel currently lacks >> a notification mechanism for kernel space mapping changes. This means >> the IOMMU driver is not aware of such changes, leading to a break in >> IOMMU cache coherence. >> >> Modern IOMMUs often cache page table entries of the intermediate-level >> page table as long as the entry is valid, no matter the permissions, to >> optimize walk performance. Currently the iommu driver is notified only >> for changes of user VA mappings, so the IOMMU's internal caches may >> retain stale entries for kernel VA. When kernel page table mappings are >> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale >> entries, Use-After-Free (UAF) vulnerability condition arises. >> >> If these freed page table pages are reallocated for a different purpose, >> potentially by an attacker, the IOMMU could misinterpret the new data as >> valid page table entries. This allows the IOMMU to walk into attacker- >> controlled memory, leading to arbitrary physical memory DMA access or >> privilege escalation. >> >> To mitigate this, introduce a new iommu interface to flush IOMMU caches. >> This interface should be invoked from architecture-specific code that >> manages combined user and kernel page tables, whenever a kernel page >> table >> update is done and the CPU TLB needs to be flushed. >> >> Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") >> Cc: stable@vger.kernel.org >> Suggested-by: Jann Horn <jannh@google.com> >> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> >> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> >> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> >> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com> >> Tested-by: Yi Lai <yi1.lai@intel.com> >> --- >> arch/x86/mm/tlb.c | 4 +++ >> drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- >> include/linux/iommu.h | 4 +++ >> 3 files changed, 67 insertions(+), 1 deletion(-) >> >> Change log: >> v3: >> - iommu_sva_mms is an unbound list; iterating it in an atomic context >> could introduce significant latency issues. Schedule it in a kernel >> thread and replace the spinlock with a mutex. >> - Replace the static key with a normal bool; it can be brought back if >> data shows the benefit. >> - Invalidate KVA range in the flush_tlb_all() paths. >> - All previous reviewed-bys are preserved. Please let me know if there >> are any objections. 
>> >> v2: >> - https://lore.kernel.org/linux-iommu/20250709062800.651521-1- >> baolu.lu@linux.intel.com/ >> - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); >> - Replace the mutex with a spinlock to make the interface usable in the >> critical regions. >> >> v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1- >> baolu.lu@linux.intel.com/ >> >> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c >> index 39f80111e6f1..3b85e7d3ba44 100644 >> --- a/arch/x86/mm/tlb.c >> +++ b/arch/x86/mm/tlb.c >> @@ -12,6 +12,7 @@ >> #include <linux/task_work.h> >> #include <linux/mmu_notifier.h> >> #include <linux/mmu_context.h> >> +#include <linux/iommu.h> >> #include <asm/tlbflush.h> >> #include <asm/mmu_context.h> >> @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) >> else >> /* Fall back to the IPI-based invalidation. */ >> on_each_cpu(do_flush_tlb_all, NULL, 1); >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > Establishing such a simple one-to-one connection between CPU TLB flush > and IOMMU TLB flush is debatable. At the very least, not every process > is attached to an IOMMU SVA domain. Currently, devices and IOMMU > operating in scalable mode are not commonly applied to every process. You're right. As discussed, I'll defer the IOTLB invalidation to a scheduled kernel work in pte_free_kernel(). The sva_invalidation_kva_range() function on the IOMMU side is actually a no-op if there's no SVA domain in use. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
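[The pte_free_kernel() direction is only stated in this reply, not implemented in the posting, so the following is purely a hypothetical sketch of what deferring the flush to the page-table-free path could look like. The structure and function names are invented for illustration; only iommu_sva_invalidate_kva_range() comes from the patch, and the ordering caveat in the comments is part of the assumption.]

        #include <linux/gfp.h>
        #include <linux/iommu.h>
        #include <linux/mm.h>
        #include <linux/slab.h>
        #include <linux/workqueue.h>

        /*
         * Hypothetical sketch, not the posted patch: hold a freed kernel PTE
         * page until the IOMMU caches have been invalidated, instead of
         * hooking every CPU TLB flush.
         */
        struct deferred_kernel_pte_free {
                struct work_struct work;
                struct page *page;              /* page-table page pending release */
                unsigned long start, end;       /* kernel VA range it used to map */
        };

        static void deferred_kernel_pte_free_fn(struct work_struct *work)
        {
                struct deferred_kernel_pte_free *d =
                        container_of(work, struct deferred_kernel_pte_free, work);

                /*
                 * Invalidate stale kernel-VA paging-structure caches before
                 * the page can be reused.  Note the v3 helper itself defers to
                 * another work item, so a real implementation would need a
                 * synchronous variant (or to walk the SVA mm list here) to
                 * guarantee the flush completes before the free below.
                 */
                iommu_sva_invalidate_kva_range(d->start, d->end);

                /*
                 * __free_page() stands in for whatever release path the
                 * architecture's pgalloc code actually uses for PTE pages.
                 */
                __free_page(d->page);
                kfree(d);
        }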