* [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush @ 2025-08-06 5:25 Lu Baolu 2025-08-06 15:03 ` Dave Hansen 2025-08-14 4:48 ` Ethan Zhao 0 siblings, 2 replies; 51+ messages in thread From: Lu Baolu @ 2025-08-06 5:25 UTC (permalink / raw) To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, Lu Baolu, stable In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The Linux x86 architecture maps the kernel address space into the upper portion of every process’s page table. Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel space mappings. However, the Linux kernel currently lacks a notification mechanism for kernel space mapping changes. This means the IOMMU driver is not aware of such changes, leading to a break in IOMMU cache coherence. Modern IOMMUs often cache page table entries of the intermediate-level page table as long as the entry is valid, no matter the permissions, to optimize walk performance. Currently the iommu driver is notified only for changes of user VA mappings, so the IOMMU's internal caches may retain stale entries for kernel VA. When kernel page table mappings are changed (e.g., by vfree()), but the IOMMU's internal caches retain stale entries, Use-After-Free (UAF) vulnerability condition arises. If these freed page table pages are reallocated for a different purpose, potentially by an attacker, the IOMMU could misinterpret the new data as valid page table entries. This allows the IOMMU to walk into attacker- controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. To mitigate this, introduce a new iommu interface to flush IOMMU caches. This interface should be invoked from architecture-specific code that manages combined user and kernel page tables, whenever a kernel page table update is done and the CPU TLB needs to be flushed. Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") Cc: stable@vger.kernel.org Suggested-by: Jann Horn <jannh@google.com> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> --- arch/x86/mm/tlb.c | 4 +++ drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- include/linux/iommu.h | 4 +++ 3 files changed, 67 insertions(+), 1 deletion(-) Change log: v3: - iommu_sva_mms is an unbound list; iterating it in an atomic context could introduce significant latency issues. Schedule it in a kernel thread and replace the spinlock with a mutex. - Replace the static key with a normal bool; it can be brought back if data shows the benefit. - Invalidate KVA range in the flush_tlb_all() paths. - All previous reviewed-bys are preserved. Please let me know if there are any objections. v2: - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/ - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); - Replace the mutex with a spinlock to make the interface usable in the critical regions. 
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 39f80111e6f1..3b85e7d3ba44 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -12,6 +12,7 @@ #include <linux/task_work.h> #include <linux/mmu_notifier.h> #include <linux/mmu_context.h> +#include <linux/iommu.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) else /* Fall back to the IPI-based invalidation. */ on_each_cpu(do_flush_tlb_all, NULL, 1); + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); } /* Flush an arbitrarily large range of memory with INVLPGB. */ @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) kernel_tlb_flush_range(info); put_flush_tlb_info(); + iommu_sva_invalidate_kva_range(start, end); } /* diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c index 1a51cfd82808..d0da2b3fd64b 100644 --- a/drivers/iommu/iommu-sva.c +++ b/drivers/iommu/iommu-sva.c @@ -10,6 +10,8 @@ #include "iommu-priv.h" static DEFINE_MUTEX(iommu_sva_lock); +static bool iommu_sva_present; +static LIST_HEAD(iommu_sva_mms); static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, struct mm_struct *mm); @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de return ERR_PTR(-ENOSPC); } iommu_mm->pasid = pasid; + iommu_mm->mm = mm; INIT_LIST_HEAD(&iommu_mm->sva_domains); /* * Make sure the write to mm->iommu_mm is not reordered in front of @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm if (ret) goto out_free_domain; domain->users = 1; - list_add(&domain->next, &mm->iommu_mm->sva_domains); + if (list_empty(&iommu_mm->sva_domains)) { + if (list_empty(&iommu_sva_mms)) + WRITE_ONCE(iommu_sva_present, true); + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); + } + list_add(&domain->next, &iommu_mm->sva_domains); out: refcount_set(&handle->users, 1); mutex_unlock(&iommu_sva_lock); @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) list_del(&domain->next); iommu_domain_free(domain); } + + if (list_empty(&iommu_mm->sva_domains)) { + list_del(&iommu_mm->mm_list_elm); + if (list_empty(&iommu_sva_mms)) + WRITE_ONCE(iommu_sva_present, false); + } + mutex_unlock(&iommu_sva_lock); kfree(handle); } @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, return domain; } + +struct kva_invalidation_work_data { + struct work_struct work; + unsigned long start; + unsigned long end; +}; + +static void invalidate_kva_func(struct work_struct *work) +{ + struct kva_invalidation_work_data *data = + container_of(work, struct kva_invalidation_work_data, work); + struct iommu_mm_data *iommu_mm; + + guard(mutex)(&iommu_sva_lock); + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, + data->start, data->end); + + kfree(data); +} + +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) +{ + struct kva_invalidation_work_data *data; + + if (likely(!READ_ONCE(iommu_sva_present))) + return; + + /* will be freed in the task function */ + data = kzalloc(sizeof(*data), GFP_ATOMIC); + if (!data) + return; + + data->start = start; + data->end = end; + INIT_WORK(&data->work, invalidate_kva_func); + + /* + * Since iommu_sva_mms is an unbound list, iterating it in an atomic + * context could introduce significant latency issues. 
+ */ + schedule_work(&data->work); +} diff --git a/include/linux/iommu.h b/include/linux/iommu.h index c30d12e16473..66e4abb2df0d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -1134,7 +1134,9 @@ struct iommu_sva { struct iommu_mm_data { u32 pasid; + struct mm_struct *mm; struct list_head sva_domains; + struct list_head mm_list_elm; }; int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode); @@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm); void iommu_sva_unbind_device(struct iommu_sva *handle); u32 iommu_sva_get_pasid(struct iommu_sva *handle); +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end); #else static inline struct iommu_sva * iommu_sva_bind_device(struct device *dev, struct mm_struct *mm) @@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm) } static inline void mm_pasid_drop(struct mm_struct *mm) {} +static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {} #endif /* CONFIG_IOMMU_SVA */ #ifdef CONFIG_IOMMU_IOPF -- 2.43.0 ^ permalink raw reply related [flat|nested] 51+ messages in thread
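For context (not part of the patch): mmu_notifier_arch_invalidate_secondary_tlbs(), which invalidate_kva_func() above calls for every mm on the SVA list, invokes the arch_invalidate_secondary_tlbs() callback of each mmu_notifier registered against that mm. A sketch of the driver-side receiver is below; the names are illustrative and not taken from any in-tree IOMMU driver:

#include <linux/mmu_notifier.h>

static void example_sva_invalidate_secondary_tlbs(struct mmu_notifier *mn,
						  struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
{
	/*
	 * Flush the IOTLB and paging-structure caches for the PASID(s)
	 * bound to this mm, covering [start, end). With this patch the
	 * range may now also be a kernel VA range (or 0 .. TLB_FLUSH_ALL).
	 */
}

static const struct mmu_notifier_ops example_sva_mmu_notifier_ops = {
	.arch_invalidate_secondary_tlbs = example_sva_invalidate_secondary_tlbs,
};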
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 5:25 [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu @ 2025-08-06 15:03 ` Dave Hansen 2025-08-06 15:52 ` Jason Gunthorpe 2025-08-07 6:53 ` Baolu Lu 2025-08-14 4:48 ` Ethan Zhao 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-06 15:03 UTC (permalink / raw) To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/5/25 22:25, Lu Baolu wrote: > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware > shares and walks the CPU's page tables. The Linux x86 architecture maps > the kernel address space into the upper portion of every process’s page > table. Consequently, in an SVA context, the IOMMU hardware can walk and > cache kernel space mappings. However, the Linux kernel currently lacks > a notification mechanism for kernel space mapping changes. This means > the IOMMU driver is not aware of such changes, leading to a break in > IOMMU cache coherence. FWIW, I wouldn't use the term "cache coherence" in this context. I'd probably just call them "stale IOTLB entries". I also think this over states the problem. There is currently no problem with "kernel space mapping changes". The issue is solely when kernel page table pages are freed and reused. > Modern IOMMUs often cache page table entries of the intermediate-level > page table as long as the entry is valid, no matter the permissions, to > optimize walk performance. Currently the iommu driver is notified only > for changes of user VA mappings, so the IOMMU's internal caches may > retain stale entries for kernel VA. When kernel page table mappings are > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale > entries, Use-After-Free (UAF) vulnerability condition arises. > > If these freed page table pages are reallocated for a different purpose, > potentially by an attacker, the IOMMU could misinterpret the new data as > valid page table entries. This allows the IOMMU to walk into attacker- > controlled memory, leading to arbitrary physical memory DMA access or > privilege escalation. Note that it's not just use-after-free. It's literally that the IOMMU will keep writing Accessed and Dirty bits while it thinks the page is still a page table. The IOMMU will sit there happily setting bits. So, it's _write_ after free too. > To mitigate this, introduce a new iommu interface to flush IOMMU caches. > This interface should be invoked from architecture-specific code that > manages combined user and kernel page tables, whenever a kernel page table > update is done and the CPU TLB needs to be flushed. There's one tidbit missing from this: Currently SVA contexts are all unprivileged. They can only access user mappings and never kernel mappings. However, IOMMU still walk kernel-only page tables all the way down to the leaf where they realize that the entry is a kernel mapping and error out. 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 39f80111e6f1..3b85e7d3ba44 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -12,6 +12,7 @@ > #include <linux/task_work.h> > #include <linux/mmu_notifier.h> > #include <linux/mmu_context.h> > +#include <linux/iommu.h> > > #include <asm/tlbflush.h> > #include <asm/mmu_context.h> > @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) > else > /* Fall back to the IPI-based invalidation. */ > on_each_cpu(do_flush_tlb_all, NULL, 1); > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > } > > /* Flush an arbitrarily large range of memory with INVLPGB. */ > @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) > kernel_tlb_flush_range(info); > > put_flush_tlb_info(); > + iommu_sva_invalidate_kva_range(start, end); > } These desperately need to be commented. They especially need comments that they are solely for flushing the IOMMU mid-level paging structures after freeing a page table page. > /* > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > index 1a51cfd82808..d0da2b3fd64b 100644 > --- a/drivers/iommu/iommu-sva.c > +++ b/drivers/iommu/iommu-sva.c > @@ -10,6 +10,8 @@ > #include "iommu-priv.h" > > static DEFINE_MUTEX(iommu_sva_lock); > +static bool iommu_sva_present; > +static LIST_HEAD(iommu_sva_mms); > static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > struct mm_struct *mm); > > @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de > return ERR_PTR(-ENOSPC); > } > iommu_mm->pasid = pasid; > + iommu_mm->mm = mm; > INIT_LIST_HEAD(&iommu_mm->sva_domains); > /* > * Make sure the write to mm->iommu_mm is not reordered in front of > @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm > if (ret) > goto out_free_domain; > domain->users = 1; > - list_add(&domain->next, &mm->iommu_mm->sva_domains); > > + if (list_empty(&iommu_mm->sva_domains)) { > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, true); > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); > + } > + list_add(&domain->next, &iommu_mm->sva_domains); > out: > refcount_set(&handle->users, 1); > mutex_unlock(&iommu_sva_lock); > @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) > list_del(&domain->next); > iommu_domain_free(domain); > } > + > + if (list_empty(&iommu_mm->sva_domains)) { > + list_del(&iommu_mm->mm_list_elm); > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, false); > + } > + > mutex_unlock(&iommu_sva_lock); > kfree(handle); > } > @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > > return domain; > } > + > +struct kva_invalidation_work_data { > + struct work_struct work; > + unsigned long start; > + unsigned long end; > +}; > + > +static void invalidate_kva_func(struct work_struct *work) > +{ > + struct kva_invalidation_work_data *data = > + container_of(work, struct kva_invalidation_work_data, work); > + struct iommu_mm_data *iommu_mm; > + > + guard(mutex)(&iommu_sva_lock); > + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) > + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, > + data->start, data->end); > + > + kfree(data); > +} > + > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) > +{ > + struct kva_invalidation_work_data *data; > + > + if (likely(!READ_ONCE(iommu_sva_present))) > + return; It would be nice to hear a few more 
words about why this is safe without a lock. > + /* will be freed in the task function */ > + data = kzalloc(sizeof(*data), GFP_ATOMIC); > + if (!data) > + return; > + > + data->start = start; > + data->end = end; > + INIT_WORK(&data->work, invalidate_kva_func); > + > + /* > + * Since iommu_sva_mms is an unbound list, iterating it in an atomic > + * context could introduce significant latency issues. > + */ > + schedule_work(&data->work); > +} Hold on a sec, though. the problematic caller of this looks something like this (logically): pmd_free_pte_page() { pte = pmd_page_vaddr(*pmd); pmd_clear(pmd); flush_tlb_kernel_range(...); // does schedule_work() pte_free_kernel(pte); } It _immediately_ frees the PTE page. The schedule_work() work will probably happen sometime after the page is freed. Isn't that still a use-after-free? It's for some arbitrary amount of time and better than before but it's still a use-after-free. ^ permalink raw reply [flat|nested] 51+ messages in thread
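To make the window concrete, the same flow can be annotated as follows (illustrative only; simplified from the real x86 pmd_free_pte_page()):

pmd_free_pte_page()
{
	pte = pmd_page_vaddr(*pmd);
	pmd_clear(pmd);                 /* CPU view of the PTE page is gone */
	flush_tlb_kernel_range(...);    /* with this patch: only *queues* the
	                                   IOTLB flush via schedule_work() */
	pte_free_kernel(pte);           /* page freed immediately; the IOMMU may
	                                   still hold a paging-structure cache
	                                   entry pointing at it until the queued
	                                   work actually runs */
}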
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:03 ` Dave Hansen @ 2025-08-06 15:52 ` Jason Gunthorpe 2025-08-06 16:04 ` Dave Hansen 2025-08-07 6:53 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 15:52 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 08:03:42AM -0700, Dave Hansen wrote: > Hold on a sec, though. the problematic caller of this looks something > like this (logically): > > pmd_free_pte_page() > { > pte = pmd_page_vaddr(*pmd); > pmd_clear(pmd); > flush_tlb_kernel_range(...); // does schedule_work() > pte_free_kernel(pte); > } > > It _immediately_ frees the PTE page. The schedule_work() work will > probably happen sometime after the page is freed. > > Isn't that still a use-after-free? It's for some arbitrary amount of > time and better than before but it's still a use-after-free. Yes it is. You can't do this approach without also pushing the pages to freed on a list and defering the free till the work. This is broadly what the normal mm user flow is doing.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
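For context, the "normal mm user flow" referenced here is the mmu_gather batching used when user page tables are torn down: page-table pages are queued on the gather structure and only freed after the TLB flush (and the secondary-TLB notification) has gone out. Very roughly, and heavily simplified:

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	...
	pte_free_tlb(&tlb, pte_page, addr);	/* queue the page-table page,
						   do not free it yet */
	...
	tlb_finish_mmu(&tlb);			/* flush TLBs first, then free
						   everything that was queued */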
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:52 ` Jason Gunthorpe @ 2025-08-06 16:04 ` Dave Hansen 2025-08-06 16:09 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-06 16:04 UTC (permalink / raw) To: Jason Gunthorpe Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/6/25 08:52, Jason Gunthorpe wrote: >> Isn't that still a use-after-free? It's for some arbitrary amount of >> time and better than before but it's still a use-after-free. > Yes it is. > > You can't do this approach without also pushing the pages to freed on > a list and defering the free till the work. This is broadly what the > normal mm user flow is doing.. FWIW, I think the simplest way to do this is to plop an unconditional schedule_work() in pte_free_kernel(). The work function will invalidate the IOTLBs and then free the page. Keep the schedule_work() unconditional to keep it simple. The schedule_work() is way cheaper than all the system-wide TLB invalidation IPIs that have to get sent as well. No need to add complexity to optimize out something that's in the noise already. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:04 ` Dave Hansen @ 2025-08-06 16:09 ` Jason Gunthorpe 2025-08-06 16:34 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 16:09 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 09:04:29AM -0700, Dave Hansen wrote: > On 8/6/25 08:52, Jason Gunthorpe wrote: > >> Isn't that still a use-after-free? It's for some arbitrary amount of > >> time and better than before but it's still a use-after-free. > > Yes it is. > > > > You can't do this approach without also pushing the pages to freed on > > a list and defering the free till the work. This is broadly what the > > normal mm user flow is doing.. > > FWIW, I think the simplest way to do this is to plop an unconditional > schedule_work() in pte_free_kernel(). The work function will invalidate > the IOTLBs and then free the page. > > Keep the schedule_work() unconditional to keep it simple. The > schedule_work() is way cheaper than all the system-wide TLB invalidation > IPIs that have to get sent as well. No need to add complexity to > optimize out something that's in the noise already. That works also, but now you have to allocate memory or you are dead.. Is it OK these days, and safe in this code which seems a little bit linked to memory management? The MM side avoided this by putting the list and the rcu_head in the struct page. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:09 ` Jason Gunthorpe @ 2025-08-06 16:34 ` Dave Hansen 2025-08-06 16:42 ` Jason Gunthorpe 2025-08-07 14:40 ` Baolu Lu 0 siblings, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-06 16:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/6/25 09:09, Jason Gunthorpe wrote: >>> >>> You can't do this approach without also pushing the pages to freed on >>> a list and defering the free till the work. This is broadly what the >>> normal mm user flow is doing.. >> FWIW, I think the simplest way to do this is to plop an unconditional >> schedule_work() in pte_free_kernel(). The work function will invalidate >> the IOTLBs and then free the page. >> >> Keep the schedule_work() unconditional to keep it simple. The >> schedule_work() is way cheaper than all the system-wide TLB invalidation >> IPIs that have to get sent as well. No need to add complexity to >> optimize out something that's in the noise already. > That works also, but now you have to allocate memory or you are > dead.. Is it OK these days, and safe in this code which seems a little > bit linked to memory management? > > The MM side avoided this by putting the list and the rcu_head in the > struct page. I don't think you need to allocate memory. A little static structure that uses the page->list and has a lock should do. Logically something like this: struct kernel_pgtable_work { struct list_head list; spinlock_t lock; struct work_struct work; } kernel_pte_work; pte_free_kernel() { struct page *page = ptdesc_magic(); guard(spinlock)(&kernel_pte_work.lock); list_add(&page->list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); } work_func() { iommu_sva_invalidate_kva(); guard(spinlock)(&kernel_pte_work.lock); list_for_each_safe() { page = container_of(...); free_whatever(page); } } The only wrinkle is that pte_free_kernel() itself still has a pte and 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which should be unused at this point, especially for non-pgd pages on x86. So, either go over to the 'struct page' earlier (maybe by open-coding pagetable_dtor_free()?), or just use the ptdesc. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:34 ` Dave Hansen @ 2025-08-06 16:42 ` Jason Gunthorpe 2025-08-07 14:40 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-06 16:42 UTC (permalink / raw) To: Dave Hansen Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Wed, Aug 06, 2025 at 09:34:12AM -0700, Dave Hansen wrote: > struct kernel_pgtable_work > { > struct list_head list; > spinlock_t lock; > struct work_struct work; > } kernel_pte_work; > > pte_free_kernel() > { > struct page *page = ptdesc_magic(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_add(&page->list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } Oh, OK, yeah this can work > The only wrinkle is that pte_free_kernel() itself still has a pte and > 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which > should be unused at this point, especially for non-pgd pages on x86. It should all be ptdesc, so lets avoid a struct page here and use the pt_list instead.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 16:34 ` Dave Hansen 2025-08-06 16:42 ` Jason Gunthorpe @ 2025-08-07 14:40 ` Baolu Lu 2025-08-07 15:31 ` Dave Hansen 2025-08-07 19:51 ` Jason Gunthorpe 1 sibling, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-07 14:40 UTC (permalink / raw) To: Dave Hansen, Jason Gunthorpe Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/2025 12:34 AM, Dave Hansen wrote: > On 8/6/25 09:09, Jason Gunthorpe wrote: >>>> >>>> You can't do this approach without also pushing the pages to freed on >>>> a list and defering the free till the work. This is broadly what the >>>> normal mm user flow is doing.. >>> FWIW, I think the simplest way to do this is to plop an unconditional >>> schedule_work() in pte_free_kernel(). The work function will invalidate >>> the IOTLBs and then free the page. >>> >>> Keep the schedule_work() unconditional to keep it simple. The >>> schedule_work() is way cheaper than all the system-wide TLB invalidation >>> IPIs that have to get sent as well. No need to add complexity to >>> optimize out something that's in the noise already. >> That works also, but now you have to allocate memory or you are >> dead.. Is it OK these days, and safe in this code which seems a little >> bit linked to memory management? >> >> The MM side avoided this by putting the list and the rcu_head in the >> struct page. > > I don't think you need to allocate memory. A little static structure > that uses the page->list and has a lock should do. Logically something > like this: > > struct kernel_pgtable_work > { > struct list_head list; > spinlock_t lock; > struct work_struct work; > } kernel_pte_work; > > pte_free_kernel() > { > struct page *page = ptdesc_magic(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_add(&page->list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } > > work_func() > { > iommu_sva_invalidate_kva(); > > guard(spinlock)(&kernel_pte_work.lock); > > list_for_each_safe() { > page = container_of(...); > free_whatever(page); > } > } > > The only wrinkle is that pte_free_kernel() itself still has a pte and > 'ptdesc', not a 'struct page'. But there is ptdesc->pt_list, which > should be unused at this point, especially for non-pgd pages on x86. > > So, either go over to the 'struct page' earlier (maybe by open-coding > pagetable_dtor_free()?), or just use the ptdesc. I refactored the code above as follows. It compiles but hasn't been tested yet. Does it look good to you? 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index c88691b15f3c..d9307dd09f67 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -10,9 +10,11 @@ #define __HAVE_ARCH_PTE_ALLOC_ONE #define __HAVE_ARCH_PGD_FREE +#define __HAVE_ARCH_PTE_FREE_KERNEL #include <asm-generic/pgalloc.h> static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; } +void pte_free_kernel(struct mm_struct *mm, pte_t *pte); #ifdef CONFIG_PARAVIRT_XXL #include <asm/paravirt.h> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..f9f6738dd3cc 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -2,6 +2,7 @@ #include <linux/mm.h> #include <linux/gfp.h> #include <linux/hugetlb.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h> #include <asm/fixmap.h> @@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud) /* See note in arch_check_zapped_pte() */ VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); } + +static void kernel_pte_work_func(struct work_struct *work); + +static struct { + struct list_head list; + spinlock_t lock; + struct work_struct work; +} kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +}; + +static void kernel_pte_work_func(struct work_struct *work) +{ + struct page *page, *next; + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); + + guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { + list_del_init(&page->lru); + pagetable_dtor_free(page_ptdesc(page)); + } +} + +/** + * pte_free_kernel - free PTE-level kernel page table memory + * @mm: the mm_struct of the current context + * @pte: pointer to the memory containing the page table + */ +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) +{ + struct page *page = virt_to_page(pte); + + guard(spinlock)(&kernel_pte_work.lock); + list_add(&page->lru, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..716ebab67636 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { pagetable_dtor_free(virt_to_ptdesc(pte)); } +#endif /** * __pte_alloc_one - allocate memory for a PTE-level user page table -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 14:40 ` Baolu Lu @ 2025-08-07 15:31 ` Dave Hansen 2025-08-08 5:15 ` Baolu Lu 2025-08-07 19:51 ` Jason Gunthorpe 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-07 15:31 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/25 07:40, Baolu Lu wrote: ... > I refactored the code above as follows. It compiles but hasn't been > tested yet. Does it look good to you? As in, it takes the non-compiling gunk I spewed into my email client and makes it compile, yeah. Sure. ;) > diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/ > pgalloc.h > index c88691b15f3c..d9307dd09f67 100644 > --- a/arch/x86/include/asm/pgalloc.h > +++ b/arch/x86/include/asm/pgalloc.h > @@ -10,9 +10,11 @@ > > #define __HAVE_ARCH_PTE_ALLOC_ONE > #define __HAVE_ARCH_PGD_FREE > +#define __HAVE_ARCH_PTE_FREE_KERNEL But I think it really muddies the waters down here. This kinda reads like "x86 has its own per-arch pte_free_kernel() that it always needs". Which is far from accurate. > @@ -844,3 +845,42 @@ void arch_check_zapped_pud(struct vm_area_struct > *vma, pud_t pud) > /* See note in arch_check_zapped_pte() */ > VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud)); > } > + > +static void kernel_pte_work_func(struct work_struct *work); > + > +static struct { > + struct list_head list; > + spinlock_t lock; > + struct work_struct work; > +} kernel_pte_work = { > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > kernel_pte_work_func), > +}; > + > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct page *page, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > + list_del_init(&page->lru); > + pagetable_dtor_free(page_ptdesc(page)); > + } > +} > + > +/** > + * pte_free_kernel - free PTE-level kernel page table memory > + * @mm: the mm_struct of the current context > + * @pte: pointer to the memory containing the page table > + */ The kerneldoc here is just wasted bytes, IMNHO. Why not use those bytes to actually explain what the heck is going on here? > +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > +{ > + struct page *page = virt_to_page(pte); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_add(&page->lru, &kernel_pte_work.list); > + schedule_work(&kernel_pte_work.work); > +} > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 3c8ec3bfea44..716ebab67636 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -46,6 +46,7 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > #define pte_alloc_one_kernel(...) 
> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > #endif > > +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL > /** > * pte_free_kernel - free PTE-level kernel page table memory > * @mm: the mm_struct of the current context > @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct > *mm, pte_t *pte) > { > pagetable_dtor_free(virt_to_ptdesc(pte)); > } > +#endif > > /** > * __pte_alloc_one - allocate memory for a PTE-level user page table I'd much rather the arch-generic code looked like this: #ifdef CONFIG_ASYNC_PGTABLE_FREE // code and struct here, or dump them over in some // other file and do this in a header #else static void pte_free_kernel_async(struct page *page) {} #endif void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct page *page = virt_to_page(pte); if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { pte_free_kernel_async(page); else pagetable_dtor_free(page_ptdesc(page)); } Then in Kconfig, you end up with something like: config ASYNC_PGTABLE_FREE def_bool y depends on INTEL_IOMMU_WHATEVER That very much tells much more of the whole story in code. It also gives the x86 folks that compile out the IOMMU the exact same code as the arch-generic folks. It _also_ makes it dirt simple and obvious for the x86 folks to optimize out the async behavior if they don't like it in the future by replacing the compile-time IOMMU check with a runtime one. Also, if another crazy IOMMU implementation comes along that happens to do what the x86 IOMMUs do, then they have a single Kconfig switch to flip. If they follow what this patch tries to do, they'll start by copying and pasting the x86 implementation. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 15:31 ` Dave Hansen @ 2025-08-08 5:15 ` Baolu Lu 2025-08-10 7:19 ` Ethan Zhao 2025-08-11 12:57 ` Jason Gunthorpe 0 siblings, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-08 5:15 UTC (permalink / raw) To: Dave Hansen, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/7/25 23:31, Dave Hansen wrote: >> +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >> +{ >> + struct page *page = virt_to_page(pte); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_add(&page->lru, &kernel_pte_work.list); >> + schedule_work(&kernel_pte_work.work); >> +} >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index 3c8ec3bfea44..716ebab67636 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -46,6 +46,7 @@ static inline pte_t >> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >> #define pte_alloc_one_kernel(...) >> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) >> #endif >> >> +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL >> /** >> * pte_free_kernel - free PTE-level kernel page table memory >> * @mm: the mm_struct of the current context >> @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct >> *mm, pte_t *pte) >> { >> pagetable_dtor_free(virt_to_ptdesc(pte)); >> } >> +#endif >> >> /** >> * __pte_alloc_one - allocate memory for a PTE-level user page table > I'd much rather the arch-generic code looked like this: > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > // code and struct here, or dump them over in some > // other file and do this in a header > #else > static void pte_free_kernel_async(struct page *page) {} > #endif > > void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > struct page *page = virt_to_page(pte); > > if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { > pte_free_kernel_async(page); > else > pagetable_dtor_free(page_ptdesc(page)); > } > > Then in Kconfig, you end up with something like: > > config ASYNC_PGTABLE_FREE > def_bool y > depends on INTEL_IOMMU_WHATEVER > > That very much tells much more of the whole story in code. It also gives > the x86 folks that compile out the IOMMU the exact same code as the > arch-generic folks. It_also_ makes it dirt simple and obvious for the > x86 folks to optimize out the async behavior if they don't like it in > the future by replacing the compile-time IOMMU check with a runtime one. > > Also, if another crazy IOMMU implementation comes along that happens to > do what the x86 IOMMUs do, then they have a single Kconfig switch to > flip. If they follow what this patch tries to do, they'll start by > copying and pasting the x86 implementation. I'll do it like this. Does that look good to you? 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 70d29b14d851..6f1113e024fa 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -160,6 +160,7 @@ config IOMMU_DMA # Shared Virtual Addressing config IOMMU_SVA select IOMMU_MM_DATA + select ASYNC_PGTABLE_FREE if X86 bool config IOMMU_IOPF diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..dbddacdca2ce 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,19 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +struct pgtable_free_work { + struct list_head list; + spinlock_t lock; + struct work_struct work; +}; +extern struct pgtable_free_work kernel_pte_work; + +void pte_free_kernel_async(struct ptdesc *ptdesc); +#else +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +66,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + pte_free_kernel_async(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..528550cfa7fe 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool "Asynchronous kernel page table freeing" + help + Perform kernel page table freeing asynchronously. This is required + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB + paging structure caches. + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..6639ee6641d4 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -13,6 +13,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/mm_inline.h> +#include <linux/iommu.h> #include <asm/pgalloc.h> #include <asm/tlb.h> @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pte_work_func(struct work_struct *work); +struct pgtable_free_work kernel_pte_work = { + .list = LIST_HEAD_INIT(kernel_pte_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), + .work = __WORK_INITIALIZER(kernel_pte_work.work, kernel_pte_work_func), +}; + +static void kernel_pte_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next; + + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); + + guard(spinlock)(&kernel_pte_work.lock); + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { + list_del_init(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } +} + +void pte_free_kernel_async(struct ptdesc *ptdesc) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_add(&ptdesc->pt_list, &kernel_pte_work.list); + schedule_work(&kernel_pte_work.work); +} +#endif -- 2.43.0 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 5:15 ` Baolu Lu @ 2025-08-10 7:19 ` Ethan Zhao 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:57 ` Jason Gunthorpe 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-10 7:19 UTC (permalink / raw) To: Baolu Lu, Dave Hansen, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/8/2025 1:15 PM, Baolu Lu wrote: > On 8/7/25 23:31, Dave Hansen wrote: >>> +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >>> +{ >>> + struct page *page = virt_to_page(pte); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_add(&page->lru, &kernel_pte_work.list); >>> + schedule_work(&kernel_pte_work.work); >>> +} >>> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/ >>> pgalloc.h >>> index 3c8ec3bfea44..716ebab67636 100644 >>> --- a/include/asm-generic/pgalloc.h >>> +++ b/include/asm-generic/pgalloc.h >>> @@ -46,6 +46,7 @@ static inline pte_t >>> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >>> #define pte_alloc_one_kernel(...) >>> alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) >>> #endif >>> >>> +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL >>> /** >>> * pte_free_kernel - free PTE-level kernel page table memory >>> * @mm: the mm_struct of the current context >>> @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct >>> *mm, pte_t *pte) >>> { >>> pagetable_dtor_free(virt_to_ptdesc(pte)); >>> } >>> +#endif >>> >>> /** >>> * __pte_alloc_one - allocate memory for a PTE-level user page table >> I'd much rather the arch-generic code looked like this: >> >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> // code and struct here, or dump them over in some >> // other file and do this in a header >> #else >> static void pte_free_kernel_async(struct page *page) {} >> #endif >> >> void pte_free_kernel(struct mm_struct *mm, pte_t *pte) >> { >> struct page *page = virt_to_page(pte); >> >> if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { >> pte_free_kernel_async(page); >> else >> pagetable_dtor_free(page_ptdesc(page)); >> } >> >> Then in Kconfig, you end up with something like: >> >> config ASYNC_PGTABLE_FREE >> def_bool y >> depends on INTEL_IOMMU_WHATEVER >> >> That very much tells much more of the whole story in code. It also gives >> the x86 folks that compile out the IOMMU the exact same code as the >> arch-generic folks. It_also_ makes it dirt simple and obvious for the >> x86 folks to optimize out the async behavior if they don't like it in >> the future by replacing the compile-time IOMMU check with a runtime one. >> >> Also, if another crazy IOMMU implementation comes along that happens to >> do what the x86 IOMMUs do, then they have a single Kconfig switch to >> flip. If they follow what this patch tries to do, they'll start by >> copying and pasting the x86 implementation. > > I'll do it like this. Does that look good to you? 
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index 70d29b14d851..6f1113e024fa 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -160,6 +160,7 @@ config IOMMU_DMA > # Shared Virtual Addressing > config IOMMU_SVA > select IOMMU_MM_DATA > + select ASYNC_PGTABLE_FREE if X86 > bool > > config IOMMU_IOPF > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 3c8ec3bfea44..dbddacdca2ce 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -46,6 +46,19 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > #define pte_alloc_one_kernel(...) > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > #endif > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > +struct pgtable_free_work { > + struct list_head list; > + spinlock_t lock; > + struct work_struct work; > +}; > +extern struct pgtable_free_work kernel_pte_work; > + > +void pte_free_kernel_async(struct ptdesc *ptdesc); > +#else > +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > +#endif > + > /** > * pte_free_kernel - free PTE-level kernel page table memory > * @mm: the mm_struct of the current context > @@ -53,7 +66,12 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > */ > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > - pagetable_dtor_free(virt_to_ptdesc(pte)); > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > + > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + pte_free_kernel_async(ptdesc); > + else > + pagetable_dtor_free(ptdesc); > } > > /** > diff --git a/mm/Kconfig b/mm/Kconfig > index e443fe8cd6cf..528550cfa7fe 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA > config IOMMU_MM_DATA > bool > > +config ASYNC_PGTABLE_FREE > + bool "Asynchronous kernel page table freeing" > + help > + Perform kernel page table freeing asynchronously. This is required > + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB > + paging structure caches. 
> + > config EXECMEM > bool > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index 567e2d084071..6639ee6641d4 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -13,6 +13,7 @@ > #include <linux/swap.h> > #include <linux/swapops.h> > #include <linux/mm_inline.h> > +#include <linux/iommu.h> > #include <asm/pgalloc.h> > #include <asm/tlb.h> > > @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, > pmd_t *pmd, > pte_unmap_unlock(pte, ptl); > goto again; > } > + > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > +static void kernel_pte_work_func(struct work_struct *work); > +struct pgtable_free_work kernel_pte_work = { > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > kernel_pte_work_func), > +}; > + > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, > pt_list) { > + list_del_init(&ptdesc->pt_list); > + pagetable_dtor_free(ptdesc); > + } > +} > + > +void pte_free_kernel_async(struct ptdesc *ptdesc) > +{ > + guard(spinlock)(&kernel_pte_work.lock); > + list_add(&ptdesc->pt_list, &kernel_pte_work.list); > + schedule_work(&kernel_pte_work.work); > +} kernel_pte_work.list is global shared var, it would make the producer pte_free_kernel() and the consumer kernel_pte_work_func() to operate in serialized timing. In a large system, I don't think you design this deliberately :) Thanks, Ethan > +#endif ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-10 7:19 ` Ethan Zhao @ 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:55 ` Jason Gunthorpe 2025-08-11 13:55 ` Dave Hansen 0 siblings, 2 replies; 51+ messages in thread From: Uladzislau Rezki @ 2025-08-11 9:15 UTC (permalink / raw) To: Ethan Zhao, Baolu Lu Cc: Baolu Lu, Dave Hansen, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Sun, Aug 10, 2025 at 03:19:58PM +0800, Ethan Zhao wrote: > > > On 8/8/2025 1:15 PM, Baolu Lu wrote: > > On 8/7/25 23:31, Dave Hansen wrote: > > > > +void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > > +{ > > > > + struct page *page = virt_to_page(pte); > > > > + > > > > + guard(spinlock)(&kernel_pte_work.lock); > > > > + list_add(&page->lru, &kernel_pte_work.list); > > > > + schedule_work(&kernel_pte_work.work); > > > > +} > > > > diff --git a/include/asm-generic/pgalloc.h > > > > b/include/asm-generic/ pgalloc.h > > > > index 3c8ec3bfea44..716ebab67636 100644 > > > > --- a/include/asm-generic/pgalloc.h > > > > +++ b/include/asm-generic/pgalloc.h > > > > @@ -46,6 +46,7 @@ static inline pte_t > > > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > > > #define pte_alloc_one_kernel(...) > > > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > > > > #endif > > > > > > > > +#ifndef __HAVE_ARCH_PTE_FREE_KERNEL > > > > /** > > > > * pte_free_kernel - free PTE-level kernel page table memory > > > > * @mm: the mm_struct of the current context > > > > @@ -55,6 +56,7 @@ static inline void pte_free_kernel(struct mm_struct > > > > *mm, pte_t *pte) > > > > { > > > > pagetable_dtor_free(virt_to_ptdesc(pte)); > > > > } > > > > +#endif > > > > > > > > /** > > > > * __pte_alloc_one - allocate memory for a PTE-level user page table > > > I'd much rather the arch-generic code looked like this: > > > > > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > > > // code and struct here, or dump them over in some > > > // other file and do this in a header > > > #else > > > static void pte_free_kernel_async(struct page *page) {} > > > #endif > > > > > > void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > > { > > > struct page *page = virt_to_page(pte); > > > > > > if (IS_DEFINED(CONFIG_ASYNC_PGTABLE_FREE)) { > > > pte_free_kernel_async(page); > > > else > > > pagetable_dtor_free(page_ptdesc(page)); > > > } > > > > > > Then in Kconfig, you end up with something like: > > > > > > config ASYNC_PGTABLE_FREE > > > def_bool y > > > depends on INTEL_IOMMU_WHATEVER > > > > > > That very much tells much more of the whole story in code. It also gives > > > the x86 folks that compile out the IOMMU the exact same code as the > > > arch-generic folks. It_also_ makes it dirt simple and obvious for the > > > x86 folks to optimize out the async behavior if they don't like it in > > > the future by replacing the compile-time IOMMU check with a runtime one. > > > > > > Also, if another crazy IOMMU implementation comes along that happens to > > > do what the x86 IOMMUs do, then they have a single Kconfig switch to > > > flip. If they follow what this patch tries to do, they'll start by > > > copying and pasting the x86 implementation. > > > > I'll do it like this. Does that look good to you? 
> > > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > > index 70d29b14d851..6f1113e024fa 100644 > > --- a/drivers/iommu/Kconfig > > +++ b/drivers/iommu/Kconfig > > @@ -160,6 +160,7 @@ config IOMMU_DMA > > # Shared Virtual Addressing > > config IOMMU_SVA > > select IOMMU_MM_DATA > > + select ASYNC_PGTABLE_FREE if X86 > > bool > > > > config IOMMU_IOPF > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > > index 3c8ec3bfea44..dbddacdca2ce 100644 > > --- a/include/asm-generic/pgalloc.h > > +++ b/include/asm-generic/pgalloc.h > > @@ -46,6 +46,19 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > #define pte_alloc_one_kernel(...) > > alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) > > #endif > > > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +struct pgtable_free_work { > > + struct list_head list; > > + spinlock_t lock; > > + struct work_struct work; > > +}; > > +extern struct pgtable_free_work kernel_pte_work; > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc); > > +#else > > +static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > > +#endif > > + > > /** > > * pte_free_kernel - free PTE-level kernel page table memory > > * @mm: the mm_struct of the current context > > @@ -53,7 +66,12 @@ static inline pte_t > > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > */ > > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > > { > > - pagetable_dtor_free(virt_to_ptdesc(pte)); > > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > > + > > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > > + pte_free_kernel_async(ptdesc); > > + else > > + pagetable_dtor_free(ptdesc); > > } > > > > /** > > diff --git a/mm/Kconfig b/mm/Kconfig > > index e443fe8cd6cf..528550cfa7fe 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -1346,6 +1346,13 @@ config LOCK_MM_AND_FIND_VMA > > config IOMMU_MM_DATA > > bool > > > > +config ASYNC_PGTABLE_FREE > > + bool "Asynchronous kernel page table freeing" > > + help > > + Perform kernel page table freeing asynchronously. This is required > > + for systems with IOMMU Shared Virtual Address (SVA) to flush IOTLB > > + paging structure caches. 
> > + > > config EXECMEM > > bool > > > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > index 567e2d084071..6639ee6641d4 100644 > > --- a/mm/pgtable-generic.c > > +++ b/mm/pgtable-generic.c > > @@ -13,6 +13,7 @@ > > #include <linux/swap.h> > > #include <linux/swapops.h> > > #include <linux/mm_inline.h> > > +#include <linux/iommu.h> > > #include <asm/pgalloc.h> > > #include <asm/tlb.h> > > > > @@ -406,3 +407,32 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, > > pmd_t *pmd, > > pte_unmap_unlock(pte, ptl); > > goto again; > > } > > + > > +#ifdef CONFIG_ASYNC_PGTABLE_FREE > > +static void kernel_pte_work_func(struct work_struct *work); > > +struct pgtable_free_work kernel_pte_work = { > > + .list = LIST_HEAD_INIT(kernel_pte_work.list), > > + .lock = __SPIN_LOCK_UNLOCKED(kernel_pte_work.lock), > > + .work = __WORK_INITIALIZER(kernel_pte_work.work, > > kernel_pte_work_func), > > +}; > > + > > +static void kernel_pte_work_func(struct work_struct *work) > > +{ > > + struct ptdesc *ptdesc, *next; > > + > > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > + > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, > > pt_list) { > > + list_del_init(&ptdesc->pt_list); > > + pagetable_dtor_free(ptdesc); > > + } > > +} > > + > > +void pte_free_kernel_async(struct ptdesc *ptdesc) > > +{ > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_add(&ptdesc->pt_list, &kernel_pte_work.list); > > + schedule_work(&kernel_pte_work.work); > > +} > kernel_pte_work.list is global shared var, it would make the producer > pte_free_kernel() and the consumer kernel_pte_work_func() to operate in > serialized timing. In a large system, I don't think you design this > deliberately :) > Sorry for jumping. Agree, unless it is never considered as a hot path or something that can be really contented. It looks like you can use just a per-cpu llist to drain thinks. As for reference you can have a look at how vfree_atomic() handles deferred freeing. Thanks! -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 9:15 ` Uladzislau Rezki @ 2025-08-11 12:55 ` Jason Gunthorpe 2025-08-15 9:23 ` Baolu Lu 2025-08-11 13:55 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-11 12:55 UTC (permalink / raw) To: Uladzislau Rezki Cc: Ethan Zhao, Baolu Lu, Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote: > Agree, unless it is never considered as a hot path or something that can > be really contented. It looks like you can use just a per-cpu llist to drain > thinks. I think we already largely agreed this was not a true fast path, I don't think we need a per-cpu here. But I would not make it extern unless there is a user? Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:55 ` Jason Gunthorpe @ 2025-08-15 9:23 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 9:23 UTC (permalink / raw) To: Jason Gunthorpe, Uladzislau Rezki Cc: baolu.lu, Ethan Zhao, Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 8:55 PM, Jason Gunthorpe wrote: > On Mon, Aug 11, 2025 at 11:15:58AM +0200, Uladzislau Rezki wrote: > >> Agree, unless it is never considered as a hot path or something that can >> be really contented. It looks like you can use just a per-cpu llist to drain >> thinks. > I think we already largely agreed this was not a true fast path, I > don't think we need a per-cpu here. > > But I would not make it extern unless there is a user? Fixed. Thanks! Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 9:15 ` Uladzislau Rezki 2025-08-11 12:55 ` Jason Gunthorpe @ 2025-08-11 13:55 ` Dave Hansen 2025-08-11 14:56 ` Uladzislau Rezki 2025-08-12 1:17 ` Ethan Zhao 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-11 13:55 UTC (permalink / raw) To: Uladzislau Rezki, Ethan Zhao, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 02:15, Uladzislau Rezki wrote: >> kernel_pte_work.list is global shared var, it would make the producer >> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in >> serialized timing. In a large system, I don't think you design this >> deliberately 🙂 >> > Sorry for jumping. > > Agree, unless it is never considered as a hot path or something that can > be really contented. It looks like you can use just a per-cpu llist to drain > thinks. Remember, the code that has to run just before all this sent an IPI to every single CPU on the system to have them do a (on x86 at least) pretty expensive TLB flush. If this is a hot path, we have bigger problems on our hands: the full TLB flush on every CPU. So, sure, there are a million ways to make this deferred freeing more scalable. But the code that's here is dirt simple and self contained. If someone has some ideas for something that's simpler and more scalable, then I'm totally open to it. But this is _not_ the place to add complexity to get scalability. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 13:55 ` Dave Hansen @ 2025-08-11 14:56 ` Uladzislau Rezki 2025-08-12 1:17 ` Ethan Zhao 1 sibling, 0 replies; 51+ messages in thread From: Uladzislau Rezki @ 2025-08-11 14:56 UTC (permalink / raw) To: Dave Hansen Cc: Uladzislau Rezki, Ethan Zhao, Baolu Lu, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Mon, Aug 11, 2025 at 06:55:52AM -0700, Dave Hansen wrote: > On 8/11/25 02:15, Uladzislau Rezki wrote: > >> kernel_pte_work.list is global shared var, it would make the producer > >> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in > >> serialized timing. In a large system, I don't think you design this > >> deliberately 🙂 > >> > > Sorry for jumping. > > > > Agree, unless it is never considered as a hot path or something that can > > be really contented. It looks like you can use just a per-cpu llist to drain > > thinks. > > Remember, the code that has to run just before all this sent an IPI to > every single CPU on the system to have them do a (on x86 at least) > pretty expensive TLB flush. > > If this is a hot path, we have bigger problems on our hands: the full > TLB flush on every CPU. > > So, sure, there are a million ways to make this deferred freeing more > scalable. But the code that's here is dirt simple and self contained. If > someone has some ideas for something that's simpler and more scalable, > then I'm totally open to it. > You could also have a look toward removing the &kernel_pte_work.lock. Replace it by llist_add() on adding side and llist_for_each_safe(n, t, llist_del_all(&list)) on removing side. So you do not need guard(spinlock) stuff. If i do not miss anything. > > But this is _not_ the place to add complexity to get scalability. > OK. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 51+ messages in thread
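For illustration, a minimal sketch of the llist variant suggested above might look like the following (untested, and only a sketch: the kernel_pte_work structure from the patch is replaced by a bare llist_head plus work_struct, and the just-freed table page itself carries the llist node, the same trick vfree_atomic() uses for the vmalloc area being freed; if scribbling the node into the dead table before the IOMMU flush is not acceptable, the node would have to live somewhere in struct ptdesc instead):

/* needs <linux/llist.h> */
static LLIST_HEAD(kernel_pte_free_list);

static void kernel_pte_work_func(struct work_struct *work)
{
	struct llist_node *node, *next;

	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

	/* Atomically take over the whole pending list; no lock needed. */
	llist_for_each_safe(node, next, llist_del_all(&kernel_pte_free_list))
		pagetable_dtor_free(virt_to_ptdesc(node));
}
static DECLARE_WORK(kernel_pte_work, kernel_pte_work_func);

void pte_free_kernel_async(struct ptdesc *ptdesc)
{
	/* The table is already disconnected; reuse its memory for the node. */
	struct llist_node *node = ptdesc_address(ptdesc);

	/* llist_add() returns true when the list was previously empty. */
	if (llist_add(node, &kernel_pte_free_list))
		schedule_work(&kernel_pte_work);
}

A per-CPU llist plus per-CPU work item, as vfree_atomic() does, would be the next step if the single list ever showed contention; given the flush_tlb_all() cost pointed out earlier in the thread, that is probably not needed.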
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 13:55 ` Dave Hansen 2025-08-11 14:56 ` Uladzislau Rezki @ 2025-08-12 1:17 ` Ethan Zhao 2025-08-15 14:35 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-12 1:17 UTC (permalink / raw) To: Dave Hansen, Uladzislau Rezki, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 9:55 PM, Dave Hansen wrote: > On 8/11/25 02:15, Uladzislau Rezki wrote: >>> kernel_pte_work.list is global shared var, it would make the producer >>> pte_free_kernel() and the consumer kernel_pte_work_func() to operate in >>> serialized timing. In a large system, I don't think you design this >>> deliberately 🙂 >>> >> Sorry for jumping. >> >> Agree, unless it is never considered as a hot path or something that can >> be really contented. It looks like you can use just a per-cpu llist to drain >> thinks. > > Remember, the code that has to run just before all this sent an IPI to > every single CPU on the system to have them do a (on x86 at least) > pretty expensive TLB flush. > It can be easily identified as a bottleneck by multi-CPU stress testing programs involving frequent process creation and destruction, similar to the operation of a heavily loaded multi-process Apache web server. Hot/cold path ? > If this is a hot path, we have bigger problems on our hands: the full > TLB flush on every CPU. Perhaps not "WE", IPI driven TLB flush seems not the shared mechanism of all CPUs, at least not for ARM as far as I know. > > So, sure, there are a million ways to make this deferred freeing more > scalable. But the code that's here is dirt simple and self contained. If > someone has some ideas for something that's simpler and more scalable, > then I'm totally open to it. > > But this is _not_ the place to add complexity to get scalability. At least, please dont add bottleneck, how complex to do that ? Thanks, Ethan ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-12 1:17 ` Ethan Zhao @ 2025-08-15 14:35 ` Dave Hansen 0 siblings, 0 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-15 14:35 UTC (permalink / raw) To: Ethan Zhao, Uladzislau Rezki, Baolu Lu Cc: Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 18:17, Ethan Zhao wrote: >> But this is _not_ the place to add complexity to get scalability. > At least, please dont add bottleneck, how complex to do that ? Very good question! If you're really interested and concerned about this, I'd encourage you to demonstrate where the contention becomes a problem in practice, then post a patch to fix it. If it's simple, I'll happily ack it! ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 5:15 ` Baolu Lu 2025-08-10 7:19 ` Ethan Zhao @ 2025-08-11 12:57 ` Jason Gunthorpe 2025-08-13 3:17 ` Ethan Zhao 2025-08-18 1:34 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-11 12:57 UTC (permalink / raw) To: Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { > + list_del_init(&ptdesc->pt_list); > + pagetable_dtor_free(ptdesc); > + } Do a list_move from kernel_pte_work.list to an on-stack list head and then immediately release the lock. No reason to hold the spinock while doing frees, also no reason to do list_del_init, that memory probably gets zerod in pagetable_dtor_free() Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
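As a sketch, the shape being suggested here, with the lock held only for the handover to an on-stack list head, could look like this (names taken from the patch quoted above; list_splice_tail_init() is used for the whole-list handover that "list_move to an on-stack list head" describes, and the later revisions posted in this thread follow essentially this pattern):

static void kernel_pte_work_func(struct work_struct *work)
{
	struct ptdesc *ptdesc, *next;
	LIST_HEAD(free_list);

	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

	/* Detach everything that is pending, then drop the lock. */
	spin_lock(&kernel_pte_work.lock);
	list_splice_tail_init(&kernel_pte_work.list, &free_list);
	spin_unlock(&kernel_pte_work.lock);

	/*
	 * Free outside the lock. No list_del needed: the entries are not
	 * looked at again once the ptdesc has been freed.
	 */
	list_for_each_entry_safe(ptdesc, next, &free_list, pt_list)
		pagetable_dtor_free(ptdesc);
}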
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:57 ` Jason Gunthorpe @ 2025-08-13 3:17 ` Ethan Zhao 0 siblings, 0 replies; 51+ messages in thread From: Ethan Zhao @ 2025-08-13 3:17 UTC (permalink / raw) To: Jason Gunthorpe, Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/2025 8:57 PM, Jason Gunthorpe wrote: > On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct ptdesc *ptdesc, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { >> + list_del_init(&ptdesc->pt_list); >> + pagetable_dtor_free(ptdesc); >> + } > > Do a list_move from kernel_pte_work.list to an on-stack list head and > then immediately release the lock. No reason to hold the spinock while > doing frees, also no reason to do list_del_init, that memory probably > gets zerod in pagetable_dtor_free() Yep, using guard(spinlock)() for scope-bound lock management sacrifices fine-grained control over the protected area. It offers convenience at the cost of precision. Out of my bias, calling it sluggard(spinlock)() might be proper. Thanks, Ethan > > Jason > ^ permalink raw reply [flat|nested] 51+ messages in thread
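If the scope-based style is still wanted, the handover in the sketch above could equally be written with scoped_guard() from <linux/cleanup.h>, which keeps the automatic unlock but limits the critical section to a single statement:

	scoped_guard(spinlock, &kernel_pte_work.lock)
		list_splice_tail_init(&kernel_pte_work.list, &free_list);

so fine-grained control and guard-style cleanup are not mutually exclusive.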
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-11 12:57 ` Jason Gunthorpe 2025-08-13 3:17 ` Ethan Zhao @ 2025-08-18 1:34 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 1:34 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/11/25 20:57, Jason Gunthorpe wrote: > On Fri, Aug 08, 2025 at 01:15:12PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct ptdesc *ptdesc, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(ptdesc, next, &kernel_pte_work.list, pt_list) { >> + list_del_init(&ptdesc->pt_list); >> + pagetable_dtor_free(ptdesc); >> + } > Do a list_move from kernel_pte_work.list to an on-stack list head and > then immediately release the lock. No reason to hold the spinock while > doing frees, also no reason to do list_del_init, that memory probably > gets zerod in pagetable_dtor_free() Done. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 14:40 ` Baolu Lu 2025-08-07 15:31 ` Dave Hansen @ 2025-08-07 19:51 ` Jason Gunthorpe 2025-08-08 2:57 ` Tian, Kevin 2025-08-08 5:08 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-07 19:51 UTC (permalink / raw) To: Baolu Lu Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > +static void kernel_pte_work_func(struct work_struct *work) > +{ > + struct page *page, *next; > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > + > + guard(spinlock)(&kernel_pte_work.lock); > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > + list_del_init(&page->lru); Please don't add new usages of lru, we are trying to get rid of this. :( I think the memory should be struct ptdesc, use that.. Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 19:51 ` Jason Gunthorpe @ 2025-08-08 2:57 ` Tian, Kevin 2025-08-15 9:16 ` Baolu Lu 2025-08-18 6:21 ` Baolu Lu 2025-08-08 5:08 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Tian, Kevin @ 2025-08-08 2:57 UTC (permalink / raw) To: Jason Gunthorpe, Baolu Lu Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, August 8, 2025 3:52 AM > > On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > > +static void kernel_pte_work_func(struct work_struct *work) > > +{ > > + struct page *page, *next; > > + > > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > + > > + guard(spinlock)(&kernel_pte_work.lock); > > + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > > + list_del_init(&page->lru); > > Please don't add new usages of lru, we are trying to get rid of this. :( > > I think the memory should be struct ptdesc, use that.. > btw with this change we should also defer free of the pmd page: pud_free_pmd_page() ... for (i = 0; i < PTRS_PER_PMD; i++) { if (!pmd_none(pmd_sv[i])) { pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); pte_free_kernel(&init_mm, pte); } } free_page((unsigned long)pmd_sv); Otherwise the risk still exists if the pmd page is repurposed before the pte work is scheduled. another observation - pte_free_kernel is not used in remove_pagetable () and __change_page_attr(). Is it straightforward to put it in those paths or do we need duplicate some deferring logic there? ^ permalink raw reply [flat|nested] 51+ messages in thread
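Concretely, the pud_free_pmd_page() part of this could presumably be routed through the same deferred path as the pte tables, something along these lines (a sketch against the draft above, replacing the synchronous pmd_free() call; the pmd_sv scratch page is not a page table and can keep being freed directly):

	/* in pud_free_pmd_page(), instead of pmd_free(&init_mm, pmd): */
	pte_free_kernel_async(virt_to_ptdesc(pmd));

since the generic pmd_free() is little more than pagetable_dtor_free() on the same ptdesc, which is what the deferred worker ends up calling anyway.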
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 2:57 ` Tian, Kevin @ 2025-08-15 9:16 ` Baolu Lu 2025-08-15 9:46 ` Tian, Kevin 2025-08-15 14:31 ` Dave Hansen 2025-08-18 6:21 ` Baolu Lu 1 sibling, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 9:16 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/8/2025 10:57 AM, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, August 8, 2025 3:52 AM >> >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>> +static void kernel_pte_work_func(struct work_struct *work) >>> +{ >>> + struct page *page, *next; >>> + >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>> + list_del_init(&page->lru); >> >> Please don't add new usages of lru, we are trying to get rid of this. :( >> >> I think the memory should be struct ptdesc, use that.. >> > > btw with this change we should also defer free of the pmd page: > > pud_free_pmd_page() > ... > for (i = 0; i < PTRS_PER_PMD; i++) { > if (!pmd_none(pmd_sv[i])) { > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > pte_free_kernel(&init_mm, pte); > } > } > > free_page((unsigned long)pmd_sv); > > Otherwise the risk still exists if the pmd page is repurposed before the > pte work is scheduled. You're right that freeing high-level page table pages also requires an IOTLB flush before the pages are freed. But I question the practical risk of the race given the extremely small time window. If this is a real concern, a potential mitigation would be to clear the U/S bits in all page table entries for kernel address space? But I am not confident in making that change at this time as I am unsure of the side effects it might cause. > > another observation - pte_free_kernel is not used in remove_pagetable () > and __change_page_attr(). Is it straightforward to put it in those paths > or do we need duplicate some deferring logic there? The remove_pagetable() function is called in the path where memory is hot-removed from the system, right? If so, there should be no issue, as the threat model here is a page table page being freed and repurposed while it's still cached in the IOTLB. In the hot-remove case, the memory is removed and will not be reused, so that's fine as far as I can see. The same to __change_page_attr(), which only changes the attributes of a page table entry while the underlying page remains in use. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:16 ` Baolu Lu @ 2025-08-15 9:46 ` Tian, Kevin 2025-08-18 5:58 ` Baolu Lu 2025-08-15 14:31 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-15 9:46 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Friday, August 15, 2025 5:17 PM > > On 8/8/2025 10:57 AM, Tian, Kevin wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, August 8, 2025 3:52 AM > >> > >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > >>> +static void kernel_pte_work_func(struct work_struct *work) > >>> +{ > >>> + struct page *page, *next; > >>> + > >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > >>> + > >>> + guard(spinlock)(&kernel_pte_work.lock); > >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > >>> + list_del_init(&page->lru); > >> > >> Please don't add new usages of lru, we are trying to get rid of this. :( > >> > >> I think the memory should be struct ptdesc, use that.. > >> > > > > btw with this change we should also defer free of the pmd page: > > > > pud_free_pmd_page() > > ... > > for (i = 0; i < PTRS_PER_PMD; i++) { > > if (!pmd_none(pmd_sv[i])) { > > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > > pte_free_kernel(&init_mm, pte); > > } > > } > > > > free_page((unsigned long)pmd_sv); > > > > Otherwise the risk still exists if the pmd page is repurposed before the > > pte work is scheduled. > > You're right that freeing high-level page table pages also requires an > IOTLB flush before the pages are freed. But I question the practical > risk of the race given the extremely small time window. If this is a It's already extremely difficult to conduct a real attack even w/o this fix. I'm not sure the criteria how small we consider acceptable in this specific case. but leaving an incomplete fix in code doesn't sound clean... > real concern, a potential mitigation would be to clear the U/S bits in > all page table entries for kernel address space? But I am not confident > in making that change at this time as I am unsure of the side effects it > might cause. I think there was already consensus that clearing U/S bits in all entries doesn't prevent the IOMMU caching them and setting A/D bits on the freed pagetable. > > > > > another observation - pte_free_kernel is not used in remove_pagetable () > > and __change_page_attr(). Is it straightforward to put it in those paths > > or do we need duplicate some deferring logic there? > > The remove_pagetable() function is called in the path where memory is > hot-removed from the system, right? If so, there should be no issue, as > the threat model here is a page table page being freed and repurposed > while it's still cached in the IOTLB. In the hot-remove case, the memory > is removed and will not be reused, so that's fine as far as I can see. what about the page is hot-added back while the stale entry pointing to it is still valid in the IOMMU, theoretically? 😊 > > The same to __change_page_attr(), which only changes the attributes of a > page table entry while the underlying page remains in use. 
> it may lead to cpa_collapse_large_pages() if changing attribute leads to all adjacent 4k pages in 2M range are with same attribute. Then page table might be freed: cpa_collapse_large_pages(): list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); __free_page(ptdesc_page(ptdesc)); } ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:46 ` Tian, Kevin @ 2025-08-18 5:58 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 5:58 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 17:46, Tian, Kevin wrote: >> From: Baolu Lu <baolu.lu@linux.intel.com> >> Sent: Friday, August 15, 2025 5:17 PM >> >> On 8/8/2025 10:57 AM, Tian, Kevin wrote: >>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>> Sent: Friday, August 8, 2025 3:52 AM >>>> >>>> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>>>> +static void kernel_pte_work_func(struct work_struct *work) >>>>> +{ >>>>> + struct page *page, *next; >>>>> + >>>>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>>>> + >>>>> + guard(spinlock)(&kernel_pte_work.lock); >>>>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>>>> + list_del_init(&page->lru); >>>> >>>> Please don't add new usages of lru, we are trying to get rid of this. :( >>>> >>>> I think the memory should be struct ptdesc, use that.. >>>> >>> >>> btw with this change we should also defer free of the pmd page: >>> >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> You're right that freeing high-level page table pages also requires an >> IOTLB flush before the pages are freed. But I question the practical >> risk of the race given the extremely small time window. If this is a > > It's already extremely difficult to conduct a real attack even w/o this > fix. I'm not sure the criteria how small we consider acceptable in this > specific case. but leaving an incomplete fix in code doesn't sound clean... > >> real concern, a potential mitigation would be to clear the U/S bits in >> all page table entries for kernel address space? But I am not confident >> in making that change at this time as I am unsure of the side effects it >> might cause. > > I think there was already consensus that clearing U/S bits in all entries > doesn't prevent the IOMMU caching them and setting A/D bits on > the freed pagetable. > >> >>> >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >> >> The remove_pagetable() function is called in the path where memory is >> hot-removed from the system, right? If so, there should be no issue, as >> the threat model here is a page table page being freed and repurposed >> while it's still cached in the IOTLB. In the hot-remove case, the memory >> is removed and will not be reused, so that's fine as far as I can see. > > what about the page is hot-added back while the stale entry pointing to > it is still valid in the IOMMU, theoretically? 😊 > >> >> The same to __change_page_attr(), which only changes the attributes of a >> page table entry while the underlying page remains in use. 
>> > > it may lead to cpa_collapse_large_pages() if changing attribute leads to > all adjacent 4k pages in 2M range are with same attribute. Then page > table might be freed: > > cpa_collapse_large_pages(): > list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > list_del(&ptdesc->pt_list); > __free_page(ptdesc_page(ptdesc)); > } All look fair enough to me. I will handle all the cases and make it complete. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 9:16 ` Baolu Lu 2025-08-15 9:46 ` Tian, Kevin @ 2025-08-15 14:31 ` Dave Hansen 2025-08-18 6:08 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-15 14:31 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 02:16, Baolu Lu wrote: > On 8/8/2025 10:57 AM, Tian, Kevin wrote: >> pud_free_pmd_page() >> ... >> for (i = 0; i < PTRS_PER_PMD; i++) { >> if (!pmd_none(pmd_sv[i])) { >> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >> pte_free_kernel(&init_mm, pte); >> } >> } >> >> free_page((unsigned long)pmd_sv); >> >> Otherwise the risk still exists if the pmd page is repurposed before the >> pte work is scheduled. > > You're right that freeing high-level page table pages also requires an > IOTLB flush before the pages are freed. But I question the practical > risk of the race given the extremely small time window. I hear that Linux is gaining popularity these days. There might even be dozens of users! Given that large scale of dozens (or even hundreds??) of users, I would suggest exercising some care. The race might be small but it only needs to happen once to cause chaos. Seriously, though... A race is a race. Preemption or interrupts or SMIs or VMExits or a million other things can cause a "small time window" to become a big time window. Even perceived small races need to be fixed. > If this is a real concern, a potential mitigation would be to clear > the U/S bits in all page table entries for kernel address space? But > I am not confident in making that change at this time as I am unsure > of the side effects it might cause. That doesn't do any good. I even went as far as double-checking months ago with the IOMMU hardware folks to confirm the actual implementation. I'm really surprised this is being brought up again. >> another observation - pte_free_kernel is not used in remove_pagetable () >> and __change_page_attr(). Is it straightforward to put it in those paths >> or do we need duplicate some deferring logic there? > > The remove_pagetable() function is called in the path where memory is > hot-removed from the system, right? No. Not right. This is in the vmalloc() code: the side of things that _creates_ mappings for new allocations, not tears them down. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-15 14:31 ` Dave Hansen @ 2025-08-18 6:08 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-18 6:08 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/15/25 22:31, Dave Hansen wrote: > On 8/15/25 02:16, Baolu Lu wrote: >> On 8/8/2025 10:57 AM, Tian, Kevin wrote: >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> You're right that freeing high-level page table pages also requires an >> IOTLB flush before the pages are freed. But I question the practical >> risk of the race given the extremely small time window. > > I hear that Linux is gaining popularity these days. There might even be > dozens of users! Given that large scale of dozens (or even hundreds??) > of users, I would suggest exercising some care. The race might be small > but it only needs to happen once to cause chaos. > > Seriously, though... A race is a race. Preemption or interrupts or SMIs > or VMExits or a million other things can cause a "small time window" to > become a big time window. > > Even perceived small races need to be fixed. Yes, agreed. > >> If this is a real concern, a potential mitigation would be to clear >> the U/S bits in all page table entries for kernel address space? But >> I am not confident in making that change at this time as I am unsure >> of the side effects it might cause. > > That doesn't do any good. I even went as far as double-checking months > ago with the IOMMU hardware folks to confirm the actual implementation. > I'm really surprised this is being brought up again. It should not be brought up again. Sorry about it. I was thinking about deferring pte page table page free plus clearing U/S bit. That's also not desirable anyway, so let's ignore it. > >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >> >> The remove_pagetable() function is called in the path where memory is >> hot-removed from the system, right? > > No. Not right. > > This is in the vmalloc() code: the side of things that _creates_ > mappings for new allocations, not tears them down. Yeah, let me look into it. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-08 2:57 ` Tian, Kevin 2025-08-15 9:16 ` Baolu Lu @ 2025-08-18 6:21 ` Baolu Lu 2025-08-21 7:05 ` Tian, Kevin 1 sibling, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-18 6:21 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/8/25 10:57, Tian, Kevin wrote: >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, August 8, 2025 3:52 AM >> >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>> +static void kernel_pte_work_func(struct work_struct *work) >>> +{ >>> + struct page *page, *next; >>> + >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>> + >>> + guard(spinlock)(&kernel_pte_work.lock); >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>> + list_del_init(&page->lru); >> >> Please don't add new usages of lru, we are trying to get rid of this. :( >> >> I think the memory should be struct ptdesc, use that.. >> > > btw with this change we should also defer free of the pmd page: > > pud_free_pmd_page() > ... > for (i = 0; i < PTRS_PER_PMD; i++) { > if (!pmd_none(pmd_sv[i])) { > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > pte_free_kernel(&init_mm, pte); > } > } > > free_page((unsigned long)pmd_sv); > > Otherwise the risk still exists if the pmd page is repurposed before the > pte work is scheduled. My understanding is that you were talking about pmd_free(), correct? It appears that both pmd_free() and pte_free_kernel() call pagetable_dtor_free(). If so, how about the following change? diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index dbddacdca2ce..04f6b5bc212c 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad #ifndef __HAVE_ARCH_PMD_FREE static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) { - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); - BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); - pagetable_dtor_free(ptdesc); + pte_free_kernel(mm, (pte_t *)pmd); } > > another observation - pte_free_kernel is not used in remove_pagetable () > and __change_page_attr(). Is it straightforward to put it in those paths > or do we need duplicate some deferring logic there? > In both of these cases, pages in an array or list require deferred freeing. I don't believe the previous approach, which calls pagetable_dtor_free(), will still work here. Perhaps a separate list is needed? What about something like the following? 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..2e887463c165 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_free_pages_async(page, order); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..7e567bdfddce 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) flush_tlb_all(); - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { - list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + kernel_pgtable_free_list_async(&pgtables); + } else { + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 4938eff5b482..13bbe1588872 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -48,8 +48,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #ifdef CONFIG_ASYNC_PGTABLE_FREE void pte_free_kernel_async(struct ptdesc *ptdesc); +void kernel_pgtable_free_list_async(struct list_head *list); +void kernel_pgtable_free_pages_async(struct page *pages, int order); #else static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} +static inline void kernel_pgtable_free_list_async(struct list_head *list) {} +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} #endif /** diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index b9a785dd6b8d..d1d19132bbe8 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct *work); struct { struct list_head list; + struct list_head pages; spinlock_t lock; struct work_struct work; } kernel_pte_work = { @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct work_struct *work) { struct ptdesc *ptdesc, *next; LIST_HEAD(free_list); + LIST_HEAD(pages_list); iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); spin_lock(&kernel_pte_work.lock); list_move(&kernel_pte_work.list, &free_list); + list_move(&kernel_pte_work.pages, &pages_list); spin_unlock(&kernel_pte_work.lock); list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { list_del(&ptdesc->pt_list); pagetable_dtor_free(ptdesc); } + + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } void pte_free_kernel_async(struct ptdesc *ptdesc) @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) list_add(&ptdesc->pt_list, &kernel_pte_work.list); schedule_work(&kernel_pte_work.work); } + +void kernel_pgtable_free_list_async(struct list_head *list) +{ + guard(spinlock)(&kernel_pte_work.lock); + list_splice_tail(list, &kernel_pte_work.pages); + schedule_work(&kernel_pte_work.work); +} + +void kernel_pgtable_free_pages_async(struct page *pages, int order) +{ + unsigned long nr_pages = 1 << order; + struct ptdesc *ptdesc; + int i; + + guard(spinlock)(&kernel_pte_work.lock); + for (i = 0; i < nr_pages; i++) { + ptdesc = page_ptdesc(&pages[i]); + 
list_add(&ptdesc->pt_list, &kernel_pte_work.pages); + } + schedule_work(&kernel_pte_work.work); +} #endif Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-18 6:21 ` Baolu Lu @ 2025-08-21 7:05 ` Tian, Kevin 2025-08-23 3:26 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-21 7:05 UTC (permalink / raw) To: Baolu Lu, Jason Gunthorpe Cc: Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Monday, August 18, 2025 2:22 PM > > On 8/8/25 10:57, Tian, Kevin wrote: > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, August 8, 2025 3:52 AM > >> > >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: > >>> +static void kernel_pte_work_func(struct work_struct *work) > >>> +{ > >>> + struct page *page, *next; > >>> + > >>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > >>> + > >>> + guard(spinlock)(&kernel_pte_work.lock); > >>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { > >>> + list_del_init(&page->lru); > >> > >> Please don't add new usages of lru, we are trying to get rid of this. :( > >> > >> I think the memory should be struct ptdesc, use that.. > >> > > > > btw with this change we should also defer free of the pmd page: > > > > pud_free_pmd_page() > > ... > > for (i = 0; i < PTRS_PER_PMD; i++) { > > if (!pmd_none(pmd_sv[i])) { > > pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); > > pte_free_kernel(&init_mm, pte); > > } > > } > > > > free_page((unsigned long)pmd_sv); > > > > Otherwise the risk still exists if the pmd page is repurposed before the > > pte work is scheduled. > > > My understanding is that you were talking about pmd_free(), correct? It yes > appears that both pmd_free() and pte_free_kernel() call > pagetable_dtor_free(). If so, how about the following change? > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index dbddacdca2ce..04f6b5bc212c 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct > mm_struct *mm, unsigned long ad > #ifndef __HAVE_ARCH_PMD_FREE > static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) > { > - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); > - > BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); > - pagetable_dtor_free(ptdesc); > + pte_free_kernel(mm, (pte_t *)pmd); > } better to rename pte_free_kernel_async() to pagetable_free_kernel_async() and call it directly here like you did in pte_free_kernel(). otherwise it's a bit weird to call a pte helper for pmd. accordingly do we need a new helper pmd_free_kernel()? Now there is no restriction that pmd_free() can only be called on kernel entries. e.g. mop_up_one_pmd() (only in PAE and KPTI), destroy_args() if CONFIG_DEBUG_VM_PGTABLE, etc. > > > > > another observation - pte_free_kernel is not used in remove_pagetable () > > and __change_page_attr(). Is it straightforward to put it in those paths > > or do we need duplicate some deferring logic there? > > > > In both of these cases, pages in an array or list require deferred > freeing. I don't believe the previous approach, which calls > pagetable_dtor_free(), will still work here. Perhaps a separate list is this is the part which I don't quite understand. 
Is there guarantee that page tables removed in those path are not allocated with any pagetable_ctor helpers? Otherwise some state might be broken w/o proper pagetable_dtor(). Knowing little here, so likely I missed some basic context. > needed? What about something like the following? > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c > index 76e33bd7c556..2e887463c165 100644 > --- a/arch/x86/mm/init_64.c > +++ b/arch/x86/mm/init_64.c > @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page > *page, int order) > free_reserved_pages(page, nr_pages); > #endif > } else { > - free_pages((unsigned long)page_address(page), order); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_free_pages_async(page, order); > + else > + free_pages((unsigned long)page_address(page), > order); > } > } > > diff --git a/arch/x86/mm/pat/set_memory.c > b/arch/x86/mm/pat/set_memory.c > index 8834c76f91c9..7e567bdfddce 100644 > --- a/arch/x86/mm/pat/set_memory.c > +++ b/arch/x86/mm/pat/set_memory.c > @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct > cpa_data *cpa) > > flush_tlb_all(); > > - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > - list_del(&ptdesc->pt_list); > - __free_page(ptdesc_page(ptdesc)); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { > + kernel_pgtable_free_list_async(&pgtables); > + } else { > + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > + list_del(&ptdesc->pt_list); > + __free_page(ptdesc_page(ptdesc)); > + } > } > } > > diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h > index 4938eff5b482..13bbe1588872 100644 > --- a/include/asm-generic/pgalloc.h > +++ b/include/asm-generic/pgalloc.h > @@ -48,8 +48,12 @@ static inline pte_t > *pte_alloc_one_kernel_noprof(struct mm_struct *mm) > > #ifdef CONFIG_ASYNC_PGTABLE_FREE > void pte_free_kernel_async(struct ptdesc *ptdesc); > +void kernel_pgtable_free_list_async(struct list_head *list); > +void kernel_pgtable_free_pages_async(struct page *pages, int order); > #else > static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} > +static inline void kernel_pgtable_free_list_async(struct list_head > *list) {} > +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} > #endif > > /** > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > index b9a785dd6b8d..d1d19132bbe8 100644 > --- a/mm/pgtable-generic.c > +++ b/mm/pgtable-generic.c > @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct > *work); > > struct { > struct list_head list; > + struct list_head pages; the real difference between the two lists is about whether to use pagetable_dtor_free() or __free_page(). Then clearer to call them 'pages_dtor' and 'pages'? 
> spinlock_t lock; > struct work_struct work; > } kernel_pte_work = { > @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct > work_struct *work) > { > struct ptdesc *ptdesc, *next; > LIST_HEAD(free_list); > + LIST_HEAD(pages_list); > > iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > spin_lock(&kernel_pte_work.lock); > list_move(&kernel_pte_work.list, &free_list); > + list_move(&kernel_pte_work.pages, &pages_list); > spin_unlock(&kernel_pte_work.lock); > > list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { > list_del(&ptdesc->pt_list); > pagetable_dtor_free(ptdesc); > } > + > + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { > + list_del(&ptdesc->pt_list); > + __free_page(ptdesc_page(ptdesc)); > + } > } > > void pte_free_kernel_async(struct ptdesc *ptdesc) > @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) > list_add(&ptdesc->pt_list, &kernel_pte_work.list); > schedule_work(&kernel_pte_work.work); > } > + > +void kernel_pgtable_free_list_async(struct list_head *list) > +{ > + guard(spinlock)(&kernel_pte_work.lock); > + list_splice_tail(list, &kernel_pte_work.pages); > + schedule_work(&kernel_pte_work.work); > +} > + > +void kernel_pgtable_free_pages_async(struct page *pages, int order) > +{ > + unsigned long nr_pages = 1 << order; > + struct ptdesc *ptdesc; > + int i; > + > + guard(spinlock)(&kernel_pte_work.lock); > + for (i = 0; i < nr_pages; i++) { > + ptdesc = page_ptdesc(&pages[i]); > + list_add(&ptdesc->pt_list, &kernel_pte_work.pages); > + } there is a side-effect here. Now a contiguous range of pages (order=N) are freed one-by-one, so they are first fed back to the order=0 free list with possibility of getting split due to small order allocations before having chance to lift back to the order=N list. then the number of available huge pages is unnecessarily reduced. but look at the code probably we don't need to handle the order>0 case? now only free_hugepage_table() may free a huge page, called from remove_pmd_table() when a pmd is a *leaf* entry pointing to a vmemmap huge page. So it's not a real page table. I'm not sure why it's piggybacked in a pagetable helper, but sounds like a case we can ignore in this mitigation... > + schedule_work(&kernel_pte_work.work); > +} > #endif > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-21 7:05 ` Tian, Kevin @ 2025-08-23 3:26 ` Baolu Lu 2025-08-25 22:36 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-23 3:26 UTC (permalink / raw) To: Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Hansen, Dave, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/21/2025 3:05 PM, Tian, Kevin wrote: >> From: Baolu Lu<baolu.lu@linux.intel.com> >> Sent: Monday, August 18, 2025 2:22 PM >> >> On 8/8/25 10:57, Tian, Kevin wrote: >>>> From: Jason Gunthorpe<jgg@nvidia.com> >>>> Sent: Friday, August 8, 2025 3:52 AM >>>> >>>> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >>>>> +static void kernel_pte_work_func(struct work_struct *work) >>>>> +{ >>>>> + struct page *page, *next; >>>>> + >>>>> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >>>>> + >>>>> + guard(spinlock)(&kernel_pte_work.lock); >>>>> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >>>>> + list_del_init(&page->lru); >>>> Please don't add new usages of lru, we are trying to get rid of this. 🙁 >>>> >>>> I think the memory should be struct ptdesc, use that.. >>>> >>> btw with this change we should also defer free of the pmd page: >>> >>> pud_free_pmd_page() >>> ... >>> for (i = 0; i < PTRS_PER_PMD; i++) { >>> if (!pmd_none(pmd_sv[i])) { >>> pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]); >>> pte_free_kernel(&init_mm, pte); >>> } >>> } >>> >>> free_page((unsigned long)pmd_sv); >>> >>> Otherwise the risk still exists if the pmd page is repurposed before the >>> pte work is scheduled. >> >> My understanding is that you were talking about pmd_free(), correct? It > yes > >> appears that both pmd_free() and pte_free_kernel() call >> pagetable_dtor_free(). If so, how about the following change? >> >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index dbddacdca2ce..04f6b5bc212c 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -172,10 +172,8 @@ static inline pmd_t *pmd_alloc_one_noprof(struct >> mm_struct *mm, unsigned long ad >> #ifndef __HAVE_ARCH_PMD_FREE >> static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) >> { >> - struct ptdesc *ptdesc = virt_to_ptdesc(pmd); >> - >> BUG_ON((unsigned long)pmd & (PAGE_SIZE-1)); >> - pagetable_dtor_free(ptdesc); >> + pte_free_kernel(mm, (pte_t *)pmd); >> } > better to rename pte_free_kernel_async() to pagetable_free_kernel_async() > and call it directly here like you did in pte_free_kernel(). otherwise it's > a bit weird to call a pte helper for pmd. > > accordingly do we need a new helper pmd_free_kernel()? Now there is > no restriction that pmd_free() can only be called on kernel entries. e.g. > mop_up_one_pmd() (only in PAE and KPTI), destroy_args() if > CONFIG_DEBUG_VM_PGTABLE, etc. > >>> another observation - pte_free_kernel is not used in remove_pagetable () >>> and __change_page_attr(). Is it straightforward to put it in those paths >>> or do we need duplicate some deferring logic there? >>> >> In both of these cases, pages in an array or list require deferred >> freeing. I don't believe the previous approach, which calls >> pagetable_dtor_free(), will still work here. Perhaps a separate list is > this is the part which I don't quite understand. 
Is there guarantee that > page tables removed in those path are not allocated with any > pagetable_ctor helpers? Otherwise some state might be broken w/o > proper pagetable_dtor(). > > Knowing little here, so likely I missed some basic context. > >> needed? What about something like the following? >> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c >> index 76e33bd7c556..2e887463c165 100644 >> --- a/arch/x86/mm/init_64.c >> +++ b/arch/x86/mm/init_64.c >> @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page >> *page, int order) >> free_reserved_pages(page, nr_pages); >> #endif >> } else { >> - free_pages((unsigned long)page_address(page), order); >> + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) >> + kernel_pgtable_free_pages_async(page, order); >> + else >> + free_pages((unsigned long)page_address(page), >> order); >> } >> } >> >> diff --git a/arch/x86/mm/pat/set_memory.c >> b/arch/x86/mm/pat/set_memory.c >> index 8834c76f91c9..7e567bdfddce 100644 >> --- a/arch/x86/mm/pat/set_memory.c >> +++ b/arch/x86/mm/pat/set_memory.c >> @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct >> cpa_data *cpa) >> >> flush_tlb_all(); >> >> - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { >> - list_del(&ptdesc->pt_list); >> - __free_page(ptdesc_page(ptdesc)); >> + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { >> + kernel_pgtable_free_list_async(&pgtables); >> + } else { >> + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { >> + list_del(&ptdesc->pt_list); >> + __free_page(ptdesc_page(ptdesc)); >> + } >> } >> } >> >> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h >> index 4938eff5b482..13bbe1588872 100644 >> --- a/include/asm-generic/pgalloc.h >> +++ b/include/asm-generic/pgalloc.h >> @@ -48,8 +48,12 @@ static inline pte_t >> *pte_alloc_one_kernel_noprof(struct mm_struct *mm) >> >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> void pte_free_kernel_async(struct ptdesc *ptdesc); >> +void kernel_pgtable_free_list_async(struct list_head *list); >> +void kernel_pgtable_free_pages_async(struct page *pages, int order); >> #else >> static inline void pte_free_kernel_async(struct ptdesc *ptdesc) {} >> +static inline void kernel_pgtable_free_list_async(struct list_head >> *list) {} >> +void kernel_pgtable_free_pages_async(struct page *pages, int order) {} >> #endif >> >> /** >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c >> index b9a785dd6b8d..d1d19132bbe8 100644 >> --- a/mm/pgtable-generic.c >> +++ b/mm/pgtable-generic.c >> @@ -413,6 +413,7 @@ static void kernel_pte_work_func(struct work_struct >> *work); >> >> struct { >> struct list_head list; >> + struct list_head pages; > the real difference between the two lists is about whether to use > pagetable_dtor_free() or __free_page(). Then clearer to call them > 'pages_dtor' and 'pages'? 
> >> spinlock_t lock; >> struct work_struct work; >> } kernel_pte_work = { >> @@ -425,17 +426,24 @@ static void kernel_pte_work_func(struct >> work_struct *work) >> { >> struct ptdesc *ptdesc, *next; >> LIST_HEAD(free_list); >> + LIST_HEAD(pages_list); >> >> iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> >> spin_lock(&kernel_pte_work.lock); >> list_move(&kernel_pte_work.list, &free_list); >> + list_move(&kernel_pte_work.pages, &pages_list); >> spin_unlock(&kernel_pte_work.lock); >> >> list_for_each_entry_safe(ptdesc, next, &free_list, pt_list) { >> list_del(&ptdesc->pt_list); >> pagetable_dtor_free(ptdesc); >> } >> + >> + list_for_each_entry_safe(ptdesc, next, &pages_list, pt_list) { >> + list_del(&ptdesc->pt_list); >> + __free_page(ptdesc_page(ptdesc)); >> + } >> } >> >> void pte_free_kernel_async(struct ptdesc *ptdesc) >> @@ -444,4 +452,25 @@ void pte_free_kernel_async(struct ptdesc *ptdesc) >> list_add(&ptdesc->pt_list, &kernel_pte_work.list); >> schedule_work(&kernel_pte_work.work); >> } >> + >> +void kernel_pgtable_free_list_async(struct list_head *list) >> +{ >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_splice_tail(list, &kernel_pte_work.pages); >> + schedule_work(&kernel_pte_work.work); >> +} >> + >> +void kernel_pgtable_free_pages_async(struct page *pages, int order) >> +{ >> + unsigned long nr_pages = 1 << order; >> + struct ptdesc *ptdesc; >> + int i; >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + for (i = 0; i < nr_pages; i++) { >> + ptdesc = page_ptdesc(&pages[i]); >> + list_add(&ptdesc->pt_list, &kernel_pte_work.pages); >> + } > there is a side-effect here. Now a contiguous range of pages (order=N) > are freed one-by-one, so they are first fed back to the order=0 free list > with possibility of getting split due to small order allocations before > having chance to lift back to the order=N list. then the number of > available huge pages is unnecessarily reduced. > > but look at the code probably we don't need to handle the order>0 case? > > now only free_hugepage_table() may free a huge page, called from > remove_pmd_table() when a pmd is a*leaf* entry pointing to a > vmemmap huge page. So it's not a real page table. I'm not sure why > it's piggybacked in a pagetable helper, but sounds like a case we > can ignore in this mitigation... I am not sure about this context either. My understanding is that "order > 0" is used for the batch deallocation of page table pages that were created to map large, contiguous blocks of memory. Thanks for your comments. I have created a new patch to introduce the asynchronous page table page free mechanism, taking the above comments into consideration. Let me post it below for further review. mm: Conditionally defer freeing of kernel page table pages On x86 and other architectures that map the kernel's virtual address space into the upper portion of every process's page table, the IOMMU's paging structure caches can become stale. This occurs when a page used for the kernel's page tables is freed and reused without the IOMMU being notified. While the IOMMU driver is notified of changes to user virtual address mappings, there is no similar notification mechanism for kernel page table changes. This can lead to data corruption or system instability when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal caches may retain stale entries for kernel virtual addresses. This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_PGTABLE_FREE. 
When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing provides a safe context for a future patch to add an IOMMU-specific callback, which might be expensive on large-scale systems. This ensures the necessary IOMMU cache invalidation is performed before the page is finally returned to the page allocator outside of any critical, non-sleepable path. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> --- arch/x86/mm/init_64.c | 5 +- arch/x86/mm/pat/set_memory.c | 10 +++- arch/x86/mm/pgtable.c | 5 +- include/asm-generic/pgalloc.h | 17 +++++- mm/Kconfig | 3 + mm/pgtable-generic.c | 106 ++++++++++++++++++++++++++++++++++ 6 files changed, 140 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..bf06d815d214 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,10 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_pages(page, order); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..97d4f39cae7d 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -436,9 +436,13 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) flush_tlb_all(); - list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { - list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + kernel_pgtable_async_free_page_list(&pgtables); + } else { + list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } } } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..45268f009cb0 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -757,7 +757,10 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr) free_page((unsigned long)pmd_sv); - pmd_free(&init_mm, pmd); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_dtor(virt_to_ptdesc(pmd)); + else + pmd_free(&init_mm, pmd); return 1; } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..baa9780cad71 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,16 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) 
alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc); +void kernel_pgtable_async_free_page_list(struct list_head *list); +void kernel_pgtable_async_free_pages(struct page *pages, int order); +#else +static inline void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) {} +static inline void kernel_pgtable_async_free_page_list(struct list_head *list) {} +static inline void kernel_pgtable_async_free_pages(struct page *pages, int order) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +63,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free_dtor(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..369f3259e6da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,9 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..d2751da58c94 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -406,3 +406,109 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pgtable_work_func(struct work_struct *work); + +static struct { + /* list for pagetable_dtor_free() */ + struct list_head dtor; + /* list for __free_page() */ + struct list_head page; + /* list for free_pages() */ + struct list_head pages; + /* protect all the ptdesc lists */ + spinlock_t lock; + struct work_struct work; +} kernel_pgtable_work = { + .dtor = LIST_HEAD_INIT(kernel_pgtable_work.dtor), + .page = LIST_HEAD_INIT(kernel_pgtable_work.page), + .pages = LIST_HEAD_INIT(kernel_pgtable_work.pages), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), + .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), +}; + +struct async_free_data { + struct list_head node; + unsigned long addr; + int order; +}; + +static void kernel_pgtable_work_func(struct work_struct *work) +{ + struct async_free_data *data, *temp; + struct ptdesc *ptdesc, *next; + LIST_HEAD(pages_list); + LIST_HEAD(dtor_list); + LIST_HEAD(page_list); + + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail_init(&kernel_pgtable_work.dtor, &dtor_list); + list_splice_tail_init(&kernel_pgtable_work.page, &page_list); + list_splice_tail_init(&kernel_pgtable_work.pages, &pages_list); + spin_unlock(&kernel_pgtable_work.lock); + + list_for_each_entry_safe(ptdesc, next, &dtor_list, pt_list) { + list_del(&ptdesc->pt_list); + pagetable_dtor_free(ptdesc); + } + + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { + list_del(&ptdesc->pt_list); + __free_page(ptdesc_page(ptdesc)); + } + + list_for_each_entry_safe(data, temp, &pages_list, node) { + list_del(&data->node); + free_pages(data->addr, data->order); + kfree(data); + } +} + +void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) +{ + spin_lock(&kernel_pgtable_work.lock); + list_add(&ptdesc->pt_list, &kernel_pgtable_work.dtor); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} + +void 
kernel_pgtable_async_free_page_list(struct list_head *list) +{ + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail(list, &kernel_pgtable_work.page); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} + +void kernel_pgtable_async_free_pages(struct page *pages, int order) +{ + struct async_free_data *data; + + if (order == 0) { + spin_lock(&kernel_pgtable_work.lock); + list_add(&page_ptdesc(pages)->pt_list, &kernel_pgtable_work.page); + spin_unlock(&kernel_pgtable_work.lock); + + goto out; + } + + data = kzalloc(sizeof(*data), GFP_ATOMIC); + if (!data) { + free_pages((unsigned long)page_address(pages), order); + goto out; + } + + data->addr = (unsigned long)page_address(pages); + data->order = order; + + spin_lock(&kernel_pgtable_work.lock); + list_add(&data->node, &kernel_pgtable_work.pages); + spin_unlock(&kernel_pgtable_work.lock); + +out: + schedule_work(&kernel_pgtable_work.work); +} +#endif -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-23 3:26 ` Baolu Lu @ 2025-08-25 22:36 ` Dave Hansen 2025-08-26 1:25 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-25 22:36 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/22/25 20:26, Baolu Lu wrote: > +static struct { > + /* list for pagetable_dtor_free() */ > + struct list_head dtor; > + /* list for __free_page() */ > + struct list_head page; > + /* list for free_pages() */ > + struct list_head pages; > + /* protect all the ptdesc lists */ > + spinlock_t lock; > + struct work_struct work; Could you explain a bit why this now needs three separate lists? Seems like pure overkill. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-25 22:36 ` Dave Hansen @ 2025-08-26 1:25 ` Baolu Lu 2025-08-26 2:49 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-26 1:25 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/26/25 06:36, Dave Hansen wrote: > On 8/22/25 20:26, Baolu Lu wrote: >> +static struct { >> + /* list for pagetable_dtor_free() */ >> + struct list_head dtor; >> + /* list for __free_page() */ >> + struct list_head page; >> + /* list for free_pages() */ >> + struct list_head pages; >> + /* protect all the ptdesc lists */ >> + spinlock_t lock; >> + struct work_struct work; > > Could you explain a bit why this now needs three separate lists? Seems > like pure overkill. Yes, sure. The three separate lists are needed because we're handling three distinct types of page deallocation. Grouping the pages this way allows the workqueue handler to free each type using the correct function. - pagetable_dtor_free(): This is for freeing PTE pages, which require specific cleanup of a ptdesc structure. - __free_page(): This is for freeing a single page. - free_pages(): This is for freeing a contiguous block of pages that were allocated together. Using separate lists for each type ensures that every page is handled correctly without having to check the page's type at runtime. While this may seem like overkill, it was chosen to ensure functional correctness. Any better solution? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
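To make the distinction concrete, here is a minimal dispatcher-style sketch of the three release paths named above. The enum and the kpt_release() helper are hypothetical, introduced only for illustration; the three functions it calls are the existing kernel helpers from the reply.

enum kpt_free_kind { KPT_DTOR, KPT_PAGE, KPT_PAGES };

/* Sketch only: shows why each class of page needs a different release call. */
static void kpt_release(enum kpt_free_kind kind, struct page *page, int order)
{
        switch (kind) {
        case KPT_DTOR:
                /* Page table page with ptdesc metadata: run the destructor. */
                pagetable_dtor_free(page_ptdesc(page));
                break;
        case KPT_PAGE:
                /* Single bare page, no destructor to run. */
                __free_page(page);
                break;
        case KPT_PAGES:
                /* Non-compound higher-order block: the order must be carried along. */
                free_pages((unsigned long)page_address(page), order);
                break;
        }
}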
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 1:25 ` Baolu Lu @ 2025-08-26 2:49 ` Baolu Lu 2025-08-26 14:22 ` Dave Hansen 2025-08-26 16:21 ` Dave Hansen 0 siblings, 2 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-26 2:49 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org On 8/26/25 09:25, Baolu Lu wrote: > On 8/26/25 06:36, Dave Hansen wrote: >> On 8/22/25 20:26, Baolu Lu wrote: >>> +static struct { >>> + /* list for pagetable_dtor_free() */ >>> + struct list_head dtor; >>> + /* list for __free_page() */ >>> + struct list_head page; >>> + /* list for free_pages() */ >>> + struct list_head pages; >>> + /* protect all the ptdesc lists */ >>> + spinlock_t lock; >>> + struct work_struct work; >> >> Could you explain a bit why this now needs three separate lists? Seems >> like pure overkill. > > Yes, sure. > > The three separate lists are needed because we're handling three > distinct types of page deallocation. Grouping the pages this way allows > the workqueue handler to free each type using the correct function. Please allow me to add more details. > > - pagetable_dtor_free(): This is for freeing PTE pages, which require > specific cleanup of a ptdesc structure. This is used in static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) and int pud_free_pmd_page(pud_t *pud, unsigned long addr) > > - __free_page(): This is for freeing a single page. This is used in static void cpa_collapse_large_pages(struct cpa_data *cpa) { ... ... list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); __free_page(ptdesc_page(ptdesc)); } } > > - free_pages(): This is for freeing a contiguous block of pages that > were allocated together. This is used in static void __meminit free_pagetable(struct page *page, int order) { ... ... free_pages((unsigned long)page_address(page), order); } What's strange is that order is almost always 0, except in the path of remove_pmd_table() -> free_hugepage_table(), where order can be greater than 0. However, in this context path, free_hugepage_table() isn't used to free a page table page itself. Instead, it's used to free the actual pages that a leaf PMD is pointing to. static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end, bool direct, struct vmem_altmap *altmap) { ... ... if (pmd_leaf(*pmd)) { if (IS_ALIGNED(addr, PMD_SIZE) && IS_ALIGNED(next, PMD_SIZE)) { if (!direct) free_hugepage_table(pmd_page(*pmd), altmap); spin_lock(&init_mm.page_table_lock); pmd_clear(pmd); spin_unlock(&init_mm.page_table_lock); pages++; } else if (vmemmap_pmd_is_unused(addr, next)) { free_hugepage_table(pmd_page(*pmd), altmap); spin_lock(&init_mm.page_table_lock); pmd_clear(pmd); spin_unlock(&init_mm.page_table_lock); } continue; ... ... } Is this a misuse of free_pagetable() or anything overlooked? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 2:49 ` Baolu Lu @ 2025-08-26 14:22 ` Dave Hansen 2025-08-26 14:33 ` Matthew Wilcox 2025-08-27 10:58 ` Baolu Lu 2025-08-26 16:21 ` Dave Hansen 1 sibling, 2 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-26 14:22 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/25/25 19:49, Baolu Lu wrote: >> The three separate lists are needed because we're handling three >> distinct types of page deallocation. Grouping the pages this way allows >> the workqueue handler to free each type using the correct function. > > Please allow me to add more details. Right, I know why it got added this way: it was the quickest way to hack together a patch that fixes the IOMMU issue without refactoring anything. I agree that you have three cases: 1. A full on 'struct ptdesc' that needs its destructor run 2. An order-0 'struct page' 3. A higher-order 'struct page' Long-term, #2 and #3 probably need to get converted over to 'struct ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, any other mm folks: do you agree? Short-term, I'd just consolidate your issue down to a single list. #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in to pagetable_dtor_free() to tell it to use the deferred page table free list. Do this with a bit in the ptdesc or a new argument to pagetable_dtor_free(). #2. Just append these to the deferred page table free list. Easy. #3. The biggest hacky way to do this is to just treat the higher-order non-compound page and put the pages on the deferred page table free list one at a time. The other way to do it is to track down how this thing got allocated in the first place and make sure it's got __GFP_COMP metadata. If so, you can just use __free_pages() for everything. Yeah, it'll take a couple patches up front to refactor some things. But that refactoring will make things more consistent instead of adding adding complexity to deal with the inconsistency. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:22 ` Dave Hansen @ 2025-08-26 14:33 ` Matthew Wilcox 2025-08-26 14:57 ` Dave Hansen 2025-08-27 10:58 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Matthew Wilcox @ 2025-08-26 14:33 UTC (permalink / raw) To: Dave Hansen Cc: Baolu Lu, Tian, Kevin, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola On Tue, Aug 26, 2025 at 07:22:06AM -0700, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: > >> The three separate lists are needed because we're handling three > >> distinct types of page deallocation. Grouping the pages this way allows > >> the workqueue handler to free each type using the correct function. > > > > Please allow me to add more details. > > Right, I know why it got added this way: it was the quickest way to hack > together a patch that fixes the IOMMU issue without refactoring anything. > > I agree that you have three cases: > 1. A full on 'struct ptdesc' that needs its destructor run > 2. An order-0 'struct page' > 3. A higher-order 'struct page' > > Long-term, #2 and #3 probably need to get converted over to 'struct > ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, > any other mm folks: do you agree? Uhh. I'm still quite ignorant about iommu page tables. Let me make some general observations that may be helpful. We are attempting to shrink struct page down to 8 bytes. That's going to proceed in stages over the next two to three years. Depending how much information you need to keep around, you may be able to keep using struct page for a while, but eventually (unless you only need a few bits of information), you're going to need a memdesc for your allocations. One memdesc type already assigned is for page tables. Maybe iommu page tables are the same / similar enough to a CPU page table that we keep the same data structure. Maybe they'll need their own data structure. I lack the knowledge to make that decision. For more on memdescs, please see here: https://kernelnewbies.org/MatthewWilcox/Memdescs > Short-term, I'd just consolidate your issue down to a single list. > > #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in > to pagetable_dtor_free() to tell it to use the deferred page table > free list. Do this with a bit in the ptdesc or a new argument to > pagetable_dtor_free(). We should be able to reuse one of the flags in __page_flags or there are 24 unused bits in __page_type. > #2. Just append these to the deferred page table free list. Easy. > #3. The biggest hacky way to do this is to just treat the higher-order > non-compound page and put the pages on the deferred page table > free list one at a time. The other way to do it is to track down how > this thing got allocated in the first place and make sure it's got > __GFP_COMP metadata. If so, you can just use __free_pages() for > everything. Non-compound allocations are bad news. Is there a good reason these can't be made into compound allocations? > Yeah, it'll take a couple patches up front to refactor some things. But > that refactoring will make things more consistent instead of adding > adding complexity to deal with the inconsistency. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:33 ` Matthew Wilcox @ 2025-08-26 14:57 ` Dave Hansen 0 siblings, 0 replies; 51+ messages in thread From: Dave Hansen @ 2025-08-26 14:57 UTC (permalink / raw) To: Matthew Wilcox Cc: Baolu Lu, Tian, Kevin, Jason Gunthorpe, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola On 8/26/25 07:33, Matthew Wilcox wrote: ... > One memdesc type already assigned is for page tables. Maybe iommu page > tables are the same One bit of context: This is an IOMMU problem but the IOMMU is walking the normal x86 page tables in this case, specifically kernel page tables. There are separate IOMMU page tables, of course, but those aren't at issue here. ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 14:22 ` Dave Hansen 2025-08-26 14:33 ` Matthew Wilcox @ 2025-08-27 10:58 ` Baolu Lu 2025-08-27 23:31 ` Dave Hansen 1 sibling, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-27 10:58 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/26/2025 10:22 PM, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: >>> The three separate lists are needed because we're handling three >>> distinct types of page deallocation. Grouping the pages this way allows >>> the workqueue handler to free each type using the correct function. >> >> Please allow me to add more details. > > Right, I know why it got added this way: it was the quickest way to hack > together a patch that fixes the IOMMU issue without refactoring anything. > > I agree that you have three cases: > 1. A full on 'struct ptdesc' that needs its destructor run > 2. An order-0 'struct page' > 3. A higher-order 'struct page' > > Long-term, #2 and #3 probably need to get converted over to 'struct > ptdesc'. They don't look _that_ hard to convert to me. Willy, Vishal, > any other mm folks: do you agree? > > Short-term, I'd just consolidate your issue down to a single list. > > #1: For 'struct ptdesc', modify pte_free_kernel() to pass information in > to pagetable_dtor_free() to tell it to use the deferred page table > free list. Do this with a bit in the ptdesc or a new argument to > pagetable_dtor_free(). > #2. Just append these to the deferred page table free list. Easy. > #3. The biggest hacky way to do this is to just treat the higher-order > non-compound page and put the pages on the deferred page table > free list one at a time. The other way to do it is to track down how > this thing got allocated in the first place and make sure it's got > __GFP_COMP metadata. If so, you can just use __free_pages() for > everything. > > Yeah, it'll take a couple patches up front to refactor some things. But > that refactoring will make things more consistent instead of adding > adding complexity to deal with the inconsistency. Following the insights above, I wrote the code as follows. Does it look good? 
mm/pgtable-generic.c: #ifdef CONFIG_ASYNC_PGTABLE_FREE /* a 'struct ptdesc' that needs its destructor run */ #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) static void kernel_pgtable_work_func(struct work_struct *work); static struct { struct list_head list; /* protect above ptdesc lists */ spinlock_t lock; struct work_struct work; } kernel_pgtable_work = { .list = LIST_HEAD_INIT(kernel_pgtable_work.list), .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), }; static void kernel_pgtable_work_func(struct work_struct *work) { struct ptdesc *ptdesc, *next; LIST_HEAD(page_list); spin_lock(&kernel_pgtable_work.lock); list_splice_tail_init(&kernel_pgtable_work.list, &page_list); spin_unlock(&kernel_pgtable_work.lock); iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { list_del(&ptdesc->pt_list); if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) pagetable_dtor_free(ptdesc); else __free_page(ptdesc_page(ptdesc)); } } void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) { spin_lock(&kernel_pgtable_work.lock); ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } void kernel_pgtable_async_free_page_list(struct list_head *list) { spin_lock(&kernel_pgtable_work.lock); list_splice_tail(list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } void kernel_pgtable_async_free_page(struct page *page) { spin_lock(&kernel_pgtable_work.lock); list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); spin_unlock(&kernel_pgtable_work.lock); schedule_work(&kernel_pgtable_work.work); } #endif Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-27 10:58 ` Baolu Lu @ 2025-08-27 23:31 ` Dave Hansen 2025-08-28 5:31 ` Baolu Lu 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-27 23:31 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/27/25 03:58, Baolu Lu wrote: > Following the insights above, I wrote the code as follows. Does it look > good? I'd really prefer an actual diff that compiles. Because: > #ifdef CONFIG_ASYNC_PGTABLE_FREE > /* a 'struct ptdesc' that needs its destructor run */ > #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) This doesn't work, does it? Don't you need to _allocate_ a new bit? Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;) Also, I'm pretty sure you can't just go setting random bits in this because it aliases with page flags. ... > static void kernel_pgtable_work_func(struct work_struct *work) > { > struct ptdesc *ptdesc, *next; > LIST_HEAD(page_list); > > spin_lock(&kernel_pgtable_work.lock); > list_splice_tail_init(&kernel_pgtable_work.list, &page_list); > spin_unlock(&kernel_pgtable_work.lock); > > iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > > list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { > list_del(&ptdesc->pt_list); > if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) > pagetable_dtor_free(ptdesc); > else > __free_page(ptdesc_page(ptdesc)); > } > } What I had in mind was that kernel_pgtable_work_func() only does: __free_pages(page, compound_order(page)); All of the things that queue gunk to the list do the legwork of making the work_func() simple. This also guides them toward proving a single, compound page because _they_ have to do the work if they don't want something simple. > void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) > { > spin_lock(&kernel_pgtable_work.lock); > ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; > list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } > > void kernel_pgtable_async_free_page_list(struct list_head *list) > { > spin_lock(&kernel_pgtable_work.lock); > list_splice_tail(list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } > > void kernel_pgtable_async_free_page(struct page *page) > { > spin_lock(&kernel_pgtable_work.lock); > list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); > spin_unlock(&kernel_pgtable_work.lock); > > schedule_work(&kernel_pgtable_work.work); > } I wouldn't have three of these, I'd have _one_. If you want to free a bunch of pages, then just call it a bunch of times. This is not performance critical. Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously freed". All kernel PTE pages should ideally set it at allocation so you can do this: static inline void pagetable_free(struct ptdesc *pt) { struct page *page = ptdesc_page(pt); if (ptdesc->some_field | PTDESC_KERNEL) kernel_pgtable_async_free_page(page); else __free_pages(page, compound_order(page)); } The folks that, today, call pagetable_dtor_free() don't have to do _anything_ at free time. 
They just set the PTDESC_KERNEL bit at allocation time. See how that code reads? "If you have a kernel page table page, you must asynchronously free it." That's actually meaningful. I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So I'd just do this: #define PTDESC_TYPE_KERNEL BIT(0) Then, something needs to set: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; That could happen as late as here: static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { struct ptdesc *ptdesc = virt_to_ptdesc(pte); // here pagetable_dtor_free(ptdesc); } Or as early as ptdesc allocation/constructor time. Then you check for PTDESC_TYPE_KERNEL in pagetable_free(). ^ permalink raw reply [flat|nested] 51+ messages in thread
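Note that, read literally, the pagetable_free() sketch above tests "ptdesc->some_field | PTDESC_KERNEL", which is always true, and it names a variable that is not the function's parameter; presumably a bitwise AND on 'pt' was intended. A corrected reading of the sketch, assuming the PTDESC_TYPE_KERNEL bit in __page_type described above and the kernel_pgtable_async_free_page() helper from the earlier draft in this thread:

#define PTDESC_TYPE_KERNEL      BIT(0)  /* lower bits of __page_type are reportedly unused */

static inline void pagetable_free(struct ptdesc *pt)
{
        struct page *page = ptdesc_page(pt);

        if (pt->__page_type & PTDESC_TYPE_KERNEL)
                kernel_pgtable_async_free_page(page);   /* defer to the worker */
        else
                __free_pages(page, compound_order(page));
}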
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-27 23:31 ` Dave Hansen @ 2025-08-28 5:31 ` Baolu Lu 2025-08-28 7:08 ` Tian, Kevin 0 siblings, 1 reply; 51+ messages in thread From: Baolu Lu @ 2025-08-28 5:31 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola, Matthew Wilcox On 8/28/25 07:31, Dave Hansen wrote: > On 8/27/25 03:58, Baolu Lu wrote: >> Following the insights above, I wrote the code as follows. Does it look >> good? > > I'd really prefer an actual diff that compiles. Because: > >> #ifdef CONFIG_ASYNC_PGTABLE_FREE >> /* a 'struct ptdesc' that needs its destructor run */ >> #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS) > > This doesn't work, does it? Don't you need to _allocate_ a new bit? > Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;) > > Also, I'm pretty sure you can't just go setting random bits in this > because it aliases with page flags. > > ... >> static void kernel_pgtable_work_func(struct work_struct *work) >> { >> struct ptdesc *ptdesc, *next; >> LIST_HEAD(page_list); >> >> spin_lock(&kernel_pgtable_work.lock); >> list_splice_tail_init(&kernel_pgtable_work.list, &page_list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> >> list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { >> list_del(&ptdesc->pt_list); >> if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR) >> pagetable_dtor_free(ptdesc); >> else >> __free_page(ptdesc_page(ptdesc)); >> } >> } > What I had in mind was that kernel_pgtable_work_func() only does: > > __free_pages(page, compound_order(page)); > > All of the things that queue gunk to the list do the legwork of making > the work_func() simple. This also guides them toward proving a single, > compound page because _they_ have to do the work if they don't want > something simple. > >> void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc) >> { >> spin_lock(&kernel_pgtable_work.lock); >> ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR; >> list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } >> >> void kernel_pgtable_async_free_page_list(struct list_head *list) >> { >> spin_lock(&kernel_pgtable_work.lock); >> list_splice_tail(list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } >> >> void kernel_pgtable_async_free_page(struct page *page) >> { >> spin_lock(&kernel_pgtable_work.lock); >> list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list); >> spin_unlock(&kernel_pgtable_work.lock); >> >> schedule_work(&kernel_pgtable_work.work); >> } > > I wouldn't have three of these, I'd have _one_. If you want to free a > bunch of pages, then just call it a bunch of times. This is not > performance critical. > > Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously > freed". 
All kernel PTE pages should ideally set it at allocation so you > can do this: > > static inline void pagetable_free(struct ptdesc *pt) > { > struct page *page = ptdesc_page(pt); > > if (ptdesc->some_field | PTDESC_KERNEL) > kernel_pgtable_async_free_page(page); > else > __free_pages(page, compound_order(page)); > } > > The folks that, today, call pagetable_dtor_free() don't have to do > _anything_ at free time. They just set the PTDESC_KERNEL bit at > allocation time. > > See how that code reads? "If you have a kernel page table page, you must > asynchronously free it." That's actually meaningful. > > I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So > I'd just do this: > > #define PTDESC_TYPE_KERNEL BIT(0) > > Then, something needs to set: > > ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > > That could happen as late as here: > > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > struct ptdesc *ptdesc = virt_to_ptdesc(pte); > > // here > > pagetable_dtor_free(ptdesc); > } > > Or as early as ptdesc allocation/constructor time. Then you check for > PTDESC_TYPE_KERNEL in pagetable_free(). Very appreciated for your review comments. I changed the patch accordingly like this: [PATCH 1/2] mm: Conditionally defer freeing of kernel page table pages On x86 and other architectures that map the kernel's virtual address space into the upper portion of every process's page table, the IOMMU's paging structure caches can become stale when the CPU page table is shared with IOMMU in the Shared Virtual Address (SVA) context. This occurs when a page used for the kernel's page tables is freed and reused without the IOMMU being notified. While the IOMMU driver is notified of changes to user virtual address mappings, there is no similar notification mechanism for kernel page table changes. This can lead to data corruption or system instability when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal caches may retain stale entries for kernel virtual addresses. This introduces a conditional asynchronous mechanism, enabled by CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing of pages that are used as page tables for kernel address mappings. These pages are now queued to a work struct instead of being freed immediately. This deferred freeing provides a safe context for a future patch to add an IOMMU-specific callback, which might be expensive on large-scale systems. This ensures the necessary IOMMU cache invalidation is performed before the page is finally returned to the page allocator outside of any critical, non-sleepable path. In the current kernel, some page table pages are allocated with an associated struct ptdesc, while others are not. Those without a ptdesc are freed using free_pages() and its variants, which bypasses the destructor that pagetable_dtor_free() would run. While the long-term plan is to convert all page table pages to use struct ptdesc, this uses a temporary flag within ptdesc to indicate whether a page needs a destructor, considering that this aims to fix a potential security issue in IOMMU SVA. The flag and its associated logic can be removed once the conversion is complete. 
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> --- arch/x86/mm/init_64.c | 9 ++++++- arch/x86/mm/pat/set_memory.c | 5 +++- arch/x86/mm/pgtable.c | 9 ++++++- include/asm-generic/pgalloc.h | 15 +++++++++++- include/linux/mm_types.h | 8 ++++++- mm/Kconfig | 3 +++ mm/pgtable-generic.c | 45 +++++++++++++++++++++++++++++++++++ 7 files changed, 89 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 76e33bd7c556..8a050cfa0f8d 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1013,7 +1013,14 @@ static void __meminit free_pagetable(struct page *page, int order) free_reserved_pages(page, nr_pages); #endif } else { - free_pages((unsigned long)page_address(page), order); + /* + * 'order == 0' means a kernel page table page, otherwise it's + * data pages pointed by a huge page leaf pte. + */ + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) && !order) + kernel_pgtable_async_free(page_ptdesc(page)); + else + free_pages((unsigned long)page_address(page), order); } } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 8834c76f91c9..9a552e93d1f1 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -438,7 +438,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa) list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { list_del(&ptdesc->pt_list); - __free_page(ptdesc_page(ptdesc)); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free(ptdesc); + else + __free_page(ptdesc_page(ptdesc)); } } diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index ddf248c3ee7d..6f400cff2ad4 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -757,7 +757,14 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr) free_page((unsigned long)pmd_sv); - pmd_free(&init_mm, pmd); + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); + + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; + kernel_pgtable_async_free(ptdesc); + } else { + pmd_free(&init_mm, pmd); + } return 1; } diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h index 3c8ec3bfea44..d92dc5289ef4 100644 --- a/include/asm-generic/pgalloc.h +++ b/include/asm-generic/pgalloc.h @@ -46,6 +46,12 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) #define pte_alloc_one_kernel(...) alloc_hooks(pte_alloc_one_kernel_noprof(__VA_ARGS__)) #endif +#ifdef CONFIG_ASYNC_PGTABLE_FREE +void kernel_pgtable_async_free(struct ptdesc *ptdesc); +#else +static inline void kernel_pgtable_async_free(struct ptdesc *ptdesc) {} +#endif + /** * pte_free_kernel - free PTE-level kernel page table memory * @mm: the mm_struct of the current context @@ -53,7 +59,14 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm) */ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) { - pagetable_dtor_free(virt_to_ptdesc(pte)); + struct ptdesc *ptdesc = virt_to_ptdesc(pte); + + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; + + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) + kernel_pgtable_async_free(ptdesc); + else + pagetable_dtor_free(ptdesc); } /** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..f74c6b167e6a 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -539,7 +539,7 @@ FOLIO_MATCH(compound_head, _head_3); * @pt_share_count: Used for HugeTLB PMD page table share count. * @_pt_pad_2: Padding to ensure proper alignment. 
* @ptl: Lock for the page table. - * @__page_type: Same as page->page_type. Unused for page tables. + * @__page_type: Same as page->page_type. * @__page_refcount: Same as page refcount. * @pt_memcg_data: Memcg data. Tracked for page tables here. * @@ -637,6 +637,12 @@ static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc) } #endif +/* + * Used by ptdesc::__page_type to indicate a page for kernel address page + * table mapping. + */ +#define PTDESC_TYPE_KERNEL BIT(0) + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/mm/Kconfig b/mm/Kconfig index e443fe8cd6cf..369f3259e6da 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1346,6 +1346,9 @@ config LOCK_MM_AND_FIND_VMA config IOMMU_MM_DATA bool +config ASYNC_PGTABLE_FREE + bool + config EXECMEM bool diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index 567e2d084071..024d42480c51 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -406,3 +406,48 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd, pte_unmap_unlock(pte, ptl); goto again; } + +#ifdef CONFIG_ASYNC_PGTABLE_FREE +static void kernel_pgtable_work_func(struct work_struct *work); + +static struct { + struct list_head list; + /* protect above ptdesc lists */ + spinlock_t lock; + struct work_struct work; +} kernel_pgtable_work = { + .list = LIST_HEAD_INIT(kernel_pgtable_work.list), + .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock), + .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func), +}; + +static void kernel_pgtable_work_func(struct work_struct *work) +{ + struct ptdesc *ptdesc, *next; + LIST_HEAD(page_list); + + spin_lock(&kernel_pgtable_work.lock); + list_splice_tail_init(&kernel_pgtable_work.list, &page_list); + spin_unlock(&kernel_pgtable_work.lock); + + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { + list_del(&ptdesc->pt_list); + if (ptdesc->__page_type & PTDESC_TYPE_KERNEL) { + pagetable_dtor_free(ptdesc); + } else { + struct page *page = ptdesc_page(ptdesc); + + __free_pages(page, compound_order(page)); + } + } +} + +void kernel_pgtable_async_free(struct ptdesc *ptdesc) +{ + spin_lock(&kernel_pgtable_work.lock); + list_add(&ptdesc->pt_list, &kernel_pgtable_work.list); + spin_unlock(&kernel_pgtable_work.lock); + + schedule_work(&kernel_pgtable_work.work); +} +#endif -- 2.43.0 Thanks, baolu ^ permalink raw reply related [flat|nested] 51+ messages in thread
* RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 5:31 ` Baolu Lu @ 2025-08-28 7:08 ` Tian, Kevin 2025-08-28 18:56 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Tian, Kevin @ 2025-08-28 7:08 UTC (permalink / raw) To: Baolu Lu, Hansen, Dave, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox > From: Baolu Lu <baolu.lu@linux.intel.com> > Sent: Thursday, August 28, 2025 1:31 PM > > @@ -438,7 +438,10 @@ static void cpa_collapse_large_pages(struct > cpa_data *cpa) > > list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) { > list_del(&ptdesc->pt_list); > - __free_page(ptdesc_page(ptdesc)); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_async_free(ptdesc); > + else > + __free_page(ptdesc_page(ptdesc)); > } Dave's suggestion is to check the new ptdesc flag and defer in pagetable_free(). both here and above could be converted to: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_free(ptdesc); > @@ -757,7 +757,14 @@ int pud_free_pmd_page(pud_t *pud, unsigned long > addr) > > free_page((unsigned long)pmd_sv); > > - pmd_free(&init_mm, pmd); > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) { > + struct ptdesc *ptdesc = virt_to_ptdesc(pmd); > + > + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > + kernel_pgtable_async_free(ptdesc); > + } else { > + pmd_free(&init_mm, pmd); > + } We may add a new pmd_free_kernel() helper, which does: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_dtor_free(ptdesc); > static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte) > { > - pagetable_dtor_free(virt_to_ptdesc(pte)); > + struct ptdesc *ptdesc = virt_to_ptdesc(pte); > + > + ptdesc->__page_type |= PTDESC_TYPE_KERNEL; > + > + if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE)) > + kernel_pgtable_async_free(ptdesc); > + else > + pagetable_dtor_free(ptdesc); > } same: ptdesc->__page_type |= PTDESC_TYPE_KERNEL; pagetable_dtor_free(ptdesc); Then you have pagetable_free() to handle defer in one place (revised on Dave's draft): static inline void pagetable_free(struct ptdesc *pt) { struct page *page = ptdesc_page(pt); if (IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) && (ptdesc->__page_type | PTDESC_KERNEL)) kernel_pgtable_async_free_page(page); else __free_pages(page, compound_order(page)); } > +static void kernel_pgtable_work_func(struct work_struct *work) > +{ > + struct ptdesc *ptdesc, *next; > + LIST_HEAD(page_list); > + > + spin_lock(&kernel_pgtable_work.lock); > + list_splice_tail_init(&kernel_pgtable_work.list, &page_list); > + spin_unlock(&kernel_pgtable_work.lock); > + > + list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) { > + list_del(&ptdesc->pt_list); > + if (ptdesc->__page_type & PTDESC_TYPE_KERNEL) { > + pagetable_dtor_free(ptdesc); > + } else { > + struct page *page = ptdesc_page(ptdesc); > + > + __free_pages(page, compound_order(page)); > + } Then you only need __free_pages() here. ^ permalink raw reply [flat|nested] 51+ messages in thread
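Putting the suggestions above together (and again assuming the flag test is a bitwise AND against PTDESC_TYPE_KERNEL rather than an OR), the call-site side might look roughly like this. pmd_free_kernel() is the new helper proposed above, pagetable_free() is assumed to carry the IS_ENABLED(CONFIG_ASYNC_PGTABLE_FREE) plus flag check in one place as suggested, and free_collapsed_pgtable() is just an illustrative wrapper for the cpa_collapse_large_pages() conversion.

/* Proposed helper: tag the pmd's ptdesc as a kernel page table, then free it. */
static inline void pmd_free_kernel(struct mm_struct *mm, pmd_t *pmd)
{
        struct ptdesc *ptdesc = virt_to_ptdesc(pmd);

        ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
        pagetable_dtor_free(ptdesc);    /* runs the dtor, then pagetable_free() */
}

/* Illustrative wrapper: what each list entry in cpa_collapse_large_pages() becomes. */
static void free_collapsed_pgtable(struct ptdesc *ptdesc)
{
        ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
        pagetable_free(ptdesc);
}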
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 7:08 ` Tian, Kevin @ 2025-08-28 18:56 ` Dave Hansen 2025-08-28 19:10 ` Jason Gunthorpe 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-28 18:56 UTC (permalink / raw) To: Tian, Kevin, Baolu Lu, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox I got a bit sick of sending constant emails on this one and just decided to hack it together myself. Here's roughly what I came up with. Completely untested. Changelogs are bare. Barely compiles: > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=kpte I also didn't actually hook up the IOMMU flush yet. The biggest single chunk of code is defining ptdesc_*_kernel(). The rest of it is painfully simple. I'm pretty sure this hits the vmalloc() case that we all care about. It needs some more time with folks staring at it to make sure it didn't miss some of the other cases. I'll wrangle this into a real series and send it via another thread unless someone beats me to it. If anyone sees anything in there they can't live with, please scream sooner rather than later. ^ permalink raw reply [flat|nested] 51+ messages in thread
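The branch itself is not quoted in the thread, so purely as a guess at the shape of the ptdesc_*_kernel() helpers mentioned above (the follow-up messages indicate the bit is PG_referenced reused through the folio accessors):

/* Hypothetical helpers; the choice of PG_referenced is debated below. */
static inline void ptdesc_set_kernel(struct ptdesc *pt)
{
        folio_set_referenced(ptdesc_folio(pt));
}

static inline bool ptdesc_test_kernel(struct ptdesc *pt)
{
        return folio_test_referenced(ptdesc_folio(pt));
}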
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 18:56 ` Dave Hansen @ 2025-08-28 19:10 ` Jason Gunthorpe 2025-08-28 19:31 ` Dave Hansen 0 siblings, 1 reply; 51+ messages in thread From: Jason Gunthorpe @ 2025-08-28 19:10 UTC (permalink / raw) To: Dave Hansen Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox On Thu, Aug 28, 2025 at 11:56:00AM -0700, Dave Hansen wrote: > I got a bit sick of sending constant emails on this one and just decided > to hack it together myself. Here's roughly what I came up with. > Completely untested. Changelogs are bare. Barely compiles: > > > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/devel.git/log/?h=kpte > > I also didn't actually hook up the IOMMU flush yet. > > The biggest single chunk of code is defining ptdesc_*_kernel(). The rest > of it is painfully simple. Seems not great to be casting ptdesc to folio just to use folio_set_referenced(); I'm pretty sure that is not the direction things are going in. I'd set the flag yourself, or perhaps use the memory of __page_mapping to store a flag (but zero it before freeing). Looks OK otherwise Jason ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 19:10 ` Jason Gunthorpe @ 2025-08-28 19:31 ` Dave Hansen 2025-08-28 19:39 ` Matthew Wilcox 0 siblings, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-28 19:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com, Matthew Wilcox On 8/28/25 12:10, Jason Gunthorpe wrote: >> The biggest single chunk of code is defining ptdesc_*_kernel(). The rest >> of it is painfully simple. > Seems not great to be casting ptdesc to folio just to use > folio_set_referenced(), I'm pretty sure that is no the direction > things are going in.. Ideally, the ptdesc->__page_flags fields would have their own set of macros just like folios do, right? Alas, I was too lazy to go to the trouble of conjuring those up for this single bit. > I'd set the flag yourself or perhaps use the memory of __page_mapping > to store a flag (but zero it before freeing). That was my first inclination as well (actually I was going to use __page_type since the lower 24 bits are evidently open, but same idea). Willy pointed me in the direction of PG_referenced on IRC, so I took that path. But I'm open to doing whatever. It's just three lines of code that probably end up boiling down to bts 0x1, 0x8(%rax) versus bts 0x2, 0x16(%rax) ;) ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-28 19:31 ` Dave Hansen @ 2025-08-28 19:39 ` Matthew Wilcox 0 siblings, 0 replies; 51+ messages in thread From: Matthew Wilcox @ 2025-08-28 19:39 UTC (permalink / raw) To: Dave Hansen Cc: Jason Gunthorpe, Tian, Kevin, Baolu Lu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org, vishal.moola@gmail.com On Thu, Aug 28, 2025 at 12:31:40PM -0700, Dave Hansen wrote: > On 8/28/25 12:10, Jason Gunthorpe wrote: > >> The biggest single chunk of code is defining ptdesc_*_kernel(). The rest > >> of it is painfully simple. > > Seems not great to be casting ptdesc to folio just to use > > folio_set_referenced(), I'm pretty sure that is no the direction > > things are going in.. > > Ideally, the ptdesc->__page_flags fields would have their own set of > macros just like folios do, right? Alas, I was too lazy to go to the > trouble of conjuring those up for this single bit. There's a bunch of "Ideally" in ptdesc, but I'm equally too lazy to do that before I've proven everything works great with slab and/or folios. It'll have to be changed no matter which way we do it, so as long as it's obvious what needs to be changed, it's fine. I deem this "good enough for now". ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 2:49 ` Baolu Lu 2025-08-26 14:22 ` Dave Hansen @ 2025-08-26 16:21 ` Dave Hansen 2025-08-27 6:34 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Dave Hansen @ 2025-08-26 16:21 UTC (permalink / raw) To: Baolu Lu, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org On 8/25/25 19:49, Baolu Lu wrote: > What's strange is that order is almost always 0, except in the path of > remove_pmd_table() -> free_hugepage_table(), where order can be greater > than 0. However, in this context path, free_hugepage_table() isn't used > to free a page table page itself. Instead, it's used to free the actual > pages that a leaf PMD is pointing to. When I first read this, I thought you meant that remove_pmd_table() was trying to free a high-order page composed of multiple pte pages. I don't think it is doing that. Just to be clear: remove_pmd_table() has two modes: 1. The pmd_leaf() code that calls free_hugepage_table(). It is removing the memory pointed to by a PMD *entry*. It is *NOT* removing page tables themselves. 2. The !pmd_leaf() code that does remove the pointers to individual pte pages and frees them via free_pte_table(). *Neither* of these is freeing high-order page table pages. It either frees plain old kernel data pages or it frees an order-0 page table page. In other words, the pmd_leaf() (mode #1) code is irrelevant to us. Those entries are all _PAGE_KERNEL so IOMMU accesses with user privilege can't do anything with them. Or have I just horribly confused myself? ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-26 16:21 ` Dave Hansen @ 2025-08-27 6:34 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-27 6:34 UTC (permalink / raw) To: Dave Hansen, Tian, Kevin, Jason Gunthorpe Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1, iommu@lists.linux.dev, security@kernel.org, linux-kernel@vger.kernel.org On 8/27/25 00:21, Dave Hansen wrote: > On 8/25/25 19:49, Baolu Lu wrote: >> What's strange is that order is almost always 0, except in the path of >> remove_pmd_table() -> free_hugepage_table(), where order can be greater >> than 0. However, in this context path, free_hugepage_table() isn't used >> to free a page table page itself. Instead, it's used to free the actual >> pages that a leaf PMD is pointing to. > When I first read this, I thought you meant that remove_pmd_table() was > trying to free a high-order page composed of multiple pte pages. I don't > think it is doing that. > > Just to be clear: remove_pmd_table() has two modes: > > 1. The pmd_leaf() code that calls free_hugepage_table(). It is > removing the memory pointed to by a PMD*entry*. It is*NOT* > removing page tables themselves. > 2. The !pmd_leaf() code that does remove the pointers to > individual pte pages and frees them via free_pte_table(). > > *Neither* of these is freeing high-order page tables pages. It either > frees plain old kernel data pages or it frees an order-0 page table page. > > In other words, the pmd_leaf() (mode #1) code is irrelevant to us. Those > entries are all _PAGE_KERNEL so IOMMU accesses with user privilege can't > do anything with them. I have the same understanding, so case #3 (higher-order non-compound page table pages) doesn't exist. Page table pages always have an order equal to 0. I will head in the direction you pointed out in your previous reply and post the change later for further review. Thank you for the insights. > > Or have I just horribly confused myself? Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-07 19:51 ` Jason Gunthorpe 2025-08-08 2:57 ` Tian, Kevin @ 2025-08-08 5:08 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-08 5:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security, linux-kernel, stable On 8/8/25 03:51, Jason Gunthorpe wrote: > On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote: >> +static void kernel_pte_work_func(struct work_struct *work) >> +{ >> + struct page *page, *next; >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); >> + >> + guard(spinlock)(&kernel_pte_work.lock); >> + list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) { >> + list_del_init(&page->lru); > Please don't add new usages of lru, we are trying to get rid of this. 🙁 > > I think the memory should be struct ptdesc, use that.. Yes, sure. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 15:03 ` Dave Hansen 2025-08-06 15:52 ` Jason Gunthorpe @ 2025-08-07 6:53 ` Baolu Lu 1 sibling, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-07 6:53 UTC (permalink / raw) To: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/6/25 23:03, Dave Hansen wrote: > On 8/5/25 22:25, Lu Baolu wrote: >> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware >> shares and walks the CPU's page tables. The Linux x86 architecture maps >> the kernel address space into the upper portion of every process’s page >> table. Consequently, in an SVA context, the IOMMU hardware can walk and >> cache kernel space mappings. However, the Linux kernel currently lacks >> a notification mechanism for kernel space mapping changes. This means >> the IOMMU driver is not aware of such changes, leading to a break in >> IOMMU cache coherence. > FWIW, I wouldn't use the term "cache coherence" in this context. I'd > probably just call them "stale IOTLB entries". > > I also think this over states the problem. There is currently no problem > with "kernel space mapping changes". The issue is solely when kernel > page table pages are freed and reused. > >> Modern IOMMUs often cache page table entries of the intermediate-level >> page table as long as the entry is valid, no matter the permissions, to >> optimize walk performance. Currently the iommu driver is notified only >> for changes of user VA mappings, so the IOMMU's internal caches may >> retain stale entries for kernel VA. When kernel page table mappings are >> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale >> entries, Use-After-Free (UAF) vulnerability condition arises. >> >> If these freed page table pages are reallocated for a different purpose, >> potentially by an attacker, the IOMMU could misinterpret the new data as >> valid page table entries. This allows the IOMMU to walk into attacker- >> controlled memory, leading to arbitrary physical memory DMA access or >> privilege escalation. > Note that it's not just use-after-free. It's literally that the IOMMU > will keep writing Accessed and Dirty bits while it thinks the page is > still a page table. The IOMMU will sit there happily setting bits. So, > it's_write_ after free too. > >> To mitigate this, introduce a new iommu interface to flush IOMMU caches. >> This interface should be invoked from architecture-specific code that >> manages combined user and kernel page tables, whenever a kernel page table >> update is done and the CPU TLB needs to be flushed. > There's one tidbit missing from this: > > Currently SVA contexts are all unprivileged. They can only > access user mappings and never kernel mappings. However, IOMMU > still walk kernel-only page tables all the way down to the leaf > where they realize that the entry is a kernel mapping and error > out. Thank you for the guidance. I will improve the commit message accordingly, as follows. iommu/sva: Invalidate stale IOTLB entries for kernel address space In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware shares and walks the CPU's page tables. The x86 architecture maps the kernel's virtual address space into the upper portion of every process's page table. 
Consequently, in an SVA context, the IOMMU hardware can walk and cache kernel page table entries. The Linux kernel currently lacks a notification mechanism for kernel page table changes, specifically when page table pages are freed and reused. The IOMMU driver is only notified of changes to user virtual address mappings. This can cause the IOMMU's internal caches to retain stale entries for kernel VA. A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when kernel page table pages are freed and later reallocated. The IOMMU could misinterpret the new data as valid page table entries. The IOMMU might then walk into attacker-controlled memory, leading to arbitrary physical memory DMA access or privilege escalation. This is also a Write-After-Free issue, as the IOMMU will potentially continue to write Accessed and Dirty bits to the freed memory while attempting to walk the stale page tables. Currently, SVA contexts are unprivileged and cannot access kernel mappings. However, the IOMMU will still walk kernel-only page tables all the way down to the leaf entries, where it realizes the mapping is for the kernel and errors out. This means the IOMMU still caches these intermediate page table entries, making the described vulnerability a real concern. To mitigate this, a new IOMMU interface is introduced to flush IOTLB entries for the kernel address space. This interface is invoked from the x86 architecture code that manages combined user and kernel page tables, specifically when a kernel page table update requires a CPU TLB flush. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-06 5:25 [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush Lu Baolu 2025-08-06 15:03 ` Dave Hansen @ 2025-08-14 4:48 ` Ethan Zhao 2025-08-15 7:48 ` Baolu Lu 1 sibling, 1 reply; 51+ messages in thread From: Ethan Zhao @ 2025-08-14 4:48 UTC (permalink / raw) To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: iommu, security, linux-kernel, stable On 8/6/2025 1:25 PM, Lu Baolu wrote: > In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware > shares and walks the CPU's page tables. The Linux x86 architecture maps > the kernel address space into the upper portion of every process’s page > table. Consequently, in an SVA context, the IOMMU hardware can walk and > cache kernel space mappings. However, the Linux kernel currently lacks > a notification mechanism for kernel space mapping changes. This means > the IOMMU driver is not aware of such changes, leading to a break in > IOMMU cache coherence. > > Modern IOMMUs often cache page table entries of the intermediate-level > page table as long as the entry is valid, no matter the permissions, to > optimize walk performance. Currently the iommu driver is notified only > for changes of user VA mappings, so the IOMMU's internal caches may > retain stale entries for kernel VA. When kernel page table mappings are > changed (e.g., by vfree()), but the IOMMU's internal caches retain stale > entries, Use-After-Free (UAF) vulnerability condition arises. > > If these freed page table pages are reallocated for a different purpose, > potentially by an attacker, the IOMMU could misinterpret the new data as > valid page table entries. This allows the IOMMU to walk into attacker- > controlled memory, leading to arbitrary physical memory DMA access or > privilege escalation. > > To mitigate this, introduce a new iommu interface to flush IOMMU caches. > This interface should be invoked from architecture-specific code that > manages combined user and kernel page tables, whenever a kernel page table > update is done and the CPU TLB needs to be flushed. > > Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") > Cc: stable@vger.kernel.org > Suggested-by: Jann Horn <jannh@google.com> > Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > Tested-by: Yi Lai <yi1.lai@intel.com> > --- > arch/x86/mm/tlb.c | 4 +++ > drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- > include/linux/iommu.h | 4 +++ > 3 files changed, 67 insertions(+), 1 deletion(-) > > Change log: > v3: > - iommu_sva_mms is an unbound list; iterating it in an atomic context > could introduce significant latency issues. Schedule it in a kernel > thread and replace the spinlock with a mutex. > - Replace the static key with a normal bool; it can be brought back if > data shows the benefit. > - Invalidate KVA range in the flush_tlb_all() paths. > - All previous reviewed-bys are preserved. Please let me know if there > are any objections. 
> > v2: > - https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/ > - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); > - Replace the mutex with a spinlock to make the interface usable in the > critical regions. > > v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/ > > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 39f80111e6f1..3b85e7d3ba44 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -12,6 +12,7 @@ > #include <linux/task_work.h> > #include <linux/mmu_notifier.h> > #include <linux/mmu_context.h> > +#include <linux/iommu.h> > > #include <asm/tlbflush.h> > #include <asm/mmu_context.h> > @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) > else > /* Fall back to the IPI-based invalidation. */ > on_each_cpu(do_flush_tlb_all, NULL, 1); > + > + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); Establishing such a simple one-to-one connection between CPU TLB flush and IOMMU TLB flush is debatable. At the very least, not every process is attached to an IOMMU SVA domain. Currently, devices and IOMMU operating in scalable mode are not commonly applied to every process. Thanks, Ethan> } > > /* Flush an arbitrarily large range of memory with INVLPGB. */ > @@ -1540,6 +1543,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end) > kernel_tlb_flush_range(info); > > put_flush_tlb_info(); > + iommu_sva_invalidate_kva_range(start, end); > } > > /* > diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c > index 1a51cfd82808..d0da2b3fd64b 100644 > --- a/drivers/iommu/iommu-sva.c > +++ b/drivers/iommu/iommu-sva.c > @@ -10,6 +10,8 @@ > #include "iommu-priv.h" > > static DEFINE_MUTEX(iommu_sva_lock); > +static bool iommu_sva_present; > +static LIST_HEAD(iommu_sva_mms); > static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > struct mm_struct *mm); > > @@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de > return ERR_PTR(-ENOSPC); > } > iommu_mm->pasid = pasid; > + iommu_mm->mm = mm; > INIT_LIST_HEAD(&iommu_mm->sva_domains); > /* > * Make sure the write to mm->iommu_mm is not reordered in front of > @@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm > if (ret) > goto out_free_domain; > domain->users = 1; > - list_add(&domain->next, &mm->iommu_mm->sva_domains); > > + if (list_empty(&iommu_mm->sva_domains)) { > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, true); > + list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms); > + } > + list_add(&domain->next, &iommu_mm->sva_domains); > out: > refcount_set(&handle->users, 1); > mutex_unlock(&iommu_sva_lock); > @@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle) > list_del(&domain->next); > iommu_domain_free(domain); > } > + > + if (list_empty(&iommu_mm->sva_domains)) { > + list_del(&iommu_mm->mm_list_elm); > + if (list_empty(&iommu_sva_mms)) > + WRITE_ONCE(iommu_sva_present, false); > + } > + > mutex_unlock(&iommu_sva_lock); > kfree(handle); > } > @@ -312,3 +327,46 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, > > return domain; > } > + > +struct kva_invalidation_work_data { > + struct work_struct work; > + unsigned long start; > + unsigned long end; > +}; > + > +static void invalidate_kva_func(struct work_struct *work) > +{ > + struct kva_invalidation_work_data *data = > + container_of(work, struct kva_invalidation_work_data, work); > + struct 
iommu_mm_data *iommu_mm; > + > + guard(mutex)(&iommu_sva_lock); > + list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm) > + mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, > + data->start, data->end); > + > + kfree(data); > +} > + > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) > +{ > + struct kva_invalidation_work_data *data; > + > + if (likely(!READ_ONCE(iommu_sva_present))) > + return; > + > + /* will be freed in the task function */ > + data = kzalloc(sizeof(*data), GFP_ATOMIC); > + if (!data) > + return; > + > + data->start = start; > + data->end = end; > + INIT_WORK(&data->work, invalidate_kva_func); > + > + /* > + * Since iommu_sva_mms is an unbound list, iterating it in an atomic > + * context could introduce significant latency issues. > + */ > + schedule_work(&data->work); > +} > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index c30d12e16473..66e4abb2df0d 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -1134,7 +1134,9 @@ struct iommu_sva { > > struct iommu_mm_data { > u32 pasid; > + struct mm_struct *mm; > struct list_head sva_domains; > + struct list_head mm_list_elm; > }; > > int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode); > @@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, > struct mm_struct *mm); > void iommu_sva_unbind_device(struct iommu_sva *handle); > u32 iommu_sva_get_pasid(struct iommu_sva *handle); > +void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end); > #else > static inline struct iommu_sva * > iommu_sva_bind_device(struct device *dev, struct mm_struct *mm) > @@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm) > } > > static inline void mm_pasid_drop(struct mm_struct *mm) {} > +static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {} > #endif /* CONFIG_IOMMU_SVA */ > > #ifdef CONFIG_IOMMU_IOPF ^ permalink raw reply [flat|nested] 51+ messages in thread
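[To make the coupling Ethan questions concrete: with the hunks above applied, any kernel-VA teardown that reaches flush_tlb_kernel_range(), for example the vfree() case named in the commit message, also enters the IOMMU path, whether or not any SVA domain exists. A trivial illustration follows; the wrapper function is hypothetical, while vmalloc()/vfree() are the standard interfaces.]

        #include <linux/mm.h>
        #include <linux/vmalloc.h>

        /*
         * Hypothetical illustration: tearing down a vmalloc mapping eventually
         * triggers flush_tlb_kernel_range() for the freed kernel VA range,
         * which with this patch also calls iommu_sva_invalidate_kva_range().
         * That call is an early-return no-op unless some mm is currently
         * bound to an SVA domain.
         */
        static void example_kernel_va_teardown(void)
        {
                void *buf = vmalloc(16 * PAGE_SIZE);

                if (!buf)
                        return;

                /* ... use buf ... */

                vfree(buf);     /* unmap + (possibly deferred) kernel TLB flush */
        }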
* Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush 2025-08-14 4:48 ` Ethan Zhao @ 2025-08-15 7:48 ` Baolu Lu 0 siblings, 0 replies; 51+ messages in thread From: Baolu Lu @ 2025-08-15 7:48 UTC (permalink / raw) To: Ethan Zhao, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai Cc: baolu.lu, iommu, security, linux-kernel, stable On 8/14/2025 12:48 PM, Ethan Zhao wrote: > > > On 8/6/2025 1:25 PM, Lu Baolu wrote: >> In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware >> shares and walks the CPU's page tables. The Linux x86 architecture maps >> the kernel address space into the upper portion of every process’s page >> table. Consequently, in an SVA context, the IOMMU hardware can walk and >> cache kernel space mappings. However, the Linux kernel currently lacks >> a notification mechanism for kernel space mapping changes. This means >> the IOMMU driver is not aware of such changes, leading to a break in >> IOMMU cache coherence. >> >> Modern IOMMUs often cache page table entries of the intermediate-level >> page table as long as the entry is valid, no matter the permissions, to >> optimize walk performance. Currently the iommu driver is notified only >> for changes of user VA mappings, so the IOMMU's internal caches may >> retain stale entries for kernel VA. When kernel page table mappings are >> changed (e.g., by vfree()), but the IOMMU's internal caches retain stale >> entries, Use-After-Free (UAF) vulnerability condition arises. >> >> If these freed page table pages are reallocated for a different purpose, >> potentially by an attacker, the IOMMU could misinterpret the new data as >> valid page table entries. This allows the IOMMU to walk into attacker- >> controlled memory, leading to arbitrary physical memory DMA access or >> privilege escalation. >> >> To mitigate this, introduce a new iommu interface to flush IOMMU caches. >> This interface should be invoked from architecture-specific code that >> manages combined user and kernel page tables, whenever a kernel page >> table >> update is done and the CPU TLB needs to be flushed. >> >> Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices") >> Cc: stable@vger.kernel.org >> Suggested-by: Jann Horn <jannh@google.com> >> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> >> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> >> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> >> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com> >> Tested-by: Yi Lai <yi1.lai@intel.com> >> --- >> arch/x86/mm/tlb.c | 4 +++ >> drivers/iommu/iommu-sva.c | 60 ++++++++++++++++++++++++++++++++++++++- >> include/linux/iommu.h | 4 +++ >> 3 files changed, 67 insertions(+), 1 deletion(-) >> >> Change log: >> v3: >> - iommu_sva_mms is an unbound list; iterating it in an atomic context >> could introduce significant latency issues. Schedule it in a kernel >> thread and replace the spinlock with a mutex. >> - Replace the static key with a normal bool; it can be brought back if >> data shows the benefit. >> - Invalidate KVA range in the flush_tlb_all() paths. >> - All previous reviewed-bys are preserved. Please let me know if there >> are any objections. 
>> >> v2: >> - https://lore.kernel.org/linux-iommu/20250709062800.651521-1- >> baolu.lu@linux.intel.com/ >> - Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range); >> - Replace the mutex with a spinlock to make the interface usable in the >> critical regions. >> >> v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1- >> baolu.lu@linux.intel.com/ >> >> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c >> index 39f80111e6f1..3b85e7d3ba44 100644 >> --- a/arch/x86/mm/tlb.c >> +++ b/arch/x86/mm/tlb.c >> @@ -12,6 +12,7 @@ >> #include <linux/task_work.h> >> #include <linux/mmu_notifier.h> >> #include <linux/mmu_context.h> >> +#include <linux/iommu.h> >> #include <asm/tlbflush.h> >> #include <asm/mmu_context.h> >> @@ -1478,6 +1479,8 @@ void flush_tlb_all(void) >> else >> /* Fall back to the IPI-based invalidation. */ >> on_each_cpu(do_flush_tlb_all, NULL, 1); >> + >> + iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL); > Establishing such a simple one-to-one connection between CPU TLB flush > and IOMMU TLB flush is debatable. At the very least, not every process > is attached to an IOMMU SVA domain. Currently, devices and IOMMU > operating in scalable mode are not commonly applied to every process. You're right. As discussed, I'll defer the IOTLB invalidation to a scheduled kernel work in pte_free_kernel(). The sva_invalidation_kva_range() function on the IOMMU side is actually a no-op if there's no SVA domain in use. Thanks, baolu ^ permalink raw reply [flat|nested] 51+ messages in thread
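[The pte_free_kernel() direction is only stated in this reply, not implemented in the posting, so the following is purely a hypothetical sketch of what deferring the flush to the page-table-free path could look like. The structure and function names are invented for illustration; only iommu_sva_invalidate_kva_range() comes from the patch, and the ordering caveat in the comments is part of the assumption.]

        #include <linux/gfp.h>
        #include <linux/iommu.h>
        #include <linux/mm.h>
        #include <linux/slab.h>
        #include <linux/workqueue.h>

        /*
         * Hypothetical sketch, not the posted patch: hold a freed kernel PTE
         * page until the IOMMU caches have been invalidated, instead of
         * hooking every CPU TLB flush.
         */
        struct deferred_kernel_pte_free {
                struct work_struct work;
                struct page *page;              /* page-table page pending release */
                unsigned long start, end;       /* kernel VA range it used to map */
        };

        static void deferred_kernel_pte_free_fn(struct work_struct *work)
        {
                struct deferred_kernel_pte_free *d =
                        container_of(work, struct deferred_kernel_pte_free, work);

                /*
                 * Invalidate stale kernel-VA paging-structure caches before
                 * the page can be reused.  Note the v3 helper itself defers to
                 * another work item, so a real implementation would need a
                 * synchronous variant (or to walk the SVA mm list here) to
                 * guarantee the flush completes before the free below.
                 */
                iommu_sva_invalidate_kva_range(d->start, d->end);

                /*
                 * __free_page() stands in for whatever release path the
                 * architecture's pgalloc code actually uses for PTE pages.
                 */
                __free_page(d->page);
                kfree(d);
        }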