* [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space
@ 2025-09-05 5:50 Lu Baolu
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
` (7 more replies)
0 siblings, 8 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:50 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Lu Baolu
This series proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation
or data corruption.
This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.
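The ordering the series enforces can be sketched as a small standalone C model. This is illustrative only; the names are stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Standalone model of the ordering this series enforces for kernel
 * page-table pages: queue the page, invalidate the IOMMU's cached
 * kernel translations, and only then return the page to the
 * allocator. All names are illustrative stand-ins, not kernel code.
 */
enum step { STEP_QUEUED = 1, STEP_IOMMU_FLUSHED, STEP_FREED };

static enum step trace[8];
static int trace_len;

static void record(enum step s)
{
	trace[trace_len++] = s;
}

/* Called from the free path for kernel page tables. */
static void model_free_async(void)
{
	record(STEP_QUEUED);
}

/* Runs later from a workqueue, outside any non-sleepable path. */
static void model_deferred_work(void)
{
	record(STEP_IOMMU_FLUSHED);	/* notify the IOMMU first */
	record(STEP_FREED);		/* then free the page */
}
```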
Change log:
v4:
- Introduce a mechanism to defer the freeing of page-table pages for
KVA mappings. Call iommu_sva_invalidate_kva_range() in the deferred
work thread before freeing the pages.
v3:
- https://lore.kernel.org/linux-iommu/20250806052505.3113108-1-baolu.lu@linux.intel.com/
- iommu_sva_mms is an unbounded list; iterating it in an atomic context
could introduce significant latency issues. Schedule it in a kernel
thread and replace the spinlock with a mutex.
- Replace the static key with a normal bool; it can be brought back if
data shows the benefit.
- Invalidate KVA range in the flush_tlb_all() paths.
- All previous reviewed-bys are preserved. Please let me know if there
are any objections.
v2:
- https://lore.kernel.org/linux-iommu/20250709062800.651521-1-baolu.lu@linux.intel.com/
- Remove EXPORT_SYMBOL_GPL(iommu_sva_invalidate_kva_range);
- Replace the mutex with a spinlock to make the interface usable in the
critical regions.
v1: https://lore.kernel.org/linux-iommu/20250704133056.4023816-1-baolu.lu@linux.intel.com/
Dave Hansen (6):
mm: Add a ptdesc flag to mark kernel page tables
mm: Actually mark kernel page table pages
x86/mm: Use 'ptdesc' when freeing PMD pages
mm: Introduce pure page table freeing function
mm: Introduce deferred freeing for kernel page tables
mm: Hook up Kconfig options for async page table freeing
Lu Baolu (2):
x86/mm: Use pagetable_free()
iommu/sva: Invalidate stale IOTLB entries for kernel address space
arch/x86/Kconfig | 1 +
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 12 ++++-----
drivers/iommu/iommu-sva.c | 29 +++++++++++++++++++-
include/asm-generic/pgalloc.h | 18 +++++++++++++
include/linux/iommu.h | 4 +++
include/linux/mm.h | 24 ++++++++++++++---
include/linux/page-flags.h | 51 +++++++++++++++++++++++++++++++++++
mm/Kconfig | 3 +++
mm/pgtable-generic.c | 41 ++++++++++++++++++++++++++++
11 files changed, 175 insertions(+), 12 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
@ 2025-09-05 5:50 ` Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:58 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 2/8] mm: Actually mark kernel page table pages Lu Baolu
` (6 subsequent siblings)
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:50 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The page tables used to map the kernel and userspace often have very
different handling rules. There are frequently *_kernel() variants of
functions just for kernel page tables. That's not great and has led
to code duplication.
Instead of having completely separate call paths, allow a 'ptdesc' to
be marked as being for kernel mappings. Introduce helpers to set and
clear this status.
Note: this uses the PG_referenced bit. Page flags are a great fit for
this since it is truly a single bit of information. Use PG_referenced
itself because it's a fairly benign flag (as opposed to things like
PG_locked). It's also (according to Willy) unlikely to go away any time
soon.
PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
be cleared before freeing the page, and pages coming out of the
allocator should have it cleared. Regardless, introduce an API to
clear it anyway. Having symmetry in the API makes it easier to change
the underlying implementation later, like if there was a need to move
to a PAGE_FLAGS_CHECK_AT_FREE bit.
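The intended behavior of the three helpers can be sketched as a userspace model; this is not the kernel implementation (the real helpers operate on the folio's PG_referenced flag), just an illustration of the API's semantics:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the proposed API: one flag bit on a page-table
 * descriptor marks it as mapping the kernel. The bit stands in for
 * PG_referenced; 'struct ptdesc_model' stands in for struct ptdesc.
 */
struct ptdesc_model {
	unsigned long flags;
};

#define PT_KERNEL	0x1UL	/* models PG_referenced */

static void model_set_kernel(struct ptdesc_model *pt)
{
	pt->flags |= PT_KERNEL;
}

static void model_clear_kernel(struct ptdesc_model *pt)
{
	pt->flags &= ~PT_KERNEL;
}

static bool model_test_kernel(const struct ptdesc_model *pt)
{
	return pt->flags & PT_KERNEL;
}
```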
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
include/linux/page-flags.h | 51 ++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 8d3fa3a91ce4..d1a9ee0a5337 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -1244,6 +1244,57 @@ static inline int folio_has_private(const struct folio *folio)
return !!(folio->flags & PAGE_FLAGS_PRIVATE);
}
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ folio_set_referenced(folio);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ /*
+ * Note: the 'PG_referenced' bit does not strictly need to be
+ * cleared before freeing the page. But this is nice for
+ * symmetry.
+ */
+ folio_clear_referenced(folio);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc is used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
+{
+ struct folio *folio = ptdesc_folio(ptdesc);
+
+ /*
+ * Note: this tests the same 'PG_referenced' bit that
+ * ptdesc_set_kernel() sets, mirroring the set/clear
+ * helpers above.
+ */
+ return folio_test_referenced(folio);
+}
+
#undef PF_ANY
#undef PF_HEAD
#undef PF_NO_TAIL
--
2.43.0
* [PATCH v4 2/8] mm: Actually mark kernel page table pages
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-09-05 5:50 ` Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:59 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
` (5 subsequent siblings)
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:50 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
Now that the API is in place, mark kernel page table pages just
after they are allocated. Unmark them just before they are freed.
Note: Unconditionally clearing the 'kernel' marking (via
ptdesc_clear_kernel()) would be functionally identical to what
is here. But having the if() makes it logically clear that this
function can be used for kernel and non-kernel page tables.
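The allocation-side logic can be sketched in isolation. This is a hedged userspace model of the pattern the diff below adds to the generic allocators, with illustrative names only:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch of the allocation-side marking from this patch: page tables
 * allocated for the kernel's init_mm get the 'kernel' flag, others do
 * not. Userspace model only; names are stand-ins for kernel symbols.
 */
struct mm_model { int dummy; };
struct pt_model { int kernel; };

static struct mm_model model_init_mm;	/* stands in for init_mm */

static struct pt_model *model_pmd_alloc(struct mm_model *mm)
{
	struct pt_model *pt = calloc(1, sizeof(*pt));

	if (!pt)
		return NULL;
	if (mm == &model_init_mm)	/* mirrors: if (mm == &init_mm) */
		pt->kernel = 1;		/* ptdesc_set_kernel() */
	return pt;
}
```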
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
include/linux/mm.h | 3 +++
2 files changed, 21 insertions(+)
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..b9d2a7c79b93 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
return NULL;
}
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pte_alloc_one_kernel(...) alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
pagetable_free(ptdesc);
return NULL;
}
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define pmd_alloc_one(...) alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
@@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_pud_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pud_alloc_one(...) alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_p4d_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __p4d_alloc_one(...) alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
return NULL;
pagetable_pgd_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pgd_alloc(...) alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..f3db3a5ebefe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2895,6 +2895,9 @@ static inline void pagetable_free(struct ptdesc *pt)
{
struct page *page = ptdesc_page(pt);
+ if (ptdesc_test_kernel(pt))
+ ptdesc_clear_kernel(pt);
+
__free_pages(page, compound_order(page));
}
--
2.43.0
* [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-09-05 5:50 ` [PATCH v4 2/8] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-09-05 5:50 ` Lu Baolu
2025-09-05 18:25 ` Jason Gunthorpe
2025-09-12 8:03 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 4/8] mm: Introduce pure page table freeing function Lu Baolu
` (4 subsequent siblings)
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:50 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
There are a billion ways to refer to a physical memory address.
One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
point to a PMD page and then calls a PTE-specific freeing function for
it. That's a bit wonky.
Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
to page table pages. It also means being able to remove an explicit
cast.
Right now, pte_free_kernel() is a one-liner that calls
pagetable_dtor_free(). Effectively, all this patch does is
remove one superfluous __pa(__va(paddr)) conversion and then
call pagetable_dtor_free() directly instead of through a helper.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
arch/x86/mm/pgtable.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..2e5ecfdce73c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
int pud_free_pmd_page(pud_t *pud, unsigned long addr)
{
pmd_t *pmd, *pmd_sv;
- pte_t *pte;
+ struct ptdesc *pt;
int i;
pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
for (i = 0; i < PTRS_PER_PMD; i++) {
if (!pmd_none(pmd_sv[i])) {
- pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
- pte_free_kernel(&init_mm, pte);
+ pt = page_ptdesc(pmd_page(pmd_sv[i]));
+ pagetable_dtor_free(pt);
}
}
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
*/
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
- pte_t *pte;
+ struct ptdesc *pt;
- pte = (pte_t *)pmd_page_vaddr(*pmd);
+ pt = page_ptdesc(pmd_page(*pmd));
pmd_clear(pmd);
/* INVLPG to clear all paging-structure caches */
flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
- pte_free_kernel(&init_mm, pte);
+ pagetable_dtor_free(pt);
return 1;
}
--
2.43.0
* [PATCH v4 4/8] mm: Introduce pure page table freeing function
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (2 preceding siblings ...)
2025-09-05 5:50 ` [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-09-05 5:50 ` Lu Baolu
2025-09-05 18:31 ` Jason Gunthorpe
2025-09-12 8:04 ` Tian, Kevin
2025-09-05 5:51 ` [PATCH v4 5/8] x86/mm: Use pagetable_free() Lu Baolu
` (3 subsequent siblings)
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:50 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The pages used for ptdescs are currently freed back to the allocator
in a single location. They will shortly be freed from a second
location.
Create a simple helper that just frees them back to the allocator.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
include/linux/mm.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3db3a5ebefe..668d519edc0f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2884,6 +2884,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
}
#define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
+static inline void __pagetable_free(struct ptdesc *pt)
+{
+ struct page *page = ptdesc_page(pt);
+
+ __free_pages(page, compound_order(page));
+}
+
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -2893,12 +2900,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- struct page *page = ptdesc_page(pt);
-
if (ptdesc_test_kernel(pt))
ptdesc_clear_kernel(pt);
- __free_pages(page, compound_order(page));
+ __pagetable_free(pt);
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
--
2.43.0
* [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (3 preceding siblings ...)
2025-09-05 5:50 ` [PATCH v4 4/8] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-09-05 5:51 ` Lu Baolu
2025-09-05 18:41 ` Jason Gunthorpe
2025-09-05 5:51 ` [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
` (2 subsequent siblings)
7 siblings, 1 reply; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:51 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Lu Baolu
The kernel's memory management subsystem provides a dedicated interface,
pagetable_free(), for freeing page table pages. Update two call sites to
use pagetable_free() instead of the lower-level __free_page() or
free_pages(). This improves code consistency and clarity, and ensures the
correct freeing mechanism is used.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 76e33bd7c556..86b4297c1984 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
free_reserved_pages(page, nr_pages);
#endif
} else {
- free_pages((unsigned long)page_address(page), order);
+ pagetable_free(page_ptdesc(page));
}
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8834c76f91c9..8b78a8855024 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -438,7 +438,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
list_del(&ptdesc->pt_list);
- __free_page(ptdesc_page(ptdesc));
+ pagetable_free(ptdesc);
}
}
--
2.43.0
* [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (4 preceding siblings ...)
2025-09-05 5:51 ` [PATCH v4 5/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-09-05 5:51 ` Lu Baolu
2025-09-05 18:43 ` Jason Gunthorpe
2025-09-12 8:14 ` Tian, Kevin
2025-09-05 5:51 ` [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
2025-09-05 5:51 ` [PATCH v4 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:51 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
On x86 and other architectures that map the kernel's virtual address space
into the upper portion of every process's page table, the IOMMU's paging
structure caches can become stale when the CPU page table is shared with
IOMMU in the Shared Virtual Address (SVA) context. This occurs when a page
used for the kernel's page tables is freed and reused without the IOMMU
being notified.
While the IOMMU driver is notified of changes to user virtual address
mappings, there is no similar notification mechanism for kernel page
table changes. This can lead to data corruption or system instability
when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal
caches may retain stale entries for kernel virtual addresses.
Introduce a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the freeing
of pages that are used as page tables for kernel address mappings. These
pages are now queued to a work struct instead of being freed immediately.
This deferred freeing provides a safe context for a future patch to add
an IOMMU-specific callback, which might be expensive on large-scale
systems. This ensures the necessary IOMMU cache invalidation is performed
before the page is finally returned to the page allocator outside of any
critical, non-sleepable path.
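The queue-and-drain pattern can be sketched in isolation (no locking shown; the actual patch protects the list with a spinlock and kicks the worker via schedule_work(), and the IOMMU callback only arrives in the last patch of the series):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the deferred-free queue: producers push freed
 * kernel page-table descriptors onto a global list; a worker later
 * detaches the whole list and frees each entry. Illustrative only --
 * the real code uses list_head, a spinlock, and a work_struct.
 */
struct pt_node {
	struct pt_node *next;
	int freed;
};

static struct pt_node *model_pending;

static void model_pagetable_free_async(struct pt_node *pt)
{
	pt->next = model_pending;	/* list_add() under the lock */
	model_pending = pt;
}

static int model_work_func(void)
{
	struct pt_node *list = model_pending;	/* list_splice_tail_init() */
	int freed = 0;

	model_pending = NULL;
	while (list) {
		struct pt_node *next = list->next;

		list->freed = 1;	/* __pagetable_free() */
		freed++;
		list = next;
	}
	return freed;
}
```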
In the current kernel, some page table pages are allocated with an
associated struct ptdesc, while others are not. Those without a ptdesc are
freed using free_pages() and its variants, which bypasses the destructor
that pagetable_dtor_free() would run. While the long-term plan is to
convert all page table pages to use struct ptdesc, this series uses a
temporary flag within ptdesc to indicate whether a page needs a destructor,
since the immediate goal is to fix a potential security issue in IOMMU SVA.
The flag and its associated logic can be removed once the conversion is
complete.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
include/linux/mm.h | 16 +++++++++++++---
mm/pgtable-generic.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 668d519edc0f..6d10715ecbb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
__free_pages(page, compound_order(page));
}
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+void pagetable_free_async(struct ptdesc *pt);
+#else
+static inline void pagetable_free_async(struct ptdesc *pt)
+{
+ __pagetable_free(pt);
+}
+#endif
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -2900,10 +2908,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- if (ptdesc_test_kernel(pt))
+ if (ptdesc_test_kernel(pt)) {
ptdesc_clear_kernel(pt);
-
- __pagetable_free(pt);
+ pagetable_free_async(pt);
+ } else {
+ __pagetable_free(pt);
+ }
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..cf884e8300ca 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,42 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(pte, ptl);
goto again;
}
+
+#ifdef CONFIG_ASYNC_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+ struct list_head list;
+ /* protect above ptdesc lists */
+ spinlock_t lock;
+ struct work_struct work;
+} kernel_pgtable_work = {
+ .list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+ .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+ .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+ struct ptdesc *pt, *next;
+ LIST_HEAD(page_list);
+
+ spin_lock(&kernel_pgtable_work.lock);
+ list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ list_for_each_entry_safe(pt, next, &page_list, pt_list) {
+ list_del(&pt->pt_list);
+ __pagetable_free(pt);
+ }
+}
+
+void pagetable_free_async(struct ptdesc *pt)
+{
+ spin_lock(&kernel_pgtable_work.lock);
+ list_add(&pt->pt_list, &kernel_pgtable_work.list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ schedule_work(&kernel_pgtable_work.work);
+}
+#endif
--
2.43.0
* [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (5 preceding siblings ...)
2025-09-05 5:51 ` [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-09-05 5:51 ` Lu Baolu
2025-09-05 18:44 ` Jason Gunthorpe
2025-09-12 8:19 ` Tian, Kevin
2025-09-05 5:51 ` [PATCH v4 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
7 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:51 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Dave Hansen, Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The CONFIG_ASYNC_PGTABLE_FREE option controls whether an architecture
requires asynchronous page table freeing. On x86, this is selected if
IOMMU_SVA is enabled, because both Intel and AMD IOMMU architectures
could potentially cache kernel page table entries in their paging
structure caches, regardless of permissions.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 3 +++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 58d890fe2100..1b2326d81681 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -281,6 +281,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select ASYNC_PGTABLE_FREE if IOMMU_SVA
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..1576409cec03 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -920,6 +920,9 @@ config PAGE_MAPCOUNT
config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+config ASYNC_PGTABLE_FREE
+ def_bool n
+
# TODO: Allow to be enabled without THP
config ARCH_SUPPORTS_HUGE_PFNMAP
def_bool n
--
2.43.0
* [PATCH v4 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
` (6 preceding siblings ...)
2025-09-05 5:51 ` [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
@ 2025-09-05 5:51 ` Lu Baolu
7 siblings, 0 replies; 31+ messages in thread
From: Lu Baolu @ 2025-09-05 5:51 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai
Cc: iommu, security, linux-kernel, Lu Baolu, stable
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a Write-After-Free
issue, as the IOMMU will potentially continue to write Accessed and Dirty
bits to the freed memory while attempting to walk the stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.
To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically when a kernel page table update requires a CPU TLB flush.
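The shape of the new interface can be sketched as a userspace model: walk every mm with an active SVA binding and ask its IOMMU to drop cached kernel translations. A fixed array stands in for the mutex-protected iommu_sva_mms list; names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of iommu_sva_invalidate_kva_range(): iterate all mms that
 * currently have an SVA binding and issue a secondary-TLB flush for
 * each. Userspace model only; the real code walks a list under
 * iommu_sva_lock and calls
 * mmu_notifier_arch_invalidate_secondary_tlbs().
 */
#define MAX_SVA_MMS 4

struct sva_mm_model {
	int bound;		/* has an active SVA binding */
	int invalidations;	/* secondary-TLB flushes received */
};

static struct sva_mm_model model_mms[MAX_SVA_MMS];

static void model_invalidate_kva_range(void)
{
	int i;

	for (i = 0; i < MAX_SVA_MMS; i++) {
		if (model_mms[i].bound)		/* list_for_each_entry() */
			model_mms[i].invalidations++;
	}
}
```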
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
drivers/iommu/iommu-sva.c | 29 ++++++++++++++++++++++++++++-
include/linux/iommu.h | 4 ++++
mm/pgtable-generic.c | 2 ++
3 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d236aef80a8d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = true;
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = false;
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +327,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct iommu_mm_data *iommu_mm;
+
+ guard(mutex)(&iommu_sva_lock);
+ if (!iommu_sva_present)
+ return;
+
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index cf884e8300ca..4c81ca8b196f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mm_inline.h>
+#include <linux/iommu.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
@@ -430,6 +431,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
spin_unlock(&kernel_pgtable_work.lock);
+ iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
list_for_each_entry_safe(pt, next, &page_list, pt_list) {
list_del(&pt->pt_list);
__pagetable_free(pt);
--
2.43.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:58 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:24 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:50:56PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The page tables used to map the kernel and userspace often have very
> different handling rules. There are frequently *_kernel() variants of
> functions just for kernel page tables. That's not great and has led
> to code duplication.
>
> Instead of having completely separate call paths, allow a 'ptdesc' to
> be marked as being for kernel mappings. Introduce helpers to set and
> clear this status.
>
> Note: this uses the PG_referenced bit. Page flags are a great fit for
> this since it is truly a single bit of information. Use PG_referenced
> itself because it's a fairly benign flag (as opposed to things like
> PG_lock). It's also (according to Willy) unlikely to go away any time
> soon.
>
> PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
> be cleared before freeing the page, and pages coming out of the
> allocator should have it cleared. Regardless, introduce an API to
> clear it anyway. Having symmetry in the API makes it easier to change
> the underlying implementation later, like if there was a need to move
> to a PAGE_FLAGS_CHECK_AT_FREE bit.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> include/linux/page-flags.h | 51 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 51 insertions(+)
Seems like a good idea
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> +/**
> + * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
> + * @ptdesc: The ptdesc being tested
> + *
> + * Call to tell if the ptdesc is used to map the kernel.
> + */
> +static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
> +{
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + /*
> + * Note: the 'PG_referenced' bit does not strictly need to be
> + * tested before freeing the page. But this is nice for
> + * symmetry.
> + */
> + return folio_test_referenced(folio);
Wrong comment, it was copied from the prior function
Jason
* Re: [PATCH v4 2/8] mm: Actually mark kernel page table pages
2025-09-05 5:50 ` [PATCH v4 2/8] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:59 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:24 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:50:57PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Now that the API is in place, mark kernel page table pages just
> after they are allocated. Unmark them just before they are freed.
>
> Note: Unconditionally clearing the 'kernel' marking (via
> ptdesc_clear_kernel()) would be functionally identical to what
> is here. But having the if() makes it logically clear that this
> function can be used for kernel and non-kernel page tables.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
> include/linux/mm.h | 3 +++
> 2 files changed, 21 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-09-05 5:50 ` [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-09-05 18:25 ` Jason Gunthorpe
2025-09-12 8:03 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:25 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:50:58PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> There are a billion ways to refer to a physical memory address.
> One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
> point to a PMD page and then call a PTE-specific freeing function for
> it. That's a bit wonky.
>
> Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
> to page table pages. It also means being able to remove an explicit
> cast.
>
> Right now, pte_free_kernel() is a one-liner that calls
> pagetable_dtor_free(). Effectively, all this patch does is
> remove one superfluous __pa(__va(paddr)) conversion and then
> call pagetable_dtor_free() directly instead of through a helper.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> arch/x86/mm/pgtable.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
Much better
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v4 4/8] mm: Introduce pure page table freeing function
2025-09-05 5:50 ` [PATCH v4 4/8] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-09-05 18:31 ` Jason Gunthorpe
2025-09-12 8:04 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:31 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:50:59PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The pages used for ptdescs are currently freed back to the allocator
> in a single location. They will shortly be freed from a second
> location.
>
> Create a simple helper that just frees them back to the allocator.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> include/linux/mm.h | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Having the only way to free pages take in the struct pdesc is the
right thing..
Jason
* Re: [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 5:51 ` [PATCH v4 5/8] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-09-05 18:41 ` Jason Gunthorpe
2025-09-05 19:22 ` Dave Hansen
2025-09-05 20:11 ` Dave Hansen
0 siblings, 2 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:41 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel
On Fri, Sep 05, 2025 at 01:51:00PM +0800, Lu Baolu wrote:
> The kernel's memory management subsystem provides a dedicated interface,
> pagetable_free(), for freeing page table pages. Update two call sites to
> use pagetable_free() instead of the lower-level __free_page() or
> free_pages(). This improves code consistency and clarity, and ensures the
> correct freeing mechanism is used.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 76e33bd7c556..86b4297c1984 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> free_reserved_pages(page, nr_pages);
> #endif
> } else {
> - free_pages((unsigned long)page_address(page), order);
> + pagetable_free(page_ptdesc(page));
> }
> }
Er.. So if bootmem happens to be under the table and we happen to free
it due to memory hotplug we don't go through the SVA fixing path?
Seems wrong??
Bootmem still has a ptdesc and a usable linked list, right?
So maybe this should be redone into an arch hook in/around
__pagetable_free() and all the above frees just use the normal ptdesc
free path, including the SVA work queue?
Jason
* Re: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-05 5:51 ` [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-09-05 18:43 ` Jason Gunthorpe
2025-09-05 19:26 ` Dave Hansen
2025-09-12 8:17 ` Tian, Kevin
2025-09-12 8:14 ` Tian, Kevin
1 sibling, 2 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:43 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:51:01PM +0800, Lu Baolu wrote:
> +#ifdef CONFIG_ASYNC_PGTABLE_FREE
> +void pagetable_free_async(struct ptdesc *pt);
> +#else
> +static inline void pagetable_free_async(struct ptdesc *pt)
> +{
> + __pagetable_free(pt);
> +}
> +#endif
I'd probably call this function pagetable_free_kernel() ? Weird to
call it async when it isn't async..
ptdesc_clear_kernel()?
> + list_for_each_entry_safe(pt, next, &page_list, pt_list) {
> + list_del(&pt->pt_list);
The list_del isn't necessary, it doesn't zero the list, just _safe
iteration is fine.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing
2025-09-05 5:51 ` [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
@ 2025-09-05 18:44 ` Jason Gunthorpe
2025-09-12 8:19 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 18:44 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Dave Hansen, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel, Dave Hansen
On Fri, Sep 05, 2025 at 01:51:02PM +0800, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The CONFIG_ASYNC_PGTABLE_FREE option controls whether an architecture
> requires asynchronous page table freeing. On x86, this is selected if
> IOMMU_SVA is enabled, because both Intel and AMD IOMMU architectures
> could potentially cache kernel page table entries in their paging
> structure cache, regardless of permissions.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> arch/x86/Kconfig | 1 +
> mm/Kconfig | 3 +++
> 2 files changed, 4 insertions(+)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 18:41 ` Jason Gunthorpe
@ 2025-09-05 19:22 ` Dave Hansen
2025-09-05 20:11 ` Dave Hansen
1 sibling, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2025-09-05 19:22 UTC (permalink / raw)
To: Jason Gunthorpe, Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security,
linux-kernel
On 9/5/25 11:41, Jason Gunthorpe wrote:
...
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 76e33bd7c556..86b4297c1984 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>> free_reserved_pages(page, nr_pages);
>> #endif
>> } else {
>> - free_pages((unsigned long)page_address(page), order);
>> + pagetable_free(page_ptdesc(page));
>> }
>> }
>
> Er.. So if bootmem happens to be under the table and we happen to free
> it due to memory hotplug we don't go through the SVA fixing path?
>
> Seems wrong??
Yes, it does.
> Bootmem still has a ptdesc and a usable linked list, right?
Yep, I think so.
> So maybe this should be redone into an arch hook in/around
> __pagetable_free() and all the above frees just use the normal ptdesc
> free path, including the SVA work queue?
There are reserved pages and the put_page_bootmem() ones too. The
put_page_bootmem() ones are especially annoying because they can't
always be freed.
put_page_bootmem() only has three call sites. Maybe we should just make
it return if the caller *should* free the page instead of actually
freeing it.
Then we get something like:
free_pagetable()
{
if (PageReserved(page)) {
if (is_bootmem(page) && put_page_bootmem(page)) {
// refcount didn't drop to 0, can't free
return;
}
unreserve_page(page);
}
pagetable_free(page_ptdesc(page));
}
Which doesn't seem too bad.
* Re: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-05 18:43 ` Jason Gunthorpe
@ 2025-09-05 19:26 ` Dave Hansen
2025-09-12 8:17 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2025-09-05 19:26 UTC (permalink / raw)
To: Jason Gunthorpe, Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security,
linux-kernel, Dave Hansen
On 9/5/25 11:43, Jason Gunthorpe wrote:
> On Fri, Sep 05, 2025 at 01:51:01PM +0800, Lu Baolu wrote:
>> +#ifdef CONFIG_ASYNC_PGTABLE_FREE
>> +void pagetable_free_async(struct ptdesc *pt);
>> +#else
>> +static inline void pagetable_free_async(struct ptdesc *pt)
>> +{
>> + __pagetable_free(pt);
>> +}
>> +#endif
>
> I'd probably call this function pagetable_free_kernel() ? Weird to
> call it async when it isn't async..
>
> ptdesc_clear_kernel()?
Fine with me.
>> + list_for_each_entry_safe(pt, next, &page_list, pt_list) {
>> + list_del(&pt->pt_list);
>
> The list_del isn't necessary, it doesn't zero the list, just _safe
> iteration is fine.
Ahh, yes, agreed. The list_head is on the stack and going away. The
global list is now empty after the list splice.
* Re: [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 18:41 ` Jason Gunthorpe
2025-09-05 19:22 ` Dave Hansen
@ 2025-09-05 20:11 ` Dave Hansen
2025-09-05 23:04 ` Jason Gunthorpe
2025-09-19 5:31 ` Baolu Lu
1 sibling, 2 replies; 31+ messages in thread
From: Dave Hansen @ 2025-09-05 20:11 UTC (permalink / raw)
To: Jason Gunthorpe, Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security,
linux-kernel
On 9/5/25 11:41, Jason Gunthorpe wrote:
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>> free_reserved_pages(page, nr_pages);
>> #endif
>> } else {
>> - free_pages((unsigned long)page_address(page), order);
>> + pagetable_free(page_ptdesc(page));
>> }
>> }
> Er.. So if bootmem happens to be under the table and we happen to free
> it due to memory hotplug we don't go through the SVA fixing path?
>
> Seems wrong??
On second thought...
Yes, freeing bootmem with no SVA fixing is wrong. It should be fixed.
Period. But, it's wrong one time for something super rare: memory unplug
of memory that was present at boot. It also can't be triggered by
unprivileged users.
As-is, this series fixes vfree(). That path is not nearly rare, can
happen an arbitrary number of times on each boot, and might even be
triggered by folks that are less than root.
So I kinda think we should just make clear that this series leaves
_some_ holes, but I do think it should go in mostly as-is.
* Re: [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 20:11 ` Dave Hansen
@ 2025-09-05 23:04 ` Jason Gunthorpe
2025-09-19 5:31 ` Baolu Lu
1 sibling, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 23:04 UTC (permalink / raw)
To: Dave Hansen
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jann Horn, Vasant Hegde, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
iommu, security, linux-kernel
On Fri, Sep 05, 2025 at 01:11:15PM -0700, Dave Hansen wrote:
> On 9/5/25 11:41, Jason Gunthorpe wrote:
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
> >> free_reserved_pages(page, nr_pages);
> >> #endif
> >> } else {
> >> - free_pages((unsigned long)page_address(page), order);
> >> + pagetable_free(page_ptdesc(page));
> >> }
> >> }
> > Er.. So if bootmem happens to be under the table and we happen to free
> > it due to memory hotplug we don't go through the SVA fixing path?
> >
> > Seems wrong??
>
> On second thought...
>
> Yes, freeing bootmem with no SVA fixing is wrong. It should be fixed.
> Period. But, it's wrong one time for something super rare: memory unplug
> of memory that was present at boot. It also can't be triggered by
> unprivileged users.
>
> As-is, this series fixes vfree(). That path is not nearly rare, can
> happen an arbitrary number of times on each boot, and might even be
> triggered by folks that are less than root.
>
> So I kinda think we should just make clear that this series leaves
> _some_ holes, but I do think it should go in mostly as-is.
That's reasonable, but also your suggested change is pretty
simple. I'd put an arch hook:
static inline void pagetable_free_kernel(struct ptdesc *pt)
{
struct page *page = ptdesc_page(pt);
ptdesc_clear_kernel(pt);
if (!arch_pagetable_free_kernel(pt))
return;
__pagetable_free(page);
}
With what you showed
Also, probably need to ensure whatever allocates the bootmem in the
first place calls ptdesc_set_kernel()..
Jason
* RE: [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
@ 2025-09-12 7:58 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 7:58 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> +/**
> + * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
> + * @ptdesc: The ptdesc to be unmarked
> + *
> + * Use when the ptdesc is no longer used to map the kernel and no longer
> + * needs special handling.
> + */
> +static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
> +{
> + struct folio *folio = ptdesc_folio(ptdesc);
> +
> + /*
> + * Note: the 'PG_referenced' bit does not strictly need to be
> + * cleared before freeing the page. But this is nice for
> + * symmetry.
> + */
> + folio_clear_referenced(folio);
the comment belongs to the caller?
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v4 2/8] mm: Actually mark kernel page table pages
2025-09-05 5:50 ` [PATCH v4 2/8] mm: Actually mark kernel page table pages Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
@ 2025-09-12 7:59 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 7:59 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Now that the API is in place, mark kernel page table pages just
> after they are allocated. Unmark them just before they are freed.
>
> Note: Unconditionally clearing the 'kernel' marking (via
> ptdesc_clear_kernel()) would be functionally identical to what
> is here. But having the if() makes it logically clear that this
> function can be used for kernel and non-kernel page tables.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-09-05 5:50 ` [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
2025-09-05 18:25 ` Jason Gunthorpe
@ 2025-09-12 8:03 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 8:03 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> There are a billion ways to refer to a physical memory address.
> One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
> point to a PMD page and then call a PTE-specific freeing function for
> it. That's a bit wonky.
>
> Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
> to page table pages. It also means being able to remove an explicit
> cast.
>
> Right now, pte_free_kernel() is a one-liner that calls
> pagetable_dtor_free(). Effectively, all this patch does is
> remove one superfluous __pa(__va(paddr)) conversion and then
> call pagetable_dtor_free() directly instead of through a helper.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v4 4/8] mm: Introduce pure page table freeing function
2025-09-05 5:50 ` [PATCH v4 4/8] mm: Introduce pure page table freeing function Lu Baolu
2025-09-05 18:31 ` Jason Gunthorpe
@ 2025-09-12 8:04 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 8:04 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The pages used for ptdescs are currently freed back to the allocator
> in a single location. They will shortly be freed from a second
> location.
>
> Create a simple helper that just frees them back to the allocator.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-05 5:51 ` [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
2025-09-05 18:43 ` Jason Gunthorpe
@ 2025-09-12 8:14 ` Tian, Kevin
2025-09-15 1:16 ` Baolu Lu
1 sibling, 1 reply; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 8:14 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> On x86 and other architectures that map the kernel's virtual address space
> into the upper portion of every process's page table, the IOMMU's paging
> structure caches can become stale when the CPU page table is shared with
> IOMMU in the Shared Virtual Address (SVA) context. This occurs when a page
> used for the kernel's page tables is freed and reused without the IOMMU
> being notified.
>
> While the IOMMU driver is notified of changes to user virtual address
> mappings, there is no similar notification mechanism for kernel page
> table changes. This can lead to data corruption or system instability
> when Shared Virtual Address (SVA) is enabled, as the IOMMU's internal
> caches may retain stale entries for kernel virtual addresses.
above could be saved to the last patch.
>
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the
> freeing
> of pages that are used as page tables for kernel address mappings. These
> pages are now queued to a work struct instead of being freed immediately.
>
> This deferred freeing provides a safe context for a future patch to add
> an IOMMU-specific callback, which might be expensive on large-scale
> systems. This ensures the necessary IOMMU cache invalidation is performed
> before the page is finally returned to the page allocator outside of any
> critical, non-sleepable path.
>
> In the current kernel, some page table pages are allocated with an
> associated struct ptdesc, while others are not. Those without a ptdesc are
> freed using free_pages() and its variants, which bypasses the destructor
> that pagetable_dtor_free() would run. While the long-term plan is to
> convert all page table pages to use struct ptdesc, this uses a temporary
> flag within ptdesc to indicate whether a page needs a destructor,
> considering that this aims to fix a potential security issue in IOMMU SVA.
> The flag and its associated logic can be removed once the conversion is
> complete.
stale comment?
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* RE: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-05 18:43 ` Jason Gunthorpe
2025-09-05 19:26 ` Dave Hansen
@ 2025-09-12 8:17 ` Tian, Kevin
2025-09-15 11:35 ` Jason Gunthorpe
1 sibling, 1 reply; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 8:17 UTC (permalink / raw)
To: Jason Gunthorpe, Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn, Vasant Hegde,
Hansen, Dave, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1,
iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, September 6, 2025 2:44 AM
>
> On Fri, Sep 05, 2025 at 01:51:01PM +0800, Lu Baolu wrote:
>
> > + list_for_each_entry_safe(pt, next, &page_list, pt_list) {
> > + list_del(&pt->pt_list);
>
> The list_del isn't necessary, it doesn't zero the list, just _safe
> iteration is fine.
>
if no list_del why need to keep the _safe iteration?
* RE: [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing
2025-09-05 5:51 ` [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
2025-09-05 18:44 ` Jason Gunthorpe
@ 2025-09-12 8:19 ` Tian, Kevin
1 sibling, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-12 8:19 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Lu Baolu <baolu.lu@linux.intel.com>
> Sent: Friday, September 5, 2025 1:51 PM
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The CONFIG_ASYNC_PGTABLE_FREE option controls whether an architecture
> requires asynchronous page table freeing. On x86, this is selected if
> IOMMU_SVA is enabled, because both Intel and AMD IOMMU architectures
> could potentially cache kernel page table entries in their paging
> structure cache, regardless of the permission.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> ---
> arch/x86/Kconfig | 1 +
> mm/Kconfig | 3 +++
> 2 files changed, 4 insertions(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 58d890fe2100..1b2326d81681 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -281,6 +281,7 @@ config X86
> select HAVE_PCI
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> + select ASYNC_PGTABLE_FREE if IOMMU_SVA
> select MMU_GATHER_RCU_TABLE_FREE
> select MMU_GATHER_MERGE_VMAS
> select HAVE_POSIX_CPU_TIMERS_TASK_WORK
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e443fe8cd6cf..1576409cec03 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -920,6 +920,9 @@ config PAGE_MAPCOUNT
> config PGTABLE_HAS_HUGE_LEAVES
> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>
> +config ASYNC_PGTABLE_FREE
> + def_bool n
> +
be more accurate as ASYNC_KERN_PGTABLE_FREE? not a strong
opinion but doing so avoids giving one the impression that it's
a generic async mechanism for all pgtable pages.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* Re: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-12 8:14 ` Tian, Kevin
@ 2025-09-15 1:16 ` Baolu Lu
0 siblings, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2025-09-15 1:16 UTC (permalink / raw)
To: Tian, Kevin, Joerg Roedel, Will Deacon, Robin Murphy,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Hansen, Dave,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Lai, Yi1
Cc: iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
On 9/12/25 16:14, Tian, Kevin wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> Sent: Friday, September 5, 2025 1:51 PM
>>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> On x86 and other architectures that map the kernel's virtual address space
>> into the upper portion of every process's page table, the IOMMU's paging
>> structure caches can become stale when the CPU page table is shared with
>> IOMMU in the Shared Virtual Address (SVA) context. This occurs when a page
>> used for the kernel's page tables is freed and reused without the IOMMU
>> being notified.
>>
>> While the IOMMU driver is notified of changes to user virtual address
>> mappings, there is no similar notification mechanism for kernel page
>> table changes. This can lead to data corruption or system instability
>> when SVA is enabled, as the IOMMU's internal
>> caches may retain stale entries for kernel virtual addresses.
>
> above could be saved to the last patch.
Yes.
>
>>
>> This introduces a conditional asynchronous mechanism, enabled by
>> CONFIG_ASYNC_PGTABLE_FREE. When enabled, this mechanism defers the
>> freeing of pages that are used as page tables for kernel address
>> mappings. These pages are now queued to a work struct instead of
>> being freed immediately.
>>
>> This deferred freeing provides a safe context for a future patch to add
>> an IOMMU-specific callback, which might be expensive on large-scale
>> systems. This ensures the necessary IOMMU cache invalidation is performed
>> before the page is finally returned to the page allocator outside of any
>> critical, non-sleepable path.
>>
>> In the current kernel, some page table pages are allocated with an
>> associated struct ptdesc, while others are not. Those without a ptdesc are
>> freed using free_pages() and its variants, which bypasses the destructor
>> that pagetable_dtor_free() would run. While the long-term plan is to
>> convert all page table pages to use struct ptdesc, this uses a temporary
>> flag within ptdesc to indicate whether a page needs a destructor,
>> considering that this aims to fix a potential security issue in IOMMU SVA.
>> The flag and its associated logic can be removed once the conversion is
>> complete.
>
> stale comment?
Yes. Fixed.
>
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Thanks,
baolu
* Re: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-12 8:17 ` Tian, Kevin
@ 2025-09-15 11:35 ` Jason Gunthorpe
2025-09-19 8:18 ` Tian, Kevin
0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 11:35 UTC (permalink / raw)
To: Tian, Kevin
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn,
Vasant Hegde, Hansen, Dave, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Lai, Yi1, iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
On Fri, Sep 12, 2025 at 08:17:23AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, September 6, 2025 2:44 AM
> >
> > On Fri, Sep 05, 2025 at 01:51:01PM +0800, Lu Baolu wrote:
> >
> > > + list_for_each_entry_safe(pt, next, &page_list, pt_list) {
> > > + list_del(&pt->pt_list);
> >
> > The list_del isn't necessary, it doesn't zero the list, just _safe
> > iteration is fine.
> >
>
> if no list_del why need to keep the _safe iteration?
__pagetable_free() can change the memory holding pt->pt_list; it is
like kfree. So _safe is needed to allow that.
_safe is not just about deletion, anything that could change the list
element must use a safe iteration.
Jason
* Re: [PATCH v4 5/8] x86/mm: Use pagetable_free()
2025-09-05 20:11 ` Dave Hansen
2025-09-05 23:04 ` Jason Gunthorpe
@ 2025-09-19 5:31 ` Baolu Lu
1 sibling, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2025-09-19 5:31 UTC (permalink / raw)
To: Dave Hansen, Jason Gunthorpe
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian, Jann Horn,
Vasant Hegde, Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, iommu, security,
linux-kernel
On 9/6/25 04:11, Dave Hansen wrote:
> On 9/5/25 11:41, Jason Gunthorpe wrote:
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -1013,7 +1013,7 @@ static void __meminit free_pagetable(struct page *page, int order)
>>> free_reserved_pages(page, nr_pages);
>>> #endif
>>> } else {
>>> - free_pages((unsigned long)page_address(page), order);
>>> + pagetable_free(page_ptdesc(page));
>>> }
>>> }
>> Er.. So if bootmem happens to be under the table and we happen to free
>> it due to memory hotplug we don't go through the SVA fixing path?
>>
>> Seems wrong??
>
> On second thought...
>
> Yes, freeing bootmem with no SVA fixing is wrong. It should be fixed.
> Period. But, it's wrong one time for something super rare: memory unplug
> of memory that was present at boot. It also can't be triggered by
> unprivileged users.
>
> As-is, this series fixes vfree(). That path is nothing like rare: it can
> happen an arbitrary number of times on each boot, and might even be
> triggered by folks that are less than root.
>
> So I kinda think we should just make clear that this series leaves
> _some_ holes, but I do think it should go in mostly as-is.
>
Okay, I will put this in the commit message and update this series with
a new version.
Thanks,
baolu
* RE: [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables
2025-09-15 11:35 ` Jason Gunthorpe
@ 2025-09-19 8:18 ` Tian, Kevin
0 siblings, 0 replies; 31+ messages in thread
From: Tian, Kevin @ 2025-09-19 8:18 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Jann Horn,
Vasant Hegde, Hansen, Dave, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski,
Lai, Yi1, iommu@lists.linux.dev, security@kernel.org,
linux-kernel@vger.kernel.org, Dave Hansen
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, September 15, 2025 7:35 PM
>
> On Fri, Sep 12, 2025 at 08:17:23AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Saturday, September 6, 2025 2:44 AM
> > >
> > > On Fri, Sep 05, 2025 at 01:51:01PM +0800, Lu Baolu wrote:
> > >
> > > > + list_for_each_entry_safe(pt, next, &page_list, pt_list) {
> > > > + list_del(&pt->pt_list);
> > >
> > > The list_del isn't necessary, it doesn't zero the list, just _safe
> > > iteration is fine.
> > >
> >
> > if no list_del why need to keep the _safe iteration?
>
> __pagetable_free() can change the memory holding pt->pt_list; it is
> like kfree. So _safe is needed to allow that.
>
> _safe is not just about deletion, anything that could change the list
> element must use a safe iteration.
>
yeah, my oversight.
end of thread, other threads:[~2025-09-19 8:18 UTC | newest]
Thread overview: 31+ messages
-- links below jump to the message on this page --
2025-09-05 5:50 [PATCH v4 0/8] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-09-05 5:50 ` [PATCH v4 1/8] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:58 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 2/8] mm: Actually mark kernel page table pages Lu Baolu
2025-09-05 18:24 ` Jason Gunthorpe
2025-09-12 7:59 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 3/8] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
2025-09-05 18:25 ` Jason Gunthorpe
2025-09-12 8:03 ` Tian, Kevin
2025-09-05 5:50 ` [PATCH v4 4/8] mm: Introduce pure page table freeing function Lu Baolu
2025-09-05 18:31 ` Jason Gunthorpe
2025-09-12 8:04 ` Tian, Kevin
2025-09-05 5:51 ` [PATCH v4 5/8] x86/mm: Use pagetable_free() Lu Baolu
2025-09-05 18:41 ` Jason Gunthorpe
2025-09-05 19:22 ` Dave Hansen
2025-09-05 20:11 ` Dave Hansen
2025-09-05 23:04 ` Jason Gunthorpe
2025-09-19 5:31 ` Baolu Lu
2025-09-05 5:51 ` [PATCH v4 6/8] mm: Introduce deferred freeing for kernel page tables Lu Baolu
2025-09-05 18:43 ` Jason Gunthorpe
2025-09-05 19:26 ` Dave Hansen
2025-09-12 8:17 ` Tian, Kevin
2025-09-15 11:35 ` Jason Gunthorpe
2025-09-19 8:18 ` Tian, Kevin
2025-09-12 8:14 ` Tian, Kevin
2025-09-15 1:16 ` Baolu Lu
2025-09-05 5:51 ` [PATCH v4 7/8] mm: Hook up Kconfig options for async page table freeing Lu Baolu
2025-09-05 18:44 ` Jason Gunthorpe
2025-09-12 8:19 ` Tian, Kevin
2025-09-05 5:51 ` [PATCH v4 8/8] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu