* [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-16 19:26 ` David Hildenbrand
2025-10-14 13:04 ` [PATCH v6 2/7] mm: Actually mark kernel page table pages Lu Baolu
` (7 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen,
Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The page tables used to map the kernel and userspace often have very
different handling rules. There are frequently *_kernel() variants of
functions just for kernel page tables. That's not great and has led
to code duplication.
Instead of having completely separate call paths, allow a 'ptdesc' to
be marked as being for kernel mappings. Introduce helpers to set and
clear this status.
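As a rough illustration only (the real call sites come in patch 2 of this
series), a kernel page table allocation path would mark the ptdesc right
after a successful allocation, along these lines:

	/* Sketch: mark a freshly allocated kernel page table. */
	struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

	if (!ptdesc)
		return NULL;
	ptdesc_set_kernel(ptdesc);	/* this table maps kernel addresses */
	return ptdesc_address(ptdesc);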
Note: this uses the PG_referenced bit. Page flags are a great fit for
this since it is truly a single bit of information. Use PG_referenced
itself because it's a fairly benign flag (as opposed to things like
PG_locked). It's also (according to Willy) unlikely to go away any time
soon.
PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
be cleared before freeing the page, and pages coming out of the
allocator should have it cleared. Regardless, introduce an API to
clear it anyway. Having symmetry in the API makes it easier to change
the underlying implementation later, like if there was a need to move
to a PAGE_FLAGS_CHECK_AT_FREE bit.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/linux/mm.h | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..9741affc574e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2940,6 +2940,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
#endif /* CONFIG_MMU */
enum pt_flags {
+ PT_kernel = PG_referenced,
PT_reserved = PG_reserved,
/* High bits are used for zone/node/section */
};
@@ -2965,6 +2966,46 @@ static inline bool pagetable_is_reserved(struct ptdesc *pt)
return test_bit(PT_reserved, &pt->pt_flags.f);
}
+/**
+ * ptdesc_set_kernel - Mark a ptdesc used to map the kernel
+ * @ptdesc: The ptdesc to be marked
+ *
+ * Kernel page tables often need special handling. Set a flag so that
+ * the handling code knows this ptdesc will not be used for userspace.
+ */
+static inline void ptdesc_set_kernel(struct ptdesc *ptdesc)
+{
+ set_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_clear_kernel - Mark a ptdesc as no longer used to map the kernel
+ * @ptdesc: The ptdesc to be unmarked
+ *
+ * Use when the ptdesc is no longer used to map the kernel and no longer
+ * needs special handling.
+ */
+static inline void ptdesc_clear_kernel(struct ptdesc *ptdesc)
+{
+ /*
+ * Note: the 'PG_referenced' bit does not strictly need to be
+ * cleared before freeing the page. But this is nice for
+ * symmetry.
+ */
+ clear_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
+/**
+ * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
+ * @ptdesc: The ptdesc being tested
+ *
+ * Call to tell if the ptdesc is used to map the kernel.
+ */
+static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
+{
+ return test_bit(PT_kernel, &ptdesc->pt_flags.f);
+}
+
/**
* pagetable_alloc - Allocate pagetables
* @gfp: GFP flags
--
2.43.0
* Re: [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables
2025-10-14 13:04 ` [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-10-16 19:26 ` David Hildenbrand
0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:26 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 14.10.25 15:04, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The page tables used to map the kernel and userspace often have very
> different handling rules. There are frequently *_kernel() variants of
> functions just for kernel page tables. That's not great and has led
> to code duplication.
>
> Instead of having completely separate call paths, allow a 'ptdesc' to
> be marked as being for kernel mappings. Introduce helpers to set and
> clear this status.
>
> Note: this uses the PG_referenced bit. Page flags are a great fit for
> this since it is truly a single bit of information. Use PG_referenced
> itself because it's a fairly benign flag (as opposed to things like
> PG_locked). It's also (according to Willy) unlikely to go away any time
> soon.
>
> PG_referenced is not in PAGE_FLAGS_CHECK_AT_FREE. It does not need to
> be cleared before freeing the page, and pages coming out of the
> allocator should have it cleared. Regardless, introduce an API to
> clear it anyway. Having symmetry in the API makes it easier to change
> the underlying implementation later, like if there was a need to move
> to a PAGE_FLAGS_CHECK_AT_FREE bit.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[...]
> +
> +/**
> + * ptdesc_test_kernel - Check if a ptdesc is used to map the kernel
> + * @ptdesc: The ptdesc being tested
> + *
> + * Call to tell if the ptdesc is used to map the kernel.
> + */
> +static inline bool ptdesc_test_kernel(struct ptdesc *ptdesc)
We fancy const-correctness now:
static inline bool ptdesc_test_kernel(const struct ptdesc *ptdesc)
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* [PATCH v6 2/7] mm: Actually mark kernel page table pages
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-10-14 13:04 ` [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-14 13:04 ` [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
` (6 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen,
Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
Now that the API is in place, mark kernel page table pages just
after they are allocated. Unmark them just before they are freed.
Note: Unconditionally clearing the 'kernel' marking (via
ptdesc_clear_kernel()) would be functionally identical to what
is here. But having the if() makes it logically clear that this
function can be used for kernel and non-kernel page tables.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/asm-generic/pgalloc.h | 18 ++++++++++++++++++
include/linux/mm.h | 3 +++
2 files changed, 21 insertions(+)
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..b9d2a7c79b93 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -28,6 +28,8 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
return NULL;
}
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pte_alloc_one_kernel(...) alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -146,6 +148,10 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
pagetable_free(ptdesc);
return NULL;
}
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define pmd_alloc_one(...) alloc_hooks(pmd_alloc_one_noprof(__VA_ARGS__))
@@ -179,6 +185,10 @@ static inline pud_t *__pud_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_pud_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pud_alloc_one(...) alloc_hooks(__pud_alloc_one_noprof(__VA_ARGS__))
@@ -233,6 +243,10 @@ static inline p4d_t *__p4d_alloc_one_noprof(struct mm_struct *mm, unsigned long
return NULL;
pagetable_p4d_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __p4d_alloc_one(...) alloc_hooks(__p4d_alloc_one_noprof(__VA_ARGS__))
@@ -277,6 +291,10 @@ static inline pgd_t *__pgd_alloc_noprof(struct mm_struct *mm, unsigned int order
return NULL;
pagetable_pgd_ctor(ptdesc);
+
+ if (mm == &init_mm)
+ ptdesc_set_kernel(ptdesc);
+
return ptdesc_address(ptdesc);
}
#define __pgd_alloc(...) alloc_hooks(__pgd_alloc_noprof(__VA_ARGS__))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9741affc574e..15ce0c415d36 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3035,6 +3035,9 @@ static inline void pagetable_free(struct ptdesc *pt)
{
struct page *page = ptdesc_page(pt);
+ if (ptdesc_test_kernel(pt))
+ ptdesc_clear_kernel(pt);
+
__free_pages(page, compound_order(page));
}
--
2.43.0
* [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
2025-10-14 13:04 ` [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables Lu Baolu
2025-10-14 13:04 ` [PATCH v6 2/7] mm: Actually mark kernel page table pages Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-14 23:19 ` Dave Hansen
2025-10-16 19:33 ` David Hildenbrand
2025-10-14 13:04 ` [PATCH v6 4/7] mm: Introduce pure page table freeing function Lu Baolu
` (5 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen,
Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
There are a billion ways to refer to a physical memory address.
One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
point to a PMD page and then call a PTE-specific freeing function for
it. That's a bit wonky.
Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
to page table pages. It also means being able to remove an explicit
cast.
Right now, pte_free_kernel() is a one-liner that calls
pagetable_dtor_free(). Effectively, all this patch does is
remove one superfluous __pa(__va(paddr)) conversion and then
call pagetable_dtor_free() directly instead of through a helper.
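For context, the generic pte_free_kernel() is roughly this one-liner
(a sketch of the asm-generic helper, shown only to illustrate what the
indirection buys us):

	static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
	{
		pagetable_dtor_free(virt_to_ptdesc(pte));
	}

Calling pagetable_dtor_free() on the ptdesc directly avoids bouncing
through the table's virtual address just to convert it back.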
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
include/linux/mm.h | 6 ++++--
arch/x86/mm/pgtable.c | 12 ++++++------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15ce0c415d36..94e2ec6c5685 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3203,8 +3203,7 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
NULL: pte_offset_kernel(pmd, address))
-#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
-
+#if defined(CONFIG_SPLIT_PMD_PTLOCKS) || defined(CONFIG_X86_64)
static inline struct page *pmd_pgtable_page(pmd_t *pmd)
{
unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
@@ -3215,6 +3214,9 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
{
return page_ptdesc(pmd_pgtable_page(pmd));
}
+#endif
+
+#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ddf248c3ee7d..c830ccbc2fd8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -729,7 +729,7 @@ int pmd_clear_huge(pmd_t *pmd)
int pud_free_pmd_page(pud_t *pud, unsigned long addr)
{
pmd_t *pmd, *pmd_sv;
- pte_t *pte;
+ struct ptdesc *pt;
int i;
pmd = pud_pgtable(*pud);
@@ -750,8 +750,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
for (i = 0; i < PTRS_PER_PMD; i++) {
if (!pmd_none(pmd_sv[i])) {
- pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
- pte_free_kernel(&init_mm, pte);
+ pt = pmd_ptdesc(&pmd_sv[i]);
+ pagetable_dtor_free(pt);
}
}
@@ -772,15 +772,15 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
*/
int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
- pte_t *pte;
+ struct ptdesc *pt;
- pte = (pte_t *)pmd_page_vaddr(*pmd);
+ pt = pmd_ptdesc(pmd);
pmd_clear(pmd);
/* INVLPG to clear all paging-structure caches */
flush_tlb_kernel_range(addr, addr + PAGE_SIZE-1);
- pte_free_kernel(&init_mm, pte);
+ pagetable_dtor_free(pt);
return 1;
}
--
2.43.0
* Re: [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-14 13:04 ` [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-10-14 23:19 ` Dave Hansen
2025-10-15 5:19 ` Baolu Lu
2025-10-16 19:33 ` David Hildenbrand
1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2025-10-14 23:19 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/14/25 06:04, Lu Baolu wrote:
> -#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
> -
> +#if defined(CONFIG_SPLIT_PMD_PTLOCKS) || defined(CONFIG_X86_64)
What's with the #ifdef munging? It's not mentioned in the changelog.
I went looking at this because pmd_free_pte_page() is right in the
middle of the action on this reported use after free:
> https://lore.kernel.org/all/68eeb99e.050a0220.91a22.0220.GAE@google.com/
so something fishy is going on. It would be great to narrow that report
down to a _specific_ patch in the series.
* Re: [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-14 23:19 ` Dave Hansen
@ 2025-10-15 5:19 ` Baolu Lu
0 siblings, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2025-10-15 5:19 UTC (permalink / raw)
To: Dave Hansen, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/15/25 07:19, Dave Hansen wrote:
> On 10/14/25 06:04, Lu Baolu wrote:
>> -#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
>> -
>> +#if defined(CONFIG_SPLIT_PMD_PTLOCKS) || defined(CONFIG_X86_64)
>
> What's with the #ifdef munging? It's not mentioned in the changelog.
>
> I went looking at this because pmd_free_pte_page() is right in the
> middle of the action on this reported use after free:
>
>> https://lore.kernel.org/all/68eeb99e.050a0220.91a22.0220.GAE@google.com/
>
> so something fishy is going on. It would be great to narrow that report
> down to a _specific_ patch in the series.
Yes. I am trying to reproduce the issue on my local machine and then
narrow down the reported issue.
Thanks,
baolu
* Re: [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
2025-10-14 13:04 ` [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
2025-10-14 23:19 ` Dave Hansen
@ 2025-10-16 19:33 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:33 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 14.10.25 15:04, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> There are a billion ways to refer to a physical memory address.
> One of the x86 PMD freeing code locations chooses to use a 'pte_t *' to
> point to a PMD page and then call a PTE-specific freeing function for
> it. That's a bit wonky.
>
> Just use a 'struct ptdesc *' instead. Its entire purpose is to refer
> to page table pages. It also means being able to remove an explicit
> cast.
>
> Right now, pte_free_kernel() is a one-liner that calls
> pagetable_dtor_free(). Effectively, all this patch does is
> remove one superfluous __pa(__va(paddr)) conversion and then
> call pagetable_dtor_free() directly instead of through a helper.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> include/linux/mm.h | 6 ++++--
> arch/x86/mm/pgtable.c | 12 ++++++------
> 2 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 15ce0c415d36..94e2ec6c5685 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3203,8 +3203,7 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
> NULL: pte_offset_kernel(pmd, address))
>
> -#if defined(CONFIG_SPLIT_PMD_PTLOCKS)
> -
> +#if defined(CONFIG_SPLIT_PMD_PTLOCKS) || defined(CONFIG_X86_64)
Yeah, that is weird. I'd have thought we can simply move this out of
the ifdef? The CONFIG_X86_64 stuff certainly has to go one way or the other.
As PTE tables always fit in a single page, pgtable_page(pte) is sufficient.
PMD tables can exceed a single page on some archs, so we have to find
the first page first that we can then cast.
Can't immediately see why that causes compile issues.
> static inline struct page *pmd_pgtable_page(pmd_t *pmd)
> {
> unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
> @@ -3215,6 +3214,9 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
> {
> return page_ptdesc(pmd_pgtable_page(pmd));
> }
> +#endif
--
Cheers
David / dhildenb
* [PATCH v6 4/7] mm: Introduce pure page table freeing function
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (2 preceding siblings ...)
2025-10-14 13:04 ` [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-14 13:04 ` [PATCH v6 5/7] x86/mm: Use pagetable_free() Lu Baolu
` (4 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen,
Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
The pages used for ptdescs are currently freed back to the allocator
in a single location. They will shortly be freed from a second
location.
Create a simple helper that just frees them back to the allocator.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/mm.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 94e2ec6c5685..bb235a9f991e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3024,6 +3024,13 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
}
#define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
+static inline void __pagetable_free(struct ptdesc *pt)
+{
+ struct page *page = ptdesc_page(pt);
+
+ __free_pages(page, compound_order(page));
+}
+
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -3033,12 +3040,10 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- struct page *page = ptdesc_page(pt);
-
if (ptdesc_test_kernel(pt))
ptdesc_clear_kernel(pt);
- __free_pages(page, compound_order(page));
+ __pagetable_free(pt);
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
--
2.43.0
* [PATCH v6 5/7] x86/mm: Use pagetable_free()
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (3 preceding siblings ...)
2025-10-14 13:04 ` [PATCH v6 4/7] mm: Introduce pure page table freeing function Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-14 13:04 ` [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables Lu Baolu
` (3 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu
The kernel's memory management subsystem provides a dedicated interface,
pagetable_free(), for freeing page table pages. Updates two call sites to
use pagetable_free() instead of the lower-level __free_page() or
free_pages(). This improves code consistency and clarity, and ensures the
correct freeing mechanism is used.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/pat/set_memory.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0e4270e20fad..3d9a5e4ccaa4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1031,7 +1031,7 @@ static void __meminit free_pagetable(struct page *page, int order)
free_reserved_pages(page, nr_pages);
#endif
} else {
- __free_pages(page, order);
+ pagetable_free(page_ptdesc(page));
}
}
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d2d54b8c4dbb..4366e7d11afd 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -429,7 +429,7 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
list_del(&ptdesc->pt_list);
- __free_page(ptdesc_page(ptdesc));
+ pagetable_free(ptdesc);
}
}
--
2.43.0
* [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (4 preceding siblings ...)
2025-10-14 13:04 ` [PATCH v6 5/7] x86/mm: Use pagetable_free() Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-16 19:35 ` David Hildenbrand
2025-10-14 13:04 ` [PATCH v6 7/7] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
` (2 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen,
Lu Baolu
From: Dave Hansen <dave.hansen@linux.intel.com>
This introduces a conditional asynchronous mechanism, enabled by
CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
freeing of pages that are used as page tables for kernel address mappings.
These pages are now queued to a work struct instead of being freed
immediately.
This deferred freeing allows for batch-freeing of page tables, providing
a safe context for performing a single expensive operation (TLB flush)
for a batch of kernel page tables instead of performing that expensive
operation for each page table.
On x86, CONFIG_ASYNC_KERNEL_PGTABLE_FREE is selected if CONFIG_IOMMU_SVA
is enabled, because both Intel and AMD IOMMU architectures could
potentially cache kernel page table entries in their paging structure
cache, regardless of the permission.
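Putting the pieces together, the flow for a kernel page table page
becomes roughly the following (a sketch of what the code below
implements; nothing here is new API beyond this patch):

	pagetable_free(pt)
	    ptdesc_test_kernel(pt)		/* marked at allocation time */
	    ptdesc_clear_kernel(pt)
	    pagetable_free_kernel(pt)		/* queue on kernel_pgtable_work.list */
	        schedule_work(&kernel_pgtable_work.work)

	/* later, in workqueue context */
	kernel_pgtable_work_func()
	    list_splice_tail_init(...)		/* grab the whole batch under the lock */
	    __pagetable_free(pt)		/* free each queued table to the allocator */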
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 3 +++
include/linux/mm.h | 16 +++++++++++++---
mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
4 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa3b616af03a..ded29ee848fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config X86
select HAVE_PCI
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select ASYNC_KERNEL_PGTABLE_FREE if IOMMU_SVA
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..a83df9934acd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -908,6 +908,9 @@ config PAGE_MAPCOUNT
config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+config ASYNC_KERNEL_PGTABLE_FREE
+ def_bool n
+
# TODO: Allow to be enabled without THP
config ARCH_SUPPORTS_HUGE_PFNMAP
def_bool n
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb235a9f991e..fe5515725c46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3031,6 +3031,14 @@ static inline void __pagetable_free(struct ptdesc *pt)
__free_pages(page, compound_order(page));
}
+#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
+void pagetable_free_kernel(struct ptdesc *pt);
+#else
+static inline void pagetable_free_kernel(struct ptdesc *pt)
+{
+ __pagetable_free(pt);
+}
+#endif
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
@@ -3040,10 +3048,12 @@ static inline void __pagetable_free(struct ptdesc *pt)
*/
static inline void pagetable_free(struct ptdesc *pt)
{
- if (ptdesc_test_kernel(pt))
+ if (ptdesc_test_kernel(pt)) {
ptdesc_clear_kernel(pt);
-
- __pagetable_free(pt);
+ pagetable_free_kernel(pt);
+ } else {
+ __pagetable_free(pt);
+ }
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..1c7caa8ef164 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -406,3 +406,40 @@ pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
pte_unmap_unlock(pte, ptl);
goto again;
}
+
+#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
+static void kernel_pgtable_work_func(struct work_struct *work);
+
+static struct {
+ struct list_head list;
+ /* protect above ptdesc lists */
+ spinlock_t lock;
+ struct work_struct work;
+} kernel_pgtable_work = {
+ .list = LIST_HEAD_INIT(kernel_pgtable_work.list),
+ .lock = __SPIN_LOCK_UNLOCKED(kernel_pgtable_work.lock),
+ .work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
+};
+
+static void kernel_pgtable_work_func(struct work_struct *work)
+{
+ struct ptdesc *pt, *next;
+ LIST_HEAD(page_list);
+
+ spin_lock(&kernel_pgtable_work.lock);
+ list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ list_for_each_entry_safe(pt, next, &page_list, pt_list)
+ __pagetable_free(pt);
+}
+
+void pagetable_free_kernel(struct ptdesc *pt)
+{
+ spin_lock(&kernel_pgtable_work.lock);
+ list_add(&pt->pt_list, &kernel_pgtable_work.list);
+ spin_unlock(&kernel_pgtable_work.lock);
+
+ schedule_work(&kernel_pgtable_work.work);
+}
+#endif
--
2.43.0
* Re: [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables
2025-10-14 13:04 ` [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-10-16 19:35 ` David Hildenbrand
2025-10-17 1:29 ` Baolu Lu
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-10-16 19:35 UTC (permalink / raw)
To: Lu Baolu, Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, Lorenzo Stoakes, Liam R . Howlett,
Andrew Morton, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 14.10.25 15:04, Lu Baolu wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This introduces a conditional asynchronous mechanism, enabled by
> CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
> freeing of pages that are used as page tables for kernel address mappings.
> These pages are now queued to a work struct instead of being freed
> immediately.
>
> This deferred freeing allows for batch-freeing of page tables, providing
> a safe context for performing a single expensive operation (TLB flush)
> for a batch of kernel page tables instead of performing that expensive
> operation for each page table.
>
> On x86, CONFIG_ASYNC_KERNEL_PGTABLE_FREE is selected if CONFIG_IOMMU_SVA
> is enabled, because both Intel and AMD IOMMU architectures could
> potentially cache kernel page table entries in their paging structure
> cache, regardless of the permission.
See below, I assume this is patch #7 material.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> arch/x86/Kconfig | 1 +
> mm/Kconfig | 3 +++
> include/linux/mm.h | 16 +++++++++++++---
> mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 54 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index fa3b616af03a..ded29ee848fd 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -279,6 +279,7 @@ config X86
> select HAVE_PCI
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> + select ASYNC_KERNEL_PGTABLE_FREE if IOMMU_SVA
That should belong into patch #7, no?
--
Cheers
David / dhildenb
* Re: [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables
2025-10-16 19:35 ` David Hildenbrand
@ 2025-10-17 1:29 ` Baolu Lu
0 siblings, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2025-10-17 1:29 UTC (permalink / raw)
To: David Hildenbrand, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jason Gunthorpe, Jann Horn, Vasant Hegde,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Alistair Popple, Peter Zijlstra, Uladzislau Rezki,
Jean-Philippe Brucker, Andy Lutomirski, Yi Lai, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Dave Hansen
On 10/17/25 03:35, David Hildenbrand wrote:
> On 14.10.25 15:04, Lu Baolu wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> This introduces a conditional asynchronous mechanism, enabled by
>> CONFIG_ASYNC_KERNEL_PGTABLE_FREE. When enabled, this mechanism defers the
>> freeing of pages that are used as page tables for kernel address
>> mappings.
>> These pages are now queued to a work struct instead of being freed
>> immediately.
>>
>> This deferred freeing allows for batch-freeing of page tables, providing
>> a safe context for performing a single expensive operation (TLB flush)
>> for a batch of kernel page tables instead of performing that expensive
>> operation for each page table.
>>
>> On x86, CONFIG_ASYNC_KERNEL_PGTABLE_FREE is selected if CONFIG_IOMMU_SVA
>> is enabled, because both Intel and AMD IOMMU architectures could
>> potentially cache kernel page table entries in their paging structure
>> cache, regardless of the permission.
>
> See below, I assume this is patch #7 material.
>
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>> arch/x86/Kconfig | 1 +
>> mm/Kconfig | 3 +++
>> include/linux/mm.h | 16 +++++++++++++---
>> mm/pgtable-generic.c | 37 +++++++++++++++++++++++++++++++++++++
>> 4 files changed, 54 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index fa3b616af03a..ded29ee848fd 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -279,6 +279,7 @@ config X86
>> select HAVE_PCI
>> select HAVE_PERF_REGS
>> select HAVE_PERF_USER_STACK_DUMP
>> + select ASYNC_KERNEL_PGTABLE_FREE if IOMMU_SVA
>
> That should belong into patch #7, no?
Yes. Done.
Thanks,
baolu
* [PATCH v6 7/7] iommu/sva: Invalidate stale IOTLB entries for kernel address space
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (5 preceding siblings ...)
2025-10-14 13:04 ` [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables Lu Baolu
@ 2025-10-14 13:04 ` Lu Baolu
2025-10-14 20:59 ` [syzbot ci] Re: Fix " syzbot ci
2025-10-15 0:43 ` [PATCH v6 0/7] " Andrew Morton
8 siblings, 0 replies; 31+ messages in thread
From: Lu Baolu @ 2025-10-14 13:04 UTC (permalink / raw)
To: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Andrew Morton, Vlastimil Babka, Mike Rapoport,
Michal Hocko, Matthew Wilcox
Cc: iommu, security, x86, linux-mm, linux-kernel, Lu Baolu, stable
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
A Use-After-Free (UAF) and Write-After-Free (WAF) condition arises when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a Write-After-Free
issue, as the IOMMU will potentially continue to write Accessed and Dirty
bits to the freed memory while attempting to walk the stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables
all the way down to the leaf entries, where it realizes the mapping
is for the kernel and errors out. This means the IOMMU still caches
these intermediate page table entries, making the described vulnerability
a real concern.
To mitigate this, a new IOMMU interface is introduced to flush IOTLB
entries for the kernel address space. This interface is invoked from the
x86 architecture code that manages combined user and kernel page tables,
specifically before any kernel page table page is freed and reused.
This addresses the main issue with vfree(), which is a common occurrence
and can be triggered by unprivileged users. While this resolves the
primary problem, it doesn't address an extremely rare case related to
memory unplug of memory that was present as reserved memory at boot,
which cannot be triggered by unprivileged users. The discussion can be
found at the link below.
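With the deferred freeing introduced in the previous patch, the work
function is the one place where a batch of kernel page table pages goes
back to the allocator, so a single invalidation covers the whole batch.
Roughly (a sketch of the hunk below):

	kernel_pgtable_work_func()
	    iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
	    /* then __pagetable_free() every queued page table */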
Fixes: 26b25a2b98e4 ("iommu: Bind process address spaces to devices")
Cc: stable@vger.kernel.org
Suggested-by: Jann Horn <jannh@google.com>
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/linux-iommu/04983c62-3b1d-40d4-93ae-34ca04b827e5@intel.com/
---
include/linux/iommu.h | 4 ++++
drivers/iommu/iommu-sva.c | 29 ++++++++++++++++++++++++++++-
mm/pgtable-generic.c | 2 ++
3 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c30d12e16473..66e4abb2df0d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1134,7 +1134,9 @@ struct iommu_sva {
struct iommu_mm_data {
u32 pasid;
+ struct mm_struct *mm;
struct list_head sva_domains;
+ struct list_head mm_list_elm;
};
int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode);
@@ -1615,6 +1617,7 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
struct mm_struct *mm);
void iommu_sva_unbind_device(struct iommu_sva *handle);
u32 iommu_sva_get_pasid(struct iommu_sva *handle);
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end);
#else
static inline struct iommu_sva *
iommu_sva_bind_device(struct device *dev, struct mm_struct *mm)
@@ -1639,6 +1642,7 @@ static inline u32 mm_get_enqcmd_pasid(struct mm_struct *mm)
}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
+static inline void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end) {}
#endif /* CONFIG_IOMMU_SVA */
#ifdef CONFIG_IOMMU_IOPF
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 1a51cfd82808..d236aef80a8d 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -10,6 +10,8 @@
#include "iommu-priv.h"
static DEFINE_MUTEX(iommu_sva_lock);
+static bool iommu_sva_present;
+static LIST_HEAD(iommu_sva_mms);
static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
struct mm_struct *mm);
@@ -42,6 +44,7 @@ static struct iommu_mm_data *iommu_alloc_mm_data(struct mm_struct *mm, struct de
return ERR_PTR(-ENOSPC);
}
iommu_mm->pasid = pasid;
+ iommu_mm->mm = mm;
INIT_LIST_HEAD(&iommu_mm->sva_domains);
/*
* Make sure the write to mm->iommu_mm is not reordered in front of
@@ -132,8 +135,13 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct *mm
if (ret)
goto out_free_domain;
domain->users = 1;
- list_add(&domain->next, &mm->iommu_mm->sva_domains);
+ if (list_empty(&iommu_mm->sva_domains)) {
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = true;
+ list_add(&iommu_mm->mm_list_elm, &iommu_sva_mms);
+ }
+ list_add(&domain->next, &iommu_mm->sva_domains);
out:
refcount_set(&handle->users, 1);
mutex_unlock(&iommu_sva_lock);
@@ -175,6 +183,13 @@ void iommu_sva_unbind_device(struct iommu_sva *handle)
list_del(&domain->next);
iommu_domain_free(domain);
}
+
+ if (list_empty(&iommu_mm->sva_domains)) {
+ list_del(&iommu_mm->mm_list_elm);
+ if (list_empty(&iommu_sva_mms))
+ iommu_sva_present = false;
+ }
+
mutex_unlock(&iommu_sva_lock);
kfree(handle);
}
@@ -312,3 +327,15 @@ static struct iommu_domain *iommu_sva_domain_alloc(struct device *dev,
return domain;
}
+
+void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
+{
+ struct iommu_mm_data *iommu_mm;
+
+ guard(mutex)(&iommu_sva_lock);
+ if (!iommu_sva_present)
+ return;
+
+ list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
+ mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm, start, end);
+}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 1c7caa8ef164..8c22be79b734 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mm_inline.h>
+#include <linux/iommu.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
@@ -430,6 +431,7 @@ static void kernel_pgtable_work_func(struct work_struct *work)
list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
spin_unlock(&kernel_pgtable_work.lock);
+ iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
list_for_each_entry_safe(pt, next, &page_list, pt_list)
__pagetable_free(pt);
}
--
2.43.0
* [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (6 preceding siblings ...)
2025-10-14 13:04 ` [PATCH v6 7/7] iommu/sva: Invalidate stale IOTLB entries for kernel address space Lu Baolu
@ 2025-10-14 20:59 ` syzbot ci
2025-10-15 16:25 ` Dave Hansen
2025-10-15 0:43 ` [PATCH v6 0/7] " Andrew Morton
8 siblings, 1 reply; 31+ messages in thread
From: syzbot ci @ 2025-10-14 20:59 UTC (permalink / raw)
To: akpm, apopple, baolu.lu, bp, dave.hansen, dave.hansen, david,
iommu, jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v6] Fix stale IOTLB entries for kernel address space
https://lore.kernel.org/all/20251014130437.1090448-1-baolu.lu@linux.intel.com
* [PATCH v6 1/7] mm: Add a ptdesc flag to mark kernel page tables
* [PATCH v6 2/7] mm: Actually mark kernel page table pages
* [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
* [PATCH v6 4/7] mm: Introduce pure page table freeing function
* [PATCH v6 5/7] x86/mm: Use pagetable_free()
* [PATCH v6 6/7] mm: Introduce deferred freeing for kernel page tables
* [PATCH v6 7/7] iommu/sva: Invalidate stale IOTLB entries for kernel address space
and found the following issues:
* KASAN: use-after-free Read in pmd_set_huge
* KASAN: use-after-free Read in vmap_range_noflush
* PANIC: double fault in search_extable
Full report is available here:
https://ci.syzbot.org/series/9d75a765-d6b2-4839-8db9-2f2e64e78cdd
***
KASAN: use-after-free Read in pmd_set_huge
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 0d97f2067c166eb495771fede9f7b73999c67f66
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/68e38247-432a-45b2-b187-a533b7040841/config
syz repro: https://ci.syzbot.org/findings/ce54ec93-1f21-4deb-b2f8-d34917bd1be2/syz_repro
==================================================================
BUG: KASAN: use-after-free in pmd_set_huge+0xd8/0x340 arch/x86/mm/pgtable.c:676
Read of size 8 at addr ffff888100efa960 by task syz.0.20/5965
CPU: 1 UID: 0 PID: 5965 Comm: syz.0.20 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
pmd_set_huge+0xd8/0x340 arch/x86/mm/pgtable.c:676
vmap_try_huge_pmd mm/vmalloc.c:161 [inline]
vmap_pmd_range mm/vmalloc.c:177 [inline]
vmap_pud_range mm/vmalloc.c:233 [inline]
vmap_p4d_range mm/vmalloc.c:284 [inline]
vmap_range_noflush+0x7b3/0xf80 mm/vmalloc.c:308
__vmap_pages_range_noflush+0xd31/0xf30 mm/vmalloc.c:661
vmap_pages_range_noflush mm/vmalloc.c:681 [inline]
vmap_pages_range mm/vmalloc.c:701 [inline]
__vmalloc_area_node mm/vmalloc.c:3766 [inline]
__vmalloc_node_range_noprof+0xe8c/0x12d0 mm/vmalloc.c:3897
__kvmalloc_node_noprof+0x674/0x910 mm/slub.c:7058
nf_tables_newset+0x1330/0x2540 net/netfilter/nf_tables_api.c:5548
nfnetlink_rcv_batch net/netfilter/nfnetlink.c:526 [inline]
nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:649 [inline]
nfnetlink_rcv+0x11d9/0x2590 net/netfilter/nfnetlink.c:667
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x82f/0x9e0 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
____sys_sendmsg+0x505/0x830 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fc5fff8eec9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fc600ecb038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fc6001e5fa0 RCX: 00007fc5fff8eec9
RDX: 0000000004008100 RSI: 00002000000000c0 RDI: 0000000000000003
RBP: 00007fc600011f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fc6001e6038 R14: 00007fc6001e5fa0 R15: 00007ffed63a0428
</TASK>
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100efa
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
raw: 017ff00000000000 ffffea0004772f88 ffff88823c6403a0 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x40100(__GFP_ZERO|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 1659724794, free_ts 71235002142
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1850
prep_new_page mm/page_alloc.c:1858 [inline]
get_page_from_freelist+0x2365/0x2440 mm/page_alloc.c:3884
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5183
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
alloc_frozen_pages_noprof mm/mempolicy.c:2487 [inline]
alloc_pages_noprof+0xa9/0x190 mm/mempolicy.c:2507
pagetable_alloc_noprof include/linux/mm.h:3016 [inline]
pmd_alloc_one_noprof include/asm-generic/pgalloc.h:144 [inline]
__pmd_alloc+0x3a/0x5d0 mm/memory.c:6573
pmd_alloc_track mm/pgalloc-track.h:37 [inline]
vmap_pages_pmd_range mm/vmalloc.c:564 [inline]
vmap_pages_pud_range mm/vmalloc.c:587 [inline]
vmap_pages_p4d_range mm/vmalloc.c:605 [inline]
vmap_small_pages_range_noflush mm/vmalloc.c:627 [inline]
__vmap_pages_range_noflush+0x9cc/0xf30 mm/vmalloc.c:656
vmap_pages_range_noflush mm/vmalloc.c:681 [inline]
vmap_pages_range mm/vmalloc.c:701 [inline]
vmap+0x1ca/0x310 mm/vmalloc.c:3521
map_irq_stack arch/x86/kernel/irq_64.c:49 [inline]
irq_init_percpu_irqstack+0x342/0x4a0 arch/x86/kernel/irq_64.c:76
init_IRQ+0x15c/0x1c0 arch/x86/kernel/irqinit.c:90
start_kernel+0x1cd/0x410 init/main.c:1016
x86_64_start_reservations+0x24/0x30 arch/x86/kernel/head64.c:310
x86_64_start_kernel+0x143/0x1c0 arch/x86/kernel/head64.c:291
common_startup_64+0x13e/0x147
page last free pid 5965 tgid 5964 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1394 [inline]
__free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
vmap_pmd_range mm/vmalloc.c:177 [inline]
vmap_pud_range mm/vmalloc.c:233 [inline]
vmap_p4d_range mm/vmalloc.c:284 [inline]
vmap_range_noflush+0x774/0xf80 mm/vmalloc.c:308
__vmap_pages_range_noflush+0xd31/0xf30 mm/vmalloc.c:661
vmap_pages_range_noflush mm/vmalloc.c:681 [inline]
vmap_pages_range mm/vmalloc.c:701 [inline]
__vmalloc_area_node mm/vmalloc.c:3766 [inline]
__vmalloc_node_range_noprof+0xe8c/0x12d0 mm/vmalloc.c:3897
__kvmalloc_node_noprof+0x674/0x910 mm/slub.c:7058
nf_tables_newset+0x1330/0x2540 net/netfilter/nf_tables_api.c:5548
nfnetlink_rcv_batch net/netfilter/nfnetlink.c:526 [inline]
nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:649 [inline]
nfnetlink_rcv+0x11d9/0x2590 net/netfilter/nfnetlink.c:667
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x82f/0x9e0 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
____sys_sendmsg+0x505/0x830 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x19b/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Memory state around the buggy address:
ffff888100efa800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888100efa880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff888100efa900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff888100efa980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888100efaa00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
***
KASAN: use-after-free Read in vmap_range_noflush
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 0d97f2067c166eb495771fede9f7b73999c67f66
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/68e38247-432a-45b2-b187-a533b7040841/config
C repro: https://ci.syzbot.org/findings/b676cfe4-8c9a-435c-aa8f-7315912fa378/c_repro
syz repro: https://ci.syzbot.org/findings/b676cfe4-8c9a-435c-aa8f-7315912fa378/syz_repro
==================================================================
BUG: KASAN: use-after-free in vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
BUG: KASAN: use-after-free in vmap_pmd_range mm/vmalloc.c:177 [inline]
BUG: KASAN: use-after-free in vmap_pud_range mm/vmalloc.c:233 [inline]
BUG: KASAN: use-after-free in vmap_p4d_range mm/vmalloc.c:284 [inline]
BUG: KASAN: use-after-free in vmap_range_noflush+0x743/0xf80 mm/vmalloc.c:308
Read of size 8 at addr ffff888100efa128 by task syz.0.17/5955
CPU: 1 UID: 0 PID: 5955 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
vmap_pmd_range mm/vmalloc.c:177 [inline]
vmap_pud_range mm/vmalloc.c:233 [inline]
vmap_p4d_range mm/vmalloc.c:284 [inline]
vmap_range_noflush+0x743/0xf80 mm/vmalloc.c:308
__vmap_pages_range_noflush+0xd31/0xf30 mm/vmalloc.c:661
vmap_pages_range_noflush mm/vmalloc.c:681 [inline]
vmap_pages_range mm/vmalloc.c:701 [inline]
__vmalloc_area_node mm/vmalloc.c:3766 [inline]
__vmalloc_node_range_noprof+0xe8c/0x12d0 mm/vmalloc.c:3897
__kvmalloc_node_noprof+0x674/0x910 mm/slub.c:7058
kvmalloc_array_node_noprof include/linux/slab.h:1122 [inline]
bpf_uprobe_multi_link_attach+0x54b/0xee0 kernel/trace/bpf_trace.c:3228
link_create+0x673/0x850 kernel/bpf/syscall.c:5721
__sys_bpf+0x6be/0x860 kernel/bpf/syscall.c:6204
__do_sys_bpf kernel/bpf/syscall.c:6244 [inline]
__se_sys_bpf kernel/bpf/syscall.c:6242 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6242
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f3d8e78eec9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f3d8f64c038 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007f3d8e9e5fa0 RCX: 00007f3d8e78eec9
RDX: 0000000000000040 RSI: 00002000000005c0 RDI: 000000000000001c
RBP: 00007f3d8e811f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f3d8e9e6038 R14: 00007f3d8e9e5fa0 R15: 00007ffe8caa72c8
</TASK>
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100efa
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
raw: 017ff00000000000 ffffea00044109c8 ffff88823c6403a0 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x40100(__GFP_ZERO|__GFP_COMP), pid 0, tgid 0 (swapper/0), ts 1684936790, free_ts 91274246476
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1850
prep_new_page mm/page_alloc.c:1858 [inline]
get_page_from_freelist+0x2365/0x2440 mm/page_alloc.c:3884
__alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5183
alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
alloc_frozen_pages_noprof mm/mempolicy.c:2487 [inline]
alloc_pages_noprof+0xa9/0x190 mm/mempolicy.c:2507
pagetable_alloc_noprof include/linux/mm.h:3016 [inline]
pmd_alloc_one_noprof include/asm-generic/pgalloc.h:144 [inline]
__pmd_alloc+0x3a/0x5d0 mm/memory.c:6573
pmd_alloc_track mm/pgalloc-track.h:37 [inline]
vmap_pages_pmd_range mm/vmalloc.c:564 [inline]
vmap_pages_pud_range mm/vmalloc.c:587 [inline]
vmap_pages_p4d_range mm/vmalloc.c:605 [inline]
vmap_small_pages_range_noflush mm/vmalloc.c:627 [inline]
__vmap_pages_range_noflush+0x9cc/0xf30 mm/vmalloc.c:656
vmap_pages_range_noflush mm/vmalloc.c:681 [inline]
vmap_pages_range mm/vmalloc.c:701 [inline]
vmap+0x1ca/0x310 mm/vmalloc.c:3521
map_irq_stack arch/x86/kernel/irq_64.c:49 [inline]
irq_init_percpu_irqstack+0x342/0x4a0 arch/x86/kernel/irq_64.c:76
init_IRQ+0x15c/0x1c0 arch/x86/kernel/irqinit.c:90
start_kernel+0x1cd/0x410 init/main.c:1016
x86_64_start_reservations+0x24/0x30 arch/x86/kernel/head64.c:310
x86_64_start_kernel+0x143/0x1c0 arch/x86/kernel/head64.c:291
common_startup_64+0x13e/0x147
page last free pid 5892 tgid 5892 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1394 [inline]
__free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
__pagetable_free include/linux/mm.h:3026 [inline]
kernel_pgtable_work_func+0x276/0x2e0 mm/pgtable-generic.c:436
process_one_work kernel/workqueue.c:3263 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Memory state around the buggy address:
ffff888100efa000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888100efa080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff888100efa100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff888100efa180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888100efa200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
***
PANIC: double fault in search_extable
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 0d97f2067c166eb495771fede9f7b73999c67f66
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/68e38247-432a-45b2-b187-a533b7040841/config
syz repro: https://ci.syzbot.org/findings/967ed946-aab2-484a-8267-954586f5962b/syz_repro
traps: PANIC: double fault, error_code: 0x0
Oops: double fault: 0000 [#1] SMP KASAN PTI
CPU: 1 UID: 0 PID: 5921 Comm: syz-executor Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:search_extable+0x69/0xd0 lib/extable.c:115
Code: 8d 48 c7 44 24 10 20 50 40 8b 49 89 e5 49 c1 ed 03 48 b8 f1 f1 f1 f1 00 f3 f3 f3 49 bc 00 00 00 00 00 fc ff df 4b 89 44 25 00 <e8> 12 45 7f f6 48 89 5c 24 20 b9 0c 00 00 00 48 8d 7c 24 20 4c 89
RSP: 0018:ffffc90003e5f000 EFLAGS: 00010806
RAX: f3f3f300f1f1f1f1 RBX: ffffffff8b4b123e RCX: 0000000000001c56
RDX: ffffffff8b4b123e RSI: 0000000000000972 RDI: ffffffff8dc137d0
RBP: ffffc90003e5f0a0 R08: 0000000000000001 R09: 0000000000000002
R10: 0000000000000011 R11: 0000000000000000 R12: dffffc0000000000
R13: 1ffff920007cbe00 R14: 0000000000000972 R15: ffffffff8dc137d0
FS: 000055558b2ef500(0000) GS:ffff8882a9d0f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90003e5eff8 CR3: 00000001ba5ea000 CR4: 00000000000006f0
Call Trace:
<TASK>
search_kernel_exception_table kernel/extable.c:49 [inline]
search_exception_tables+0x3a/0x60 kernel/extable.c:58
fixup_exception+0xb1/0x20b0 arch/x86/mm/extable.c:319
kernelmode_fixup_or_oops+0x68/0xf0 arch/x86/mm/fault.c:726
__bad_area_nosemaphore+0x11a/0x780 arch/x86/mm/fault.c:783
handle_page_fault arch/x86/mm/fault.c:1474 [inline]
exc_page_fault+0xcf/0x100 arch/x86/mm/fault.c:1532
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
RIP: 0010:in_irq_stack arch/x86/kernel/dumpstack_64.c:165 [inline]
RIP: 0010:get_stack_info_noinstr+0xee/0x130 arch/x86/kernel/dumpstack_64.c:182
Code: 08 48 8d 90 08 80 ff ff 49 39 d7 40 0f 92 c6 49 39 cf 40 0f 93 c7 40 08 f7 75 27 41 c7 06 02 00 00 00 49 89 56 08 49 89 4e 10 <48> 8b 00 49 89 46 18 89 d8 5b 41 5c 41 5d 41 5e 41 5f e9 8b 12 03
RSP: 0018:ffffc90003e5f470 EFLAGS: 00010046
RAX: ffffc90000a08ff8 RBX: ffff88816ac1ba01 RCX: ffffc90000a09000
RDX: ffffc90000a01000 RSI: ffffffff8d837700 RDI: ffffffff8bc07500
RBP: ffffc90003e5f630 R08: ffffc90003e5f500 R09: 0000000000000000
R10: ffffc90003e5f5a0 R11: fffff520007cbeb8 R12: ffff88816ac1ba00
R13: fffffe000004f000 R14: ffffc90003e5f5a0 R15: ffffc90000a08ff8
get_stack_guard_info arch/x86/include/asm/stacktrace.h:45 [inline]
page_fault_oops+0x12a/0xa10 arch/x86/mm/fault.c:663
__bad_area_nosemaphore+0x11a/0x780 arch/x86/mm/fault.c:783
handle_page_fault arch/x86/mm/fault.c:1474 [inline]
exc_page_fault+0xcf/0x100 arch/x86/mm/fault.c:1532
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:623
RIP: 0010:instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1052 [inline]
RIP: 0010:sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1052
Code: 00 00 48 c7 c7 c0 b4 67 8b e8 ae 23 00 00 65 c6 05 50 d7 45 07 01 48 c7 c7 a0 b4 67 8b e8 9a 23 00 00 65 4c 8b 1d 02 d7 45 07 <49> 89 23 4c 89 dc e8 77 23 39 f6 48 89 df e8 4f 2f 25 f6 e8 8a 24
RSP: 0018:ffffc90003e5f830 EFLAGS: 00010082
RAX: 0000000000000001 RBX: ffffc90003e5f848 RCX: 4d01a0d08cb75600
RDX: 0000000000000000 RSI: ffffffff8b67b4a0 RDI: ffffffff8bc07560
RBP: 0000000000000000 R08: ffffffff8f9e1177 R09: 1ffffffff1f3c22e
R10: dffffc0000000000 R11: ffffc90000a08ff8 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
RIP: 0010:check_preemption_disabled+0x0/0x120 lib/smp_processor_id.c:13
Code: c7 00 75 c0 8b 48 c7 c6 40 75 c0 8b eb 1c 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <55> 41 57 41 56 53 48 83 ec 10 65 48 8b 05 ae b4 45 07 48 89 44 24
RSP: 0018:ffffc90003e5f8f0 EFLAGS: 00000282
RAX: 0000000000000000 RBX: 00007f5f1858e627 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffff8bc07540 RDI: ffffffff8bc07500
RBP: 0000000000000001 R08: 0000000000000022 R09: ffffffff81731d25
R10: ffffc90003e5f9b8 R11: ffffffff81abbe80 R12: ffff88816ac1ba00
R13: dffffc0000000000 R14: dffffc0000000000 R15: 1ffff920007cbf36
rcu_is_watching_curr_cpu include/linux/context_tracking.h:128 [inline]
rcu_is_watching+0x15/0xb0 kernel/rcu/tree.c:751
kernel_text_address+0x80/0xe0 kernel/extable.c:113
__kernel_text_address+0xd/0x40 kernel/extable.c:79
unwind_get_return_address+0x4d/0x90 arch/x86/kernel/unwind_orc.c:369
arch_stack_walk+0xfc/0x150 arch/x86/kernel/stacktrace.c:26
stack_trace_save+0x9c/0xe0 kernel/stacktrace.c:122
ref_tracker_free+0xef/0x7d0 lib/ref_tracker.c:307
__netns_tracker_free include/net/net_namespace.h:379 [inline]
put_net_track include/net/net_namespace.h:394 [inline]
__sk_destruct+0x3c3/0x660 net/core/sock.c:2368
sock_put include/net/sock.h:1972 [inline]
unix_release_sock+0xa7b/0xd50 net/unix/af_unix.c:732
unix_release+0x92/0xd0 net/unix/af_unix.c:1196
__sock_release net/socket.c:662 [inline]
sock_close+0xc3/0x240 net/socket.c:1455
__fput+0x44c/0xa70 fs/file_table.c:468
fput_close_sync+0x119/0x200 fs/file_table.c:573
__do_sys_close fs/open.c:1589 [inline]
__se_sys_close fs/open.c:1574 [inline]
__x64_sys_close+0x7f/0x110 fs/open.c:1574
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f1858e627
Code: 44 00 00 48 c7 c2 a8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 a8 ff ff ff f7 d8 64 89 02 b8
RSP: 002b:00007ffec60e5be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f5f1858e627
RDX: 0000000000000000 RSI: 0000000000008933 RDI: 0000000000000005
RBP: 00007ffec60e5bf0 R08: 000000000000000a R09: 0000000000000001
R10: 000000000000000f R11: 0000000000000246 R12: 0000000000000024
R13: 000000000000002d R14: 00007f5f19314620 R15: 0000000000000024
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:search_extable+0x69/0xd0 lib/extable.c:115
Code: 8d 48 c7 44 24 10 20 50 40 8b 49 89 e5 49 c1 ed 03 48 b8 f1 f1 f1 f1 00 f3 f3 f3 49 bc 00 00 00 00 00 fc ff df 4b 89 44 25 00 <e8> 12 45 7f f6 48 89 5c 24 20 b9 0c 00 00 00 48 8d 7c 24 20 4c 89
RSP: 0018:ffffc90003e5f000 EFLAGS: 00010806
RAX: f3f3f300f1f1f1f1 RBX: ffffffff8b4b123e RCX: 0000000000001c56
RDX: ffffffff8b4b123e RSI: 0000000000000972 RDI: ffffffff8dc137d0
RBP: ffffc90003e5f0a0 R08: 0000000000000001 R09: 0000000000000002
R10: 0000000000000011 R11: 0000000000000000 R12: dffffc0000000000
R13: 1ffff920007cbe00 R14: 0000000000000972 R15: ffffffff8dc137d0
FS: 000055558b2ef500(0000) GS:ffff8882a9d0f000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90003e5eff8 CR3: 00000001ba5ea000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
0: 8d 48 c7 lea -0x39(%rax),%ecx
3: 44 24 10 rex.R and $0x10,%al
6: 20 50 40 and %dl,0x40(%rax)
9: 8b 49 89 mov -0x77(%rcx),%ecx
c: e5 49 in $0x49,%eax
e: c1 ed 03 shr $0x3,%ebp
11: 48 b8 f1 f1 f1 f1 00 movabs $0xf3f3f300f1f1f1f1,%rax
18: f3 f3 f3
1b: 49 bc 00 00 00 00 00 movabs $0xdffffc0000000000,%r12
22: fc ff df
25: 4b 89 44 25 00 mov %rax,0x0(%r13,%r12,1)
* 2a: e8 12 45 7f f6 call 0xf67f4541 <-- trapping instruction
2f: 48 89 5c 24 20 mov %rbx,0x20(%rsp)
34: b9 0c 00 00 00 mov $0xc,%ecx
39: 48 8d 7c 24 20 lea 0x20(%rsp),%rdi
3e: 4c rex.WR
3f: 89 .byte 0x89
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-14 20:59 ` [syzbot ci] Re: Fix " syzbot ci
@ 2025-10-15 16:25 ` Dave Hansen
2025-10-16 8:00 ` Baolu Lu
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2025-10-15 16:25 UTC (permalink / raw)
To: syzbot ci, akpm, apopple, baolu.lu, bp, dave.hansen, david, iommu,
jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
Here's the part that confuses me:
On 10/14/25 13:59, syzbot ci wrote:
> page last free pid 5965 tgid 5964 stack trace:
> reset_page_owner include/linux/page_owner.h:25 [inline]
> free_pages_prepare mm/page_alloc.c:1394 [inline]
> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
> pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
> vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
...
So, vmap_try_huge_pmd() did a pmd_free_pte_page(). Yet, somehow, the PMD
stuck around so that it *could* be used after being freed. It _looks_
like pmd_free_pte_page() freed the page, returned 0, and made
vmap_try_huge_pmd() return early, skipping the pmd_set_huge().
But I don't know how that could possibly happen.
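For reference, the path in question looks roughly like this (a simplified
sketch of mm/vmalloc.c, not the exact code in this tree):

static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
                             phys_addr_t phys_addr, pgprot_t prot,
                             unsigned int max_page_shift)
{
        /* size/alignment checks elided ... */

        /* Free the PTE table this PMD points to so it can go huge. */
        if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
                return 0;       /* fall back to small mappings */

        /* Overwrite the (now clear) PMD entry with a huge mapping. */
        return pmd_set_huge(pmd, phys_addr, prot);
}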
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-15 16:25 ` Dave Hansen
@ 2025-10-16 8:00 ` Baolu Lu
2025-10-17 17:05 ` Dave Hansen
2025-10-17 17:10 ` David Hildenbrand
0 siblings, 2 replies; 31+ messages in thread
From: Baolu Lu @ 2025-10-16 8:00 UTC (permalink / raw)
To: Dave Hansen, syzbot ci, akpm, apopple, bp, dave.hansen, david,
iommu, jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
On 10/16/25 00:25, Dave Hansen wrote:
> Here's the part that confuses me:
>
> On 10/14/25 13:59, syzbot ci wrote:
>> page last free pid 5965 tgid 5964 stack trace:
>> reset_page_owner include/linux/page_owner.h:25 [inline]
>> free_pages_prepare mm/page_alloc.c:1394 [inline]
>> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
>> pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
>> vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
> ...
>
> So, vmap_try_huge_pmd() did a pmd_free_pte_page(). Yet, somehow, the PMD
> stuck around so that it *could* be used after being freed. It _looks_
> like pmd_free_pte_page() freed the page, returned 0, and made
> vmap_try_huge_pmd() return early, skipping the pmd pmd_set_huge().
>
> But I don't know how that could possibly happen.
The reported issue is only related to this patch:
- [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
It appears that the pmd_ptdesc() helper can't be used directly here in
this patch. pmd_ptdesc() retrieves the page table page that the PMD
entry resides in:
static inline struct page *pmd_pgtable_page(pmd_t *pmd)
{
        unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

        return virt_to_page((void *)((unsigned long) pmd & mask));
}

static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
{
        return page_ptdesc(pmd_pgtable_page(pmd));
}
while, in this patch, we need the page descriptor that a pmd entry
points to. Perhaps we should roll back to the previous approach used in
v5?
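In other words, the two lookups differ roughly like this (a sketch of the
distinction only, not code from the series):

        /* What pmd_ptdesc() computes: the PMD-level table the *entry
         * lives in*, found by masking the entry's address. */
        struct ptdesc *holder = pmd_ptdesc(pmd);

        /* What pmd_free_pte_page() has to free: the PTE table the entry
         * *points to*, reached through the entry's value. */
        struct page *pte_table = pmd_page(*pmd);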
I'm sorry that I didn't discover this during my development testing.
Fortunately, I can reproduce it stably on my development machine now.
Thanks,
baolu
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-16 8:00 ` Baolu Lu
@ 2025-10-17 17:05 ` Dave Hansen
2025-10-17 17:10 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: Dave Hansen @ 2025-10-17 17:05 UTC (permalink / raw)
To: Baolu Lu, syzbot ci, akpm, apopple, bp, dave.hansen, david, iommu,
jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
On 10/16/25 01:00, Baolu Lu wrote:
> while, in this patch, we need the page descriptor that a pmd entry
> points to. Perhaps we should roll back to the previous approach used in
> v5?
Yeah, we should roll back to what v5 did. pmd_ptdesc() is a bit deceiving.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-16 8:00 ` Baolu Lu
2025-10-17 17:05 ` Dave Hansen
@ 2025-10-17 17:10 ` David Hildenbrand
2025-10-20 5:34 ` Baolu Lu
1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-10-17 17:10 UTC (permalink / raw)
To: Baolu Lu, Dave Hansen, syzbot ci, akpm, apopple, bp, dave.hansen,
iommu, jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
On 16.10.25 10:00, Baolu Lu wrote:
> On 10/16/25 00:25, Dave Hansen wrote:
>> Here's the part that confuses me:
>>
>> On 10/14/25 13:59, syzbot ci wrote:
>>> page last free pid 5965 tgid 5964 stack trace:
>>> reset_page_owner include/linux/page_owner.h:25 [inline]
>>> free_pages_prepare mm/page_alloc.c:1394 [inline]
>>> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
>>> pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
>>> vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
>> ...
>>
>> So, vmap_try_huge_pmd() did a pmd_free_pte_page(). Yet, somehow, the PMD
>> stuck around so that it *could* be used after being freed. It _looks_
>> like pmd_free_pte_page() freed the page, returned 0, and made
>> vmap_try_huge_pmd() return early, skipping the pmd pmd_set_huge().
>>
>> But I don't know how that could possibly happen.
>
> The reported issue is only related to this patch:
>
> - [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
>
> It appears that the pmd_ptdesc() helper can't be used directly here in
> this patch. pmd_ptdesc() retrieves the page table page that the PMD
> entry resides in:
>
> static inline struct page *pmd_pgtable_page(pmd_t *pmd)
> {
> unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
> return virt_to_page((void *)((unsigned long) pmd & mask));
> }
>
> static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
> {
> return page_ptdesc(pmd_pgtable_page(pmd));
> }
>
> while, in this patch, we need the page descriptor that a pmd entry
> points to.
Ah. But that's just pointing at a leaf page table, right?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-17 17:10 ` David Hildenbrand
@ 2025-10-20 5:34 ` Baolu Lu
2025-10-20 14:26 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Baolu Lu @ 2025-10-20 5:34 UTC (permalink / raw)
To: David Hildenbrand, Dave Hansen, syzbot ci, akpm, apopple, bp,
dave.hansen, iommu, jannh, jean-philippe, jgg, joro, kevin.tian,
liam.howlett, linux-kernel, linux-mm, lorenzo.stoakes, luto,
mhocko, mingo, peterz, robin.murphy, rppt, security, stable, tglx,
urezki, vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
On 10/18/25 01:10, David Hildenbrand wrote:
> On 16.10.25 10:00, Baolu Lu wrote:
>> On 10/16/25 00:25, Dave Hansen wrote:
>>> Here's the part that confuses me:
>>>
>>> On 10/14/25 13:59, syzbot ci wrote:
>>>> page last free pid 5965 tgid 5964 stack trace:
>>>> reset_page_owner include/linux/page_owner.h:25 [inline]
>>>> free_pages_prepare mm/page_alloc.c:1394 [inline]
>>>> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
>>>> pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
>>>> vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
>>> ...
>>>
>>> So, vmap_try_huge_pmd() did a pmd_free_pte_page(). Yet, somehow, the PMD
>>> stuck around so that it *could* be used after being freed. It _looks_
>>> like pmd_free_pte_page() freed the page, returned 0, and made
>>> vmap_try_huge_pmd() return early, skipping the pmd pmd_set_huge().
>>>
>>> But I don't know how that could possibly happen.
>>
>> The reported issue is only related to this patch:
>>
>> - [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
>>
>> It appears that the pmd_ptdesc() helper can't be used directly here in
>> this patch. pmd_ptdesc() retrieves the page table page that the PMD
>> entry resides in:
>>
>> static inline struct page *pmd_pgtable_page(pmd_t *pmd)
>> {
>> unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
>> return virt_to_page((void *)((unsigned long) pmd & mask));
>> }
>>
>> static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
>> {
>> return page_ptdesc(pmd_pgtable_page(pmd));
>> }
>>
>> while, in this patch, we need the page descriptor that a pmd entry
>> points to.
>
> Ah. But that's just pointing at a leaf page table, right?
Yes, that points to a leaf page table.
These two helpers are called in vmap_try_huge_pmd/pud() to clean up the
low-level page tables and make room for pmd/pud_set_huge(). The huge
page entry case shouldn't go through these paths; otherwise, the code is
already broken.
Thanks,
baolu
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [syzbot ci] Re: Fix stale IOTLB entries for kernel address space
2025-10-20 5:34 ` Baolu Lu
@ 2025-10-20 14:26 ` David Hildenbrand
0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-10-20 14:26 UTC (permalink / raw)
To: Baolu Lu, Dave Hansen, syzbot ci, akpm, apopple, bp, dave.hansen,
iommu, jannh, jean-philippe, jgg, joro, kevin.tian, liam.howlett,
linux-kernel, linux-mm, lorenzo.stoakes, luto, mhocko, mingo,
peterz, robin.murphy, rppt, security, stable, tglx, urezki,
vasant.hegde, vbabka, will, willy, x86, yi1.lai
Cc: syzbot, syzkaller-bugs
On 20.10.25 07:34, Baolu Lu wrote:
> On 10/18/25 01:10, David Hildenbrand wrote:
>> On 16.10.25 10:00, Baolu Lu wrote:
>>> On 10/16/25 00:25, Dave Hansen wrote:
>>>> Here's the part that confuses me:
>>>>
>>>> On 10/14/25 13:59, syzbot ci wrote:
>>>>> page last free pid 5965 tgid 5964 stack trace:
>>>>> reset_page_owner include/linux/page_owner.h:25 [inline]
>>>>> free_pages_prepare mm/page_alloc.c:1394 [inline]
>>>>> __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
>>>>> pmd_free_pte_page+0xa1/0xc0 arch/x86/mm/pgtable.c:783
>>>>> vmap_try_huge_pmd mm/vmalloc.c:158 [inline]
>>>> ...
>>>>
>>>> So, vmap_try_huge_pmd() did a pmd_free_pte_page(). Yet, somehow, the PMD
>>>> stuck around so that it *could* be used after being freed. It _looks_
>>>> like pmd_free_pte_page() freed the page, returned 0, and made
>>>> vmap_try_huge_pmd() return early, skipping the pmd pmd_set_huge().
>>>>
>>>> But I don't know how that could possibly happen.
>>>
>>> The reported issue is only related to this patch:
>>>
>>> - [PATCH v6 3/7] x86/mm: Use 'ptdesc' when freeing PMD pages
>>>
>>> It appears that the pmd_ptdesc() helper can't be used directly here in
>>> this patch. pmd_ptdesc() retrieves the page table page that the PMD
>>> entry resides in:
>>>
>>> static inline struct page *pmd_pgtable_page(pmd_t *pmd)
>>> {
>>> unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
>>> return virt_to_page((void *)((unsigned long) pmd & mask));
>>> }
>>>
>>> static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
>>> {
>>> return page_ptdesc(pmd_pgtable_page(pmd));
>>> }
>>>
>>> while, in this patch, we need the page descriptor that a pmd entry
>>> points to.
>>
>> Ah. But that's just pointing at a leaf page table, right?
>
> Yes, that points to a leaf page table.
Right, so I guess there is no simplifying that.
Like we have in pte_lockptr():
return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
So page_ptdesc(pmd_page()) is the way to go for now.
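Concretely, a sketch of what that would look like in pmd_free_pte_page()
(assuming the generic pagetable_free() helper; the actual respin may use
the deferred-free variant instead):

int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
{
        /* The leaf (PTE) table this PMD entry points to ... */
        struct ptdesc *ptdesc = page_ptdesc(pmd_page(*pmd));

        pmd_clear(pmd);
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE - 1);

        /* ... not pmd_ptdesc(pmd), which is the table the entry lives in. */
        pagetable_free(ptdesc);

        return 1;
}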
Sorry for the confusion.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-14 13:04 [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space Lu Baolu
` (7 preceding siblings ...)
2025-10-14 20:59 ` [syzbot ci] Re: Fix " syzbot ci
@ 2025-10-15 0:43 ` Andrew Morton
2025-10-15 5:38 ` Baolu Lu
8 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-10-15 0:43 UTC (permalink / raw)
To: Lu Baolu
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox, iommu, security, x86, linux-mm, linux-kernel
On Tue, 14 Oct 2025 21:04:30 +0800 Lu Baolu <baolu.lu@linux.intel.com> wrote:
> This proposes a fix for a security vulnerability related to IOMMU Shared
> Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
> page table entries. When a kernel page table page is freed and
> reallocated for another purpose, the IOMMU might still hold stale,
> incorrect entries. This can be exploited to cause a use-after-free or
> write-after-free condition, potentially leading to privilege escalation
> or data corruption.
Is only x86 affected?
> This solution introduces a deferred freeing mechanism for kernel page
> table pages, which provides a safe window to notify the IOMMU to
> invalidate its caches before the page is reused.
Thanks for working on this.
Can we expect any performance impact from this? Have any measurements
been performed?
Only [7/7] has a cc:stable, even though that patch is not at all
backportable. Please give some thought and suggestions regarding
whether you think we should backport this into earlier kernels.
If "yes" then the size and scope of the series looks problematic. Is
it possible to put together something simple and expedient just to plug
the hole in older kernels?
> arch/x86/Kconfig | 1 +
> mm/Kconfig | 3 ++
> include/asm-generic/pgalloc.h | 18 +++++++++
> include/linux/iommu.h | 4 ++
> include/linux/mm.h | 71 ++++++++++++++++++++++++++++++++---
> arch/x86/mm/init_64.c | 2 +-
> arch/x86/mm/pat/set_memory.c | 2 +-
> arch/x86/mm/pgtable.c | 12 +++---
> drivers/iommu/iommu-sva.c | 29 +++++++++++++-
> mm/pgtable-generic.c | 39 +++++++++++++++++++
> 10 files changed, 167 insertions(+), 14 deletions(-)
It isn't obvious which tree should carry this. Were you thinking the
x86 tree?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-15 0:43 ` [PATCH v6 0/7] " Andrew Morton
@ 2025-10-15 5:38 ` Baolu Lu
2025-10-15 15:55 ` Dave Hansen
0 siblings, 1 reply; 31+ messages in thread
From: Baolu Lu @ 2025-10-15 5:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox, iommu, security, x86, linux-mm, linux-kernel
On 10/15/25 08:43, Andrew Morton wrote:
> On Tue, 14 Oct 2025 21:04:30 +0800 Lu Baolu <baolu.lu@linux.intel.com> wrote:
>
>> This proposes a fix for a security vulnerability related to IOMMU Shared
>> Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
>> page table entries. When a kernel page table page is freed and
>> reallocated for another purpose, the IOMMU might still hold stale,
>> incorrect entries. This can be exploited to cause a use-after-free or
>> write-after-free condition, potentially leading to privilege escalation
>> or data corruption.
>
> Is only x86 affected?
RISC-V is potentially another one. The RISC-V IOMMU driver doesn't
support SVA yet, but I believe it will be there soon.
>
>> This solution introduces a deferred freeing mechanism for kernel page
>> table pages, which provides a safe window to notify the IOMMU to
>> invalidate its caches before the page is reused.
>
> Thanks for working on this.
>
> Can we expect any performance impact from this? Have any measurements
> been performed?
This change only defers page table page freeing, allows for batch-
freeing of page table pages, and notifies the IOMMU driver to invalidate
the related caches. It doesn't impose any overhead in any critical path;
therefore, I don't see any potential performance impact.
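For background, the mechanism is shaped roughly like the following (a
sketch with assumed names: the kpgtable_* identifiers and the use of
pt_list are illustrative, not the series' exact code):

static LIST_HEAD(kpgtable_free_list);
static DEFINE_SPINLOCK(kpgtable_free_lock);
static void kpgtable_work_func(struct work_struct *work);
static DECLARE_WORK(kpgtable_free_work, kpgtable_work_func);

/* Called where a kernel page-table page would otherwise be freed directly. */
void kpgtable_free_deferred(struct ptdesc *pt)
{
        unsigned long flags;

        spin_lock_irqsave(&kpgtable_free_lock, flags);
        list_add(&pt->pt_list, &kpgtable_free_list);
        spin_unlock_irqrestore(&kpgtable_free_lock, flags);

        schedule_work(&kpgtable_free_work);
}

static void kpgtable_work_func(struct work_struct *work)
{
        LIST_HEAD(batch);
        struct ptdesc *pt, *next;
        unsigned long flags;

        spin_lock_irqsave(&kpgtable_free_lock, flags);
        list_splice_init(&kpgtable_free_list, &batch);
        spin_unlock_irqrestore(&kpgtable_free_lock, flags);

        /*
         * Let SVA-capable IOMMUs invalidate any cached kernel translations
         * before these pages can be reallocated for another purpose.
         */

        list_for_each_entry_safe(pt, next, &batch, pt_list)
                pagetable_free(pt);
}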
>
> Only [7/7] has a cc:stable, even though that patch is not at all
> backportable. Please give some thought and suggestions regarding
> whether you think we should backport this into earlier kernels.
Yes. We should backport this series to stable kernels.
> If "yes" then the size and scope of the series looks problematic. Is
> it possible to put together something simple and expedient just to plug
> the hole in older kernels?
Squashing some patches is one way. But would it be workable to backport
this series manually? Say, could we send a pull request to the stable
mailing list after this series has landed?
>
>> arch/x86/Kconfig | 1 +
>> mm/Kconfig | 3 ++
>> include/asm-generic/pgalloc.h | 18 +++++++++
>> include/linux/iommu.h | 4 ++
>> include/linux/mm.h | 71 ++++++++++++++++++++++++++++++++---
>> arch/x86/mm/init_64.c | 2 +-
>> arch/x86/mm/pat/set_memory.c | 2 +-
>> arch/x86/mm/pgtable.c | 12 +++---
>> drivers/iommu/iommu-sva.c | 29 +++++++++++++-
>> mm/pgtable-generic.c | 39 +++++++++++++++++++
>> 10 files changed, 167 insertions(+), 14 deletions(-)
>
> It isn't obvious which tree should carry this. Were you thinking the
> x86 tree?
It could also go through the linux-mm tree?
Thanks,
baolu
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-15 5:38 ` Baolu Lu
@ 2025-10-15 15:55 ` Dave Hansen
2025-10-17 1:42 ` Baolu Lu
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2025-10-15 15:55 UTC (permalink / raw)
To: Baolu Lu, Andrew Morton
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel
On 10/14/25 22:38, Baolu Lu wrote:
> On 10/15/25 08:43, Andrew Morton wrote:
>>> This solution introduces a deferred freeing mechanism for kernel page
>>> table pages, which provides a safe window to notify the IOMMU to
>>> invalidate its caches before the page is reused.
>>
>> Thanks for working on this.
>>
>> Can we expect any performance impact from this? Have any measurements
>> been performed?
>
> This change only defers page table page freeing, allows for batch-
> freeing of page table pages, and notifies the IOMMU driver to invalidate
> the related caches. It doesn't impose any overhead in any critical path;
> therefore, I don't see any potential performance impact.
Thankfully, freeing kernel page tables just isn't a common operation.
It's also done right next to a big fat flush_tlb_kernel_range() which is
going to IPI the whole world. So I'd expect this new gunk to be in the
noise behind all those IPIs.
We should double check that 0day has had a go at this series and hasn't
found anything. But other than that, I don't think we need to do any
specific performance testing on it.
>> Only [7/7] has a cc:stable, even though that patch is not at all
>> backportable. Please give some thought and suggestions regarding
>> whether you think we should backport this into earlier kernels.
>
> Yes. We should backport this series to stable kernels.
>
>> If "yes" then the size and scope of the series looks problematic. Is
>> it possible to put together something simple and expedient just to plug
>> the hole in older kernels?
>
> Squashing some patches is one way. But would it be workable to backport
> this series manually? Say, could we send a pull request to the stable
> mailing list after this series has landed?
I honestly think we should just disable SVA in old kernels at compile
time, or at least default it to be disabled at runtime. That's the
simplest thing.
The other alternative is to have arch_vmap_pmd_supported() return false
when SVA is active, or maybe when it's supported on the platform.
Either of those are 10-ish lines of code and easy to backport.
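For the second option, a sketch would be something like the one below,
where sva_in_use() stands in for whatever "SVA is active/supported"
predicate the IOMMU code could export (it is an assumption, not an
existing helper); the PUD variant would need the same treatment:

bool arch_vmap_pmd_supported(pgprot_t prot)
{
        /*
         * Refusing huge vmap PMDs means vmap_try_huge_pmd() never frees
         * kernel PTE tables, which closes the stale-IOTLB window while
         * an SVA-capable IOMMU may be caching kernel translations.
         */
        if (sva_in_use())
                return false;

        return boot_cpu_has(X86_FEATURE_PSE);
}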
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-15 15:55 ` Dave Hansen
@ 2025-10-17 1:42 ` Baolu Lu
2025-10-17 14:01 ` Jason Gunthorpe
0 siblings, 1 reply; 31+ messages in thread
From: Baolu Lu @ 2025-10-17 1:42 UTC (permalink / raw)
To: Dave Hansen, Andrew Morton
Cc: Joerg Roedel, Will Deacon, Robin Murphy, Kevin Tian,
Jason Gunthorpe, Jann Horn, Vasant Hegde, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel
On 10/15/25 23:55, Dave Hansen wrote:
>>> Only [7/7] has a cc:stable, even though that patch is not at all
>>> backportable. Please give some thought and suggestions regarding
>>> whether you think we should backport this into earlier kernels.
>> Yes. We should backport this series to stable kernels.
>>
>>> If "yes" then the size and scope of the series looks problematic. Is
>>> it possible to put together something simple and expedient just to plug
>>> the hole in older kernels?
>> Squashing some patches is one way. But would it be workable to backport
>> this series manually? Say, could we send a pull request to the stable
>> mailing list after this series has landed?
> I honestly think we should just disable SVA in old kernels at compile
> time, or at least default it to be disabled at runtime. That's the
> simplest thing.
>
> The other alternative is to have arch_vmap_pmd_supported() return false
> when SVA is active, or maybe when it's supported on the platform.
>
> Either of those are 10-ish lines of code and easy to backport.
Hi iommu folks, any insights on this?
Thanks,
baolu
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 1:42 ` Baolu Lu
@ 2025-10-17 14:01 ` Jason Gunthorpe
2025-10-17 17:28 ` Dave Hansen
0 siblings, 1 reply; 31+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 14:01 UTC (permalink / raw)
To: Baolu Lu
Cc: Dave Hansen, Andrew Morton, Joerg Roedel, Will Deacon,
Robin Murphy, Kevin Tian, Jann Horn, Vasant Hegde,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Alistair Popple,
Peter Zijlstra, Uladzislau Rezki, Jean-Philippe Brucker,
Andy Lutomirski, Yi Lai, David Hildenbrand, Lorenzo Stoakes,
Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
Matthew Wilcox, iommu, security, x86, linux-mm, linux-kernel
On Fri, Oct 17, 2025 at 09:42:10AM +0800, Baolu Lu wrote:
> On 10/15/25 23:55, Dave Hansen wrote:
> > > > Only [7/7] has a cc:stable, even though that patch is not at all
> > > > backportable. Please give some thought and suggestions regarding
> > > > whether you think we should backport this into earlier kernels.
> > > Yes. We should backport this series to stable kernels.
> > >
> > > > If "yes" then the size and scope of the series looks problematic. Is
> > > > it possible to put together something simple and expedient just to plug
> > > > the hole in older kernels?
> > > Squashing some patches is one way. But would it be workable to backport
> > > this series manually? Say, could we send a pull request to the stable
> > > mailing list after this series has landed?
> > I honestly think we should just disable SVA in old kernels at compile
> > time, or at least default it to be disabled at runtime. That's the
> > simplest thing.
> >
> > The other alternative is to have arch_vmap_pmd_supported() return false
> > when SVA is active, or maybe when it's supported on the platform.
> >
> > Either of those are 10-ish lines of code and easy to backport.
>
> Hi iommu folks, any insights on this?
IDK, the only SVA user on x86 I know is IDXD, so if you do the above
plan you break IDXD in all stable kernels. Doesn't sound OK?
Jason
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 14:01 ` Jason Gunthorpe
@ 2025-10-17 17:28 ` Dave Hansen
2025-10-17 17:31 ` Dave Hansen
2025-10-17 18:26 ` Vinicius Costa Gomes
0 siblings, 2 replies; 31+ messages in thread
From: Dave Hansen @ 2025-10-17 17:28 UTC (permalink / raw)
To: Jason Gunthorpe, Baolu Lu
Cc: Andrew Morton, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel, Jiang, Dave,
Vinicius Costa Gomes
[-- Attachment #1: Type: text/plain, Size: 745 bytes --]
On 10/17/25 07:01, Jason Gunthorpe wrote:
>>> The other alternative is to have arch_vmap_pmd_supported() return false
>>> when SVA is active, or maybe when it's supported on the platform.
>>>
>>> Either of those are 10-ish lines of code and easy to backport.
>> Hi iommu folks, any insights on this?
> IDK, the only SVA user on x86 I know is IDXD, so if you do the above
> plan you break IDXD in all stable kernels. Doesn't sound OK?
Vinicius, any thoughts on this?
I'm thinking that even messing with arch_vmap_pmd_supported() would be
suboptimal. The easiest thing is to just stick the attached patch in
stable kernels and disable SVA at compile time.
There just aren't enough SVA users out in the wild to justify more
complexity than this.
[-- Attachment #2: svm.patch --]
[-- Type: text/x-patch, Size: 2868 bytes --]
diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index c9103a6fa06e..0b0e0283994f 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -124,7 +124,8 @@ bool emulate_vsyscall(unsigned long error_code,
if ((error_code & (X86_PF_WRITE | X86_PF_USER)) != X86_PF_USER)
return false;
- if (!(error_code & X86_PF_INSTR)) {
+ /* Avoid emulation unless userspace was executing from vsyscall page: */
+ if (address != regs->ip) {
/* Failed vsyscall read */
if (vsyscall_mode == EMULATE)
return false;
@@ -136,13 +137,16 @@ bool emulate_vsyscall(unsigned long error_code,
return false;
}
+
+ /* X86_PF_INSTR is only set when NX is supported: */
+ if (cpu_feature_enabled(X86_FEATURE_NX))
+ WARN_ON_ONCE(!(error_code & X86_PF_INSTR));
+
/*
* No point in checking CS -- the only way to get here is a user mode
* trap to a high address, which means that we're in 64-bit user code.
*/
- WARN_ON_ONCE(address != regs->ip);
-
if (vsyscall_mode == NONE) {
warn_bad_vsyscall(KERN_INFO, regs,
"vsyscall attempted with vsyscall=none");
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..e3ce9b0b2447 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -665,6 +665,7 @@ static unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
static void cond_mitigation(struct task_struct *next)
{
unsigned long prev_mm, next_mm;
+ bool userspace_needs_ibpb = false;
if (!next || !next->mm)
return;
@@ -722,7 +723,7 @@ static void cond_mitigation(struct task_struct *next)
*/
if (next_mm != prev_mm &&
(next_mm | prev_mm) & LAST_USER_MM_IBPB)
- indirect_branch_prediction_barrier();
+ userspace_needs_ibpb = true;
}
if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -732,9 +733,11 @@ static void cond_mitigation(struct task_struct *next)
* last on this CPU.
*/
if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) != (unsigned long)next->mm)
- indirect_branch_prediction_barrier();
+ userspace_needs_ibpb = true;
}
+ this_cpu_write(x86_ibpb_exit_to_user, userspace_needs_ibpb);
+
if (static_branch_unlikely(&switch_mm_cond_l1d_flush)) {
/*
* Flush L1D when the outgoing task requested it and/or
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index f2f538c70650..a5d66bfd9e50 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -48,7 +48,10 @@ config INTEL_IOMMU_DEBUGFS
config INTEL_IOMMU_SVM
bool "Support for Shared Virtual Memory with Intel IOMMU"
- depends on X86_64
+ # The kernel does not invalidate IOTLB entries when freeing
+ # kernel page tables. This can lead to IOMMUs walking (and
+ # writing to) CPU page tables after they are freed.
+ depends on BROKEN
select MMU_NOTIFIER
select IOMMU_SVA
help
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 17:28 ` Dave Hansen
@ 2025-10-17 17:31 ` Dave Hansen
2025-10-17 17:54 ` Jason Gunthorpe
2025-10-17 18:26 ` Vinicius Costa Gomes
1 sibling, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2025-10-17 17:31 UTC (permalink / raw)
To: Jason Gunthorpe, Baolu Lu
Cc: Andrew Morton, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel, Jiang, Dave,
Vinicius Costa Gomes
On 10/17/25 10:28, Dave Hansen wrote:
> I'm thinking that even messing with arch_vmap_pmd_supported() would be
> suboptimal. The easiest thing is to just stick the attached patch in
> stable kernels and disable SVA at compile time.
Gah, please just ignore the hunks in that patch other than the
drivers/iommu/intel/Kconfig one.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 17:31 ` Dave Hansen
@ 2025-10-17 17:54 ` Jason Gunthorpe
0 siblings, 0 replies; 31+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 17:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Baolu Lu, Andrew Morton, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel, Jiang, Dave,
Vinicius Costa Gomes
On Fri, Oct 17, 2025 at 10:31:54AM -0700, Dave Hansen wrote:
> On 10/17/25 10:28, Dave Hansen wrote:
> > I'm thinking that even messing with arch_vmap_pmd_supported() would be
> > suboptimal. The easiest thing is to just stick the attached patch in
> > stable kernels and disable SVA at compile time.
>
> Gah, please just ignore the hunks in that patch other than the
> drivers/iommu/intel/Kconfig one.
The AMD driver has to be disabled too and there is no kconfig for it.
I think it would be simpler to just patch iommu_sva_bind_device()
with like:
if (IS_ENABLED(CONFIG_X86))
return ERR_PTR(-EOPNOTSUPP);
Jason
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 17:28 ` Dave Hansen
2025-10-17 17:31 ` Dave Hansen
@ 2025-10-17 18:26 ` Vinicius Costa Gomes
2025-10-22 5:06 ` Baolu Lu
1 sibling, 1 reply; 31+ messages in thread
From: Vinicius Costa Gomes @ 2025-10-17 18:26 UTC (permalink / raw)
To: Dave Hansen, Jason Gunthorpe, Baolu Lu
Cc: Andrew Morton, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel, Jiang, Dave
Dave Hansen <dave.hansen@intel.com> writes:
> On 10/17/25 07:01, Jason Gunthorpe wrote:
>>>> The other alternative is to have arch_vmap_pmd_supported() return false
>>>> when SVA is active, or maybe when it's supported on the platform.
>>>>
>>>> Either of those are 10-ish lines of code and easy to backport.
>>> Hi iommu folks, any insights on this?
>> IDK, the only SVA user on x86 I know is IDXD, so if you do the above
>> plan you break IDXD in all stable kernels. Doesn't sound OK?
>
> Vinicius, any thoughts on this?
>
This won't break IDXD entirely; it would make it impossible for users to
create shared DSA/IAA workqueues (which are the nicer ones to use), and it
will cause the driver to print some unhappy messages in the kernel logs.
The in-kernel users of IDXD (iaa_crypto for zswap, for example) will
continue to work.
In short, I am not happy, but I think it's workable; even better if there
are alternatives in case people complain.
> I'm thinking that even messing with arch_vmap_pmd_supported() would be
> suboptimal. The easiest thing is to just stick the attached patch in
> stable kernels and disable SVA at compile time.
>
> There just aren't enough SVA users out in the wild to justify more
> complexity than this.
> diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
> index c9103a6fa06e..0b0e0283994f 100644
> --- a/arch/x86/entry/vsyscall/vsyscall_64.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_64.c
> @@ -124,7 +124,8 @@ bool emulate_vsyscall(unsigned long error_code,
> if ((error_code & (X86_PF_WRITE | X86_PF_USER)) != X86_PF_USER)
> return false;
>
> - if (!(error_code & X86_PF_INSTR)) {
> + /* Avoid emulation unless userspace was executing from vsyscall page: */
> + if (address != regs->ip) {
> /* Failed vsyscall read */
> if (vsyscall_mode == EMULATE)
> return false;
> @@ -136,13 +137,16 @@ bool emulate_vsyscall(unsigned long error_code,
> return false;
> }
>
> +
> + /* X86_PF_INSTR is only set when NX is supported: */
> + if (cpu_feature_enabled(X86_FEATURE_NX))
> + WARN_ON_ONCE(!(error_code & X86_PF_INSTR));
> +
> /*
> * No point in checking CS -- the only way to get here is a user mode
> * trap to a high address, which means that we're in 64-bit user code.
> */
>
> - WARN_ON_ONCE(address != regs->ip);
> -
> if (vsyscall_mode == NONE) {
> warn_bad_vsyscall(KERN_INFO, regs,
> "vsyscall attempted with vsyscall=none");
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 39f80111e6f1..e3ce9b0b2447 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -665,6 +665,7 @@ static unsigned long mm_mangle_tif_spec_bits(struct task_struct *next)
> static void cond_mitigation(struct task_struct *next)
> {
> unsigned long prev_mm, next_mm;
> + bool userspace_needs_ibpb = false;
>
> if (!next || !next->mm)
> return;
> @@ -722,7 +723,7 @@ static void cond_mitigation(struct task_struct *next)
> */
> if (next_mm != prev_mm &&
> (next_mm | prev_mm) & LAST_USER_MM_IBPB)
> - indirect_branch_prediction_barrier();
> + userspace_needs_ibpb = true;
> }
>
> if (static_branch_unlikely(&switch_mm_always_ibpb)) {
> @@ -732,9 +733,11 @@ static void cond_mitigation(struct task_struct *next)
> * last on this CPU.
> */
> if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) != (unsigned long)next->mm)
> - indirect_branch_prediction_barrier();
> + userspace_needs_ibpb = true;
> }
>
> + this_cpu_write(x86_ibpb_exit_to_user, userspace_needs_ibpb);
> +
> if (static_branch_unlikely(&switch_mm_cond_l1d_flush)) {
> /*
> * Flush L1D when the outgoing task requested it and/or
> diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
> index f2f538c70650..a5d66bfd9e50 100644
> --- a/drivers/iommu/intel/Kconfig
> +++ b/drivers/iommu/intel/Kconfig
> @@ -48,7 +48,10 @@ config INTEL_IOMMU_DEBUGFS
>
> config INTEL_IOMMU_SVM
> bool "Support for Shared Virtual Memory with Intel IOMMU"
> - depends on X86_64
> + # The kernel does not invalidate IOTLB entries when freeing
> + # kernel page tables. This can lead to IOMMUs walking (and
> + # writing to) CPU page tables after they are freed.
> + depends on BROKEN
> select MMU_NOTIFIER
> select IOMMU_SVA
> help
--
Vinicius
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 0/7] Fix stale IOTLB entries for kernel address space
2025-10-17 18:26 ` Vinicius Costa Gomes
@ 2025-10-22 5:06 ` Baolu Lu
0 siblings, 0 replies; 31+ messages in thread
From: Baolu Lu @ 2025-10-22 5:06 UTC (permalink / raw)
To: Vinicius Costa Gomes, Dave Hansen, Jason Gunthorpe
Cc: Andrew Morton, Joerg Roedel, Will Deacon, Robin Murphy,
Kevin Tian, Jann Horn, Vasant Hegde, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Alistair Popple, Peter Zijlstra,
Uladzislau Rezki, Jean-Philippe Brucker, Andy Lutomirski, Yi Lai,
David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Michal Hocko, Matthew Wilcox,
iommu, security, x86, linux-mm, linux-kernel, Jiang, Dave
On 10/18/25 02:26, Vinicius Costa Gomes wrote:
> Dave Hansen<dave.hansen@intel.com> writes:
>
>> On 10/17/25 07:01, Jason Gunthorpe wrote:
>>>>> The other alternative is to have arch_vmap_pmd_supported() return false
>>>>> when SVA is active, or maybe when it's supported on the platform.
>>>>>
>>>>> Either of those are 10-ish lines of code and easy to backport.
>>>> Hi iommu folks, any insights on this?
>>> IDK, the only SVA user on x86 I know is IDXD, so if you do the above
>>> plan you break IDXD in all stable kernels. Doesn't sound OK?
>> Vinicius, any thoughts on this?
>>
> This won't break IDXD exactly/totally, it would cause it to be
> impossible for users to create shared DSA/IAA workqueues (which are the
> nicer ones to use), and it will cause the driver to print some not happy
> messages in the kernel logs. The in-kernel users of IDXD (iaa_crypto for
> zswap, for example) will continue to work.
>
> In short, I am not happy, but I think it's workable, even better if
> there are alternatives in case people complain.
Okay, so I will add an extra patch to disable SVA on x86 and re-enable it
once the kernel page table free callback is in place.
Thanks,
baolu
^ permalink raw reply [flat|nested] 31+ messages in thread