* [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables
@ 2025-07-09 13:16 Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Harry Yoo @ 2025-07-09 13:16 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
Harry Yoo
Hi all,
During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount
of persistent memory:
BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
<TASK>
__init_zone_device_page+0x17/0x5d
memmap_init_zone_device+0x154/0x1bb
pagemap_range+0x2e0/0x40f
memremap_pages+0x10b/0x2f0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0xce/0x2ec [device_dax]
dax_bus_probe+0x6d/0xc9
[... snip ...]
</TASK>
It turns out that the kernel panics while initializing vmemmap
(struct page array) when the vmemmap region spans two PGD entries,
because the new PGD entry is only installed in init_mm.pgd,
but not in the page tables of other tasks.
And looking at __populate_section_memmap():
if (vmemmap_can_optimize(altmap, pgmap))
// does not sync top level page tables
r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
// sync top level page tables in x86
r = vmemmap_populate(start, end, nid, altmap);
In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (See commit 9b861528a801
("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
changes")) so that all tasks in the system can see the new vmemmap area.
However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.
We're not the first to encounter a crash caused by unsynchronized
top-level page tables: earlier this year, Gwan-gyeong Mun attempted to
address the issue [1] [2] after hitting a kernel panic when x86 code
accessed the vmemmap area before the corresponding top-level entries
were synced. At the time, the issue was believed to be triggered
only when struct page was enlarged for debugging purposes, and the patch
did not receive further updates.
It turns out that the current approach of relying on each architecture
to handle the page table sync manually is fragile because 1) it's easy
to forget to sync the top-level page table, and 2) it's also easy to
overlook that the kernel must not access the vmemmap / direct mapping
area before the sync.
To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating kernel portion
of the page tables and allow each architecture to explicitly perform
synchronization when installing top-level entries. With this approach,
we no longer need to worry about missing the sync step, reducing the risk
of future regressions.
This patch series implements Dave Hansen's suggestion, hence the
Suggested-by: Dave Hansen tag.
This is an RFC primarily because it involves non-trivial changes, and
I would like to fix this in a way that aligns with community consensus.
Cc'ing stable because the lack of this series opens the door to
intermittent boot failures.
[1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com
[2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com
[3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com
[4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com
Harry Yoo (3):
mm: introduce and use {pgd,p4d}_populate_kernel()
x86/mm: define p*d_populate_kernel() and top level page table sync
x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant
arch/x86/include/asm/pgalloc.h | 41 +++++++++
arch/x86/mm/init_64.c | 147 ++++++++++++++++-----------------
arch/x86/mm/kasan_init_64.c | 8 +-
include/asm-generic/pgalloc.h | 4 +
include/linux/pgalloc.h | 0
mm/kasan/init.c | 10 +--
mm/percpu.c | 4 +-
mm/sparse-vmemmap.c | 4 +-
8 files changed, 130 insertions(+), 88 deletions(-)
create mode 100644 include/linux/pgalloc.h
--
2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-09 13:16 [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
@ 2025-07-09 13:16 ` Harry Yoo
2025-07-11 16:18 ` David Hildenbrand
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync Harry Yoo
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-09 13:16 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
Harry Yoo, stable
Introduce and use {pgd,p4d}_populate_kernel() in core MM code when
populating PGD and P4D entries corresponding to the kernel address
space. The main purpose of these helpers is to ensure synchronization of
the kernel portion of the top-level page tables whenever such an entry
is populated.
Until now, the kernel has relied on each architecture to handle
synchronization of top-level page tables in an ad-hoc manner.
For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for
direct mapping and vmemmap mapping changes").
However, this approach has proven fragile, as it's easy to forget to
perform the necessary synchronization when introducing new changes.
To address this, introduce _kernel() variants of the page table
population helpers that invoke architecture-specific hooks to properly
synchronize the page tables.
For now, it only targets x86_64, so only PGD and P4D level helpers are
introduced. In theory, PUD and PMD level helpers can be added later if
needed by other architectures.
Currently these are no-ops, as no arch defines __HAVE_ARCH_SYNC_KERNEL_PGTABLE.
Cc: <stable@vger.kernel.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
include/asm-generic/pgalloc.h | 4 ++++
include/linux/pgalloc.h | 0
mm/kasan/init.c | 10 +++++-----
mm/percpu.c | 4 ++--
mm/sparse-vmemmap.c | 4 ++--
5 files changed, 13 insertions(+), 9 deletions(-)
create mode 100644 include/linux/pgalloc.h
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 3c8ec3bfea44..6cac1ce64e01 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -295,6 +295,10 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
__pgd_free(mm, pgd);
}
#endif
+#ifndef __HAVE_ARCH_SYNC_KERNEL_PGTABLE
+#define pgd_populate_kernel(addr, pgd, p4d) pgd_populate(&init_mm, pgd, p4d)
+#define p4d_populate_kernel(addr, p4d, pud) p4d_populate(&init_mm, p4d, pud)
+#endif
#endif /* CONFIG_MMU */
diff --git a/include/linux/pgalloc.h b/include/linux/pgalloc.h
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ced6b29fcf76..43de820ee282 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -191,7 +191,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
pud_t *pud;
pmd_t *pmd;
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -212,7 +212,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
} else {
p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
}
zero_pud_populate(p4d, addr, next);
@@ -251,10 +251,10 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
* puds,pmds, so pgd_populate(), pud_populate()
* is noops.
*/
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
lm_alias(kasan_early_shadow_p4d));
p4d = p4d_offset(pgd, addr);
- p4d_populate(&init_mm, p4d,
+ p4d_populate_kernel(addr, p4d,
lm_alias(kasan_early_shadow_pud));
pud = pud_offset(p4d, addr);
pud_populate(&init_mm, pud,
@@ -273,7 +273,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
if (!p)
return -ENOMEM;
} else {
- pgd_populate(&init_mm, pgd,
+ pgd_populate_kernel(addr, pgd,
early_alloc(PAGE_SIZE, NUMA_NO_NODE));
}
}
diff --git a/mm/percpu.c b/mm/percpu.c
index 782cc148b39c..57450a03c432 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3134,13 +3134,13 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
if (pgd_none(*pgd)) {
p4d = memblock_alloc_or_panic(P4D_TABLE_SIZE, P4D_TABLE_SIZE);
- pgd_populate(&init_mm, pgd, p4d);
+ pgd_populate_kernel(addr, pgd, p4d);
}
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d)) {
pud = memblock_alloc_or_panic(PUD_TABLE_SIZE, PUD_TABLE_SIZE);
- p4d_populate(&init_mm, p4d, pud);
+ p4d_populate_kernel(addr, p4d, pud);
}
pud = pud_offset(p4d, addr);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index fd2ab5118e13..e275310ac708 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -229,7 +229,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
if (!p)
return NULL;
pud_init(p);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
return p4d;
}
@@ -241,7 +241,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
if (!p)
return NULL;
- pgd_populate(&init_mm, pgd, p);
+ pgd_populate_kernel(addr, pgd, p);
}
return pgd;
}
--
2.43.0
* [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync
2025-07-09 13:16 [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
@ 2025-07-09 13:16 ` Harry Yoo
2025-07-09 21:13 ` Andrew Morton
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 3/3] x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant Harry Yoo
2025-07-09 13:24 ` [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
3 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-09 13:16 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
Harry Yoo, stable
During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount
of persistent memory:
BUG: unable to handle page fault for address: ffffe70000000034
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
RIP: 0010:__init_single_page+0x9/0x6d
Call Trace:
<TASK>
__init_zone_device_page+0x17/0x5d
memmap_init_zone_device+0x154/0x1bb
pagemap_range+0x2e0/0x40f
memremap_pages+0x10b/0x2f0
devm_memremap_pages+0x1e/0x60
dev_dax_probe+0xce/0x2ec [device_dax]
dax_bus_probe+0x6d/0xc9
[... snip ...]
</TASK>
It turns out that the kernel panics while initializing vmemmap
(struct page array) when the vmemmap region spans two PGD entries,
because the new PGD entry is only installed in init_mm.pgd,
but not in the page tables of other tasks.
And looking at __populate_section_memmap():
if (vmemmap_can_optimize(altmap, pgmap))
// does not sync top level page tables
r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
else
// sync top level page tables in x86
r = vmemmap_populate(start, end, nid, altmap);
In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top level page table (See commit 9b861528a801
("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
changes")) so that all tasks in the system can see the new vmemmap area.
However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.
It turns out that the current approach of relying on each architecture
to handle the page table sync manually is fragile because 1) it's easy
to forget to sync the top-level page table, and 2) it's also easy to
overlook that the kernel must not access the vmemmap / direct mapping
area before the sync.
As suggested by Dave Hansen, define x86_64 versions of
{pgd,p4d}_populate_kernel() and arch_sync_kernel_pagetables(), and
explicitly perform top-level page table synchronization in
{pgd,p4d}_populate_kernel(). Top-level page tables are synchronized in
pgd_populate_kernel() for 5-level paging and in p4d_populate_kernel()
for 4-level paging.
arch_sync_kernel_pagetables(addr) synchronizes the top-level page table
entry for the given address. It calls sync_kernel_pagetables_{l4,l5}()
depending on the number of paging levels and installs the entry in all
page tables in the system, making it visible to all tasks.
Note that sync_kernel_pagetables_{l4,l5}() are simply variants of
sync_global_pgds_{l4,l5}() that synchronize only the single page table
entry for the specified address, instead of all entries corresponding
to a range. No functional difference is intended between
sync_global_pgds_*() and sync_kernel_pagetables_*() other than that.
This also fixes a crash in vmemmap_set_pmd() caused by accessing vmemmap
before sync_global_pgds() [1]:
BUG: unable to handle page fault for address: ffffeb3ff1200000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
Tainted: [W]=WARN
RIP: 0010:vmemmap_set_pmd+0xff/0x230
<TASK>
vmemmap_populate_hugepages+0x176/0x180
vmemmap_populate+0x34/0x80
__populate_section_memmap+0x41/0x90
sparse_add_section+0x121/0x3e0
__add_pages+0xba/0x150
add_pages+0x1d/0x70
memremap_pages+0x3dc/0x810
devm_memremap_pages+0x1c/0x60
xe_devm_add+0x8b/0x100 [xe]
xe_tile_init_noalloc+0x6a/0x70 [xe]
xe_device_probe+0x48c/0x740 [xe]
[... snip ...]
Cc: <stable@vger.kernel.org>
Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs")
Closes: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [1]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
arch/x86/include/asm/pgalloc.h | 22 ++++++++++
arch/x86/mm/init_64.c | 80 ++++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+)
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index c88691b15f3c..d66f2db54b16 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -10,6 +10,7 @@
#define __HAVE_ARCH_PTE_ALLOC_ONE
#define __HAVE_ARCH_PGD_FREE
+#define __HAVE_ARCH_SYNC_KERNEL_PGTABLE
#include <asm-generic/pgalloc.h>
static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
@@ -114,6 +115,17 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
}
+void arch_sync_kernel_pagetables(unsigned long addr);
+
+static inline void p4d_populate_kernel(unsigned long addr,
+ p4d_t *p4d, pud_t *pud)
+{
+ paravirt_alloc_pud(&init_mm, __pa(pud) >> PAGE_SHIFT);
+ set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+ if (!pgtable_l5_enabled())
+ arch_sync_kernel_pagetables(addr);
+}
+
static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
{
paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
@@ -137,6 +149,16 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
}
+static inline void pgd_populate_kernel(unsigned long addr,
+ pgd_t *pgd, p4d_t *p4d)
+{
+ if (!pgtable_l5_enabled())
+ return;
+ paravirt_alloc_p4d(&init_mm, __pa(p4d) >> PAGE_SHIFT);
+ set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+ arch_sync_kernel_pagetables(addr);
+}
+
static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
{
if (!pgtable_l5_enabled())
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fdb6cab524f0..cbddbef434d5 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -223,6 +223,86 @@ static void sync_global_pgds(unsigned long start, unsigned long end)
sync_global_pgds_l4(start, end);
}
+static void sync_kernel_pagetables_l4(unsigned long addr)
+{
+ pgd_t *pgd_ref = pgd_offset_k(addr);
+ const p4d_t *p4d_ref;
+ struct page *page;
+
+ VM_WARN_ON_ONCE(pgtable_l5_enabled());
+ /*
+ * With folded p4d, pgd_none() is always false, we need to
+ * handle synchronization on p4d level.
+ */
+ MAYBE_BUILD_BUG_ON(pgd_none(*pgd_ref));
+ p4d_ref = p4d_offset(pgd_ref, addr);
+
+ if (p4d_none(*p4d_ref))
+ return;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ p4d_t *p4d;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+ p4d = p4d_offset(pgd, addr);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
+ BUG_ON(p4d_pgtable(*p4d)
+ != p4d_pgtable(*p4d_ref));
+
+ if (p4d_none(*p4d))
+ set_p4d(p4d, *p4d_ref);
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+}
+
+static void sync_kernel_pagetables_l5(unsigned long addr)
+{
+ const pgd_t *pgd_ref = pgd_offset_k(addr);
+ struct page *page;
+
+ VM_WARN_ON_ONCE(!pgtable_l5_enabled());
+
+ if (pgd_none(*pgd_ref))
+ return;
+
+ spin_lock(&pgd_lock);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ spinlock_t *pgt_lock;
+
+ pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+ /* the pgt_lock only for Xen */
+ pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
+ spin_lock(pgt_lock);
+
+ if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
+
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+
+ spin_unlock(pgt_lock);
+ }
+ spin_unlock(&pgd_lock);
+}
+
+void arch_sync_kernel_pagetables(unsigned long addr)
+{
+ if (pgtable_l5_enabled())
+ sync_kernel_pagetables_l5(addr);
+ else
+ sync_kernel_pagetables_l4(addr);
+}
+
/*
* NOTE: This function is marked __ref because it calls __init function
* (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
--
2.43.0
* [RFC V1 PATCH mm-hotfixes 3/3] x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant
2025-07-09 13:16 [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync Harry Yoo
@ 2025-07-09 13:16 ` Harry Yoo
2025-07-09 13:24 ` [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
3 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-07-09 13:16 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
Harry Yoo, stable
Introduce {pgd,p4d}_populate_kernel_safe() and convert
{pgd,p4d}_populate{,_init}() to {pgd,p4d}_populate_kernel{,_init}().
By converting them, we no longer need to worry about forgetting to
synchronize top level page tables.
With all {pgd,p4d}_populate{,_init}() converted to
{pgd,p4d}_populate_kernel{,_init}(), it is now safe to drop
sync_global_pgds(). Let's remove it.
Cc: <stable@vger.kernel.org>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
arch/x86/include/asm/pgalloc.h | 19 +++++
arch/x86/mm/init_64.c | 129 ++++++---------------------------
arch/x86/mm/kasan_init_64.c | 8 +-
3 files changed, 46 insertions(+), 110 deletions(-)
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index d66f2db54b16..98439b9ca293 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -132,6 +132,15 @@ static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pu
set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
}
+static inline void p4d_populate_kernel_safe(unsigned long addr,
+ p4d_t *p4d, pud_t *pud)
+{
+ paravirt_alloc_pud(&init_mm, __pa(pud) >> PAGE_SHIFT);
+ set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+ if (!pgtable_l5_enabled())
+ arch_sync_kernel_pagetables(addr);
+}
+
extern void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud);
static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
@@ -167,6 +176,16 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4
set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
}
+static inline void pgd_populate_kernel_safe(unsigned long addr,
+ pgd_t *pgd, p4d_t *p4d)
+{
+ if (!pgtable_l5_enabled())
+ return;
+ paravirt_alloc_p4d(&init_mm, __pa(p4d) >> PAGE_SHIFT);
+ set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+ arch_sync_kernel_pagetables(addr);
+}
+
extern void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d);
static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index cbddbef434d5..00608ab36936 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -75,6 +75,19 @@ DEFINE_POPULATE(pgd_populate, pgd, p4d, init)
DEFINE_POPULATE(pud_populate, pud, pmd, init)
DEFINE_POPULATE(pmd_populate_kernel, pmd, pte, init)
+#define DEFINE_POPULATE_KERNEL(fname, type1, type2, init) \
+static inline void fname##_init(unsigned long addr, \
+ type1##_t *arg1, type2##_t *arg2, bool init) \
+{ \
+ if (init) \
+ fname##_safe(addr, arg1, arg2); \
+ else \
+ fname(addr, arg1, arg2); \
+}
+
+DEFINE_POPULATE_KERNEL(pgd_populate_kernel, pgd, p4d, init)
+DEFINE_POPULATE_KERNEL(p4d_populate_kernel, p4d, pud, init)
+
#define DEFINE_ENTRY(type1, type2, init) \
static inline void set_##type1##_init(type1##_t *arg1, \
type2##_t arg2, bool init) \
@@ -130,99 +143,6 @@ static int __init nonx32_setup(char *str)
}
__setup("noexec32=", nonx32_setup);
-static void sync_global_pgds_l5(unsigned long start, unsigned long end)
-{
- unsigned long addr;
-
- for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
- const pgd_t *pgd_ref = pgd_offset_k(addr);
- struct page *page;
-
- /* Check for overflow */
- if (addr < start)
- break;
-
- if (pgd_none(*pgd_ref))
- continue;
-
- spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
- pgd_t *pgd;
- spinlock_t *pgt_lock;
-
- pgd = (pgd_t *)page_address(page) + pgd_index(addr);
- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
- spin_lock(pgt_lock);
-
- if (!pgd_none(*pgd_ref) && !pgd_none(*pgd))
- BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
-
- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
-
- spin_unlock(pgt_lock);
- }
- spin_unlock(&pgd_lock);
- }
-}
-
-static void sync_global_pgds_l4(unsigned long start, unsigned long end)
-{
- unsigned long addr;
-
- for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
- pgd_t *pgd_ref = pgd_offset_k(addr);
- const p4d_t *p4d_ref;
- struct page *page;
-
- /*
- * With folded p4d, pgd_none() is always false, we need to
- * handle synchronization on p4d level.
- */
- MAYBE_BUILD_BUG_ON(pgd_none(*pgd_ref));
- p4d_ref = p4d_offset(pgd_ref, addr);
-
- if (p4d_none(*p4d_ref))
- continue;
-
- spin_lock(&pgd_lock);
- list_for_each_entry(page, &pgd_list, lru) {
- pgd_t *pgd;
- p4d_t *p4d;
- spinlock_t *pgt_lock;
-
- pgd = (pgd_t *)page_address(page) + pgd_index(addr);
- p4d = p4d_offset(pgd, addr);
- /* the pgt_lock only for Xen */
- pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
- spin_lock(pgt_lock);
-
- if (!p4d_none(*p4d_ref) && !p4d_none(*p4d))
- BUG_ON(p4d_pgtable(*p4d)
- != p4d_pgtable(*p4d_ref));
-
- if (p4d_none(*p4d))
- set_p4d(p4d, *p4d_ref);
-
- spin_unlock(pgt_lock);
- }
- spin_unlock(&pgd_lock);
- }
-}
-
-/*
- * When memory was added make sure all the processes MM have
- * suitable PGD entries in the local PGD level page.
- */
-static void sync_global_pgds(unsigned long start, unsigned long end)
-{
- if (pgtable_l5_enabled())
- sync_global_pgds_l5(start, end);
- else
- sync_global_pgds_l4(start, end);
-}
-
static void sync_kernel_pagetables_l4(unsigned long addr)
{
pgd_t *pgd_ref = pgd_offset_k(addr);
@@ -295,6 +215,10 @@ static void sync_kernel_pagetables_l5(unsigned long addr)
spin_unlock(&pgd_lock);
}
+/*
+ * When memory was added make sure all the processes MM have
+ * suitable PGD entries in the local PGD level page.
+ */
void arch_sync_kernel_pagetables(unsigned long addr)
{
if (pgtable_l5_enabled())
@@ -330,7 +254,7 @@ static p4d_t *fill_p4d(pgd_t *pgd, unsigned long vaddr)
{
if (pgd_none(*pgd)) {
p4d_t *p4d = (p4d_t *)spp_getpage();
- pgd_populate(&init_mm, pgd, p4d);
+ pgd_populate_kernel(vaddr, pgd, p4d);
if (p4d != p4d_offset(pgd, 0))
printk(KERN_ERR "PAGETABLE BUG #00! %p <-> %p\n",
p4d, p4d_offset(pgd, 0));
@@ -342,7 +266,7 @@ static pud_t *fill_pud(p4d_t *p4d, unsigned long vaddr)
{
if (p4d_none(*p4d)) {
pud_t *pud = (pud_t *)spp_getpage();
- p4d_populate(&init_mm, p4d, pud);
+ p4d_populate_kernel(vaddr, p4d, pud);
if (pud != pud_offset(p4d, 0))
printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
pud, pud_offset(p4d, 0));
@@ -795,7 +719,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
page_size_mask, prot, init);
spin_lock(&init_mm.page_table_lock);
- p4d_populate_init(&init_mm, p4d, pud, init);
+ p4d_populate_kernel_init(vaddr, p4d, pud, init);
spin_unlock(&init_mm.page_table_lock);
}
@@ -808,7 +732,6 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
unsigned long page_size_mask,
pgprot_t prot, bool init)
{
- bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
paddr_last = paddr_end;
@@ -837,18 +760,14 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
spin_lock(&init_mm.page_table_lock);
if (pgtable_l5_enabled())
- pgd_populate_init(&init_mm, pgd, p4d, init);
+ pgd_populate_kernel_init(vaddr, pgd, p4d, init);
else
- p4d_populate_init(&init_mm, p4d_offset(pgd, vaddr),
- (pud_t *) p4d, init);
+ p4d_populate_kernel_init(vaddr, p4d_offset(pgd, vaddr),
+ (pud_t *) p4d, init);
spin_unlock(&init_mm.page_table_lock);
- pgd_changed = true;
}
- if (pgd_changed)
- sync_global_pgds(vaddr_start, vaddr_end - 1);
-
return paddr_last;
}
@@ -1642,8 +1561,6 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
err = -ENOMEM;
} else
err = vmemmap_populate_basepages(start, end, node, NULL);
- if (!err)
- sync_global_pgds(start, end - 1);
return err;
}
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0539efd0d216..e825952d25b2 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -108,7 +108,7 @@ static void __init kasan_populate_p4d(p4d_t *p4d, unsigned long addr,
if (p4d_none(*p4d)) {
void *p = early_alloc(PAGE_SIZE, nid, true);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
pud = pud_offset(p4d, addr);
@@ -128,7 +128,7 @@ static void __init kasan_populate_pgd(pgd_t *pgd, unsigned long addr,
if (pgd_none(*pgd)) {
p = early_alloc(PAGE_SIZE, nid, true);
- pgd_populate(&init_mm, pgd, p);
+ pgd_populate_kernel(addr, pgd, p);
}
p4d = p4d_offset(pgd, addr);
@@ -255,7 +255,7 @@ static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
if (p4d_none(*p4d)) {
p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
- p4d_populate(&init_mm, p4d, p);
+ p4d_populate_kernel(addr, p4d, p);
}
} while (p4d++, addr = next, addr != end);
}
@@ -273,7 +273,7 @@ static void __init kasan_shallow_populate_pgds(void *start, void *end)
if (pgd_none(*pgd)) {
p = early_alloc(PAGE_SIZE, NUMA_NO_NODE, true);
- pgd_populate(&init_mm, pgd, p);
+ pgd_populate_kernel(addr, pgd, p);
}
/*
--
2.43.0
* Re: [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables
2025-07-09 13:16 [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
` (2 preceding siblings ...)
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 3/3] x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant Harry Yoo
@ 2025-07-09 13:24 ` Harry Yoo
3 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-07-09 13:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm
On Wed, Jul 09, 2025 at 10:16:54PM +0900, Harry Yoo wrote:
> Harry Yoo (3):
> mm: introduce and use {pgd,p4d}_populate_kernel()
> x86/mm: define p*d_populate_kernel() and top level page table sync
> x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant
>
> arch/x86/include/asm/pgalloc.h | 41 +++++++++
> arch/x86/mm/init_64.c | 147 ++++++++++++++++-----------------
> arch/x86/mm/kasan_init_64.c | 8 +-
> include/asm-generic/pgalloc.h | 4 +
> include/linux/pgalloc.h | 0
Oops, while I created include/linux/pgalloc.h at first, I removed it
during development and didn't realize it was still staged in git.
Will fix it in the next version, but let me wait for some feedback
for a while.
> mm/kasan/init.c | 10 +--
> mm/percpu.c | 4 +-
> mm/sparse-vmemmap.c | 4 +-
> 8 files changed, 130 insertions(+), 88 deletions(-)
> create mode 100644 include/linux/pgalloc.h
>
> --
> 2.43.0
>
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync Harry Yoo
@ 2025-07-09 21:13 ` Andrew Morton
2025-07-10 8:27 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2025-07-09 21:13 UTC (permalink / raw)
To: Harry Yoo
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, H . Peter Anvin,
Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
Vincenzo Frascino, Juergen Gross, Kevin Brodsky, Muchun Song,
Oscar Salvador, Joao Martins, Lorenzo Stoakes, Jane Chu,
Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Wed, 9 Jul 2025 22:16:56 +0900 Harry Yoo <harry.yoo@oracle.com> wrote:
> Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
> Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs")
Fortunately both of these appeared in 6.9-rc7, which minimizes the
problem with having more than one Fixes:.
But still, the Fixes: is a pointer telling -stable maintainers where in
the kernel history we want them to insert the patch(es). Giving them
multiple insertion points is confusing! Can we narrow this down
to a single Fixes:?
* Re: [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync
2025-07-09 21:13 ` Andrew Morton
@ 2025-07-10 8:27 ` Harry Yoo
2025-07-11 4:02 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-10 8:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, H . Peter Anvin,
Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
Vincenzo Frascino, Juergen Gross, Kevin Brodsky, Muchun Song,
Oscar Salvador, Joao Martins, Lorenzo Stoakes, Jane Chu,
Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Wed, Jul 09, 2025 at 02:13:59PM -0700, Andrew Morton wrote:
> On Wed, 9 Jul 2025 22:16:56 +0900 Harry Yoo <harry.yoo@oracle.com> wrote:
>
> > Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
> > Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs")
>
> Fortunately both of these appeared in 6.9-rc7, which minimizes the
> problem with having more than one Fixes:.
>
> But still, the Fixes: is a pointer telling -stable maintainers where in
> the kernel history we want them to insert the patch(es). Giving them
> multiple insertions points is confusing! Can we narrow this down
> to a single Fixes:?
If I had to choose only one, I think it should be 4917f55b4ef9,
since faf1c0008a33 is not yet known to be triggered without enlarging
struct page (and once it's backported, it fixes both of them).
Will update in the next version.
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync
2025-07-10 8:27 ` Harry Yoo
@ 2025-07-11 4:02 ` Harry Yoo
2025-07-11 4:16 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-11 4:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, H . Peter Anvin,
Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
Vincenzo Frascino, Juergen Gross, Kevin Brodsky, Muchun Song,
Oscar Salvador, Joao Martins, Lorenzo Stoakes, Jane Chu,
Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Thu, Jul 10, 2025 at 05:27:36PM +0900, Harry Yoo wrote:
> On Wed, Jul 09, 2025 at 02:13:59PM -0700, Andrew Morton wrote:
> > On Wed, 9 Jul 2025 22:16:56 +0900 Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > > Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
> > > Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs")
> >
> > Fortunately both of these appeared in 6.9-rc7, which minimizes the
> > problem with having more than one Fixes:.
> >
> > But still, the Fixes: is a pointer telling -stable maintainers where in
> > the kernel history we want them to insert the patch(es). Giving them
> > multiple insertions points is confusing! Can we narrow this down
> > to a single Fixes:?
>
> If I had to choose only one I think it should be 4917f55b4ef9,
> since faf1c0008a33 is not yet known to be triggered without enlarging
> struct page (and once it's backported it fixes both of them).
On second look, faf1c0008a33 was introduced in v5.13-rc1 and
4917f55b4ef9 was introduced in v5.19-rc1.
I'll go with Fixes: faf1c0008a33 because it was introduced earlier.
> Will update in the next version.
>
> --
> Cheers,
> Harry / Hyeonggon
>
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync
2025-07-11 4:02 ` Harry Yoo
@ 2025-07-11 4:16 ` Harry Yoo
0 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-07-11 4:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Dennis Zhou, Tejun Heo, Christoph Lameter, H . Peter Anvin,
Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
Vincenzo Frascino, Juergen Gross, Kevin Brodsky, Muchun Song,
Oscar Salvador, Joao Martins, Lorenzo Stoakes, Jane Chu,
Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Fri, Jul 11, 2025 at 01:02:08PM +0900, Harry Yoo wrote:
> On Thu, Jul 10, 2025 at 05:27:36PM +0900, Harry Yoo wrote:
> > On Wed, Jul 09, 2025 at 02:13:59PM -0700, Andrew Morton wrote:
> > > On Wed, 9 Jul 2025 22:16:56 +0900 Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > > Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
> > > > Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs")
> > >
> > > Fortunately both of these appeared in 6.9-rc7, which minimizes the
> > > problem with having more than one Fixes:.
> > >
> > > But still, the Fixes: is a pointer telling -stable maintainers where in
> > > the kernel history we want them to insert the patch(es). Giving them
> > > multiple insertions points is confusing! Can we narrow this down
> > > to a single Fixes:?
> >
> > If I had to choose only one I think it should be 4917f55b4ef9,
> > since faf1c0008a33 is not yet known to be triggered without enlarging
> > struct page (and once it's backported it fixes both of them).
>
> On second look, faf1c0008a33 is introduced in v5.13-rc1 and
> 4917f55b4ef9 is introduced in v5.19-rc1.
>
> I'll go with Fixes: faf1c0008a33 because it's introduced earlier.
Uh, on third look, Fixes: faf1c0008a33 is incorrect :/
It should be Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges").
That is the commit that started initializing unused vmemmap with
PAGE_UNUSED, which can lead to bugs when the current task's mm is not
init_mm, because it accesses vmemmap before sync_global_pgds().
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
@ 2025-07-11 16:18 ` David Hildenbrand
2025-07-13 11:39 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-07-11 16:18 UTC (permalink / raw)
To: Harry Yoo, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin,
Arnd Bergmann, Andrew Morton, Dennis Zhou, Tejun Heo,
Christoph Lameter
Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On 09.07.25 15:16, Harry Yoo wrote:
> Intrdocue and use {pgd,p4d}_pouplate_kernel() in core MM code when
> populating PGD and P4D entries corresponding to the kernel address
> space. The main purpose of these helpers is to ensure synchronization of
> the kernel portion of the top-level page tables whenever such an entry
> is populated.
>
> Until now, the kernel has relied on each architecture to handle
> synchronization of top-level page tables in an ad-hoc manner.
> For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for
> direct mapping and vmemmap mapping changes").
>
> However, this approach has proven fragile, as it's easy to forget to
> perform the necessary synchronization when introducing new changes.
>
> To address this, introduce _kernel() varients of the page table
s/varients/variants/
> population helpers that invoke architecture-specific hooks to properly
> synchronize the page tables.
I was expecting to see the sync be done in common code -- such that it
cannot be missed :)
But it's really just rerouting to the arch code where the sync can be
done, correct?
--
Cheers,
David / dhildenb
* Re: [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-11 16:18 ` David Hildenbrand
@ 2025-07-13 11:39 ` Harry Yoo
2025-07-13 17:56 ` Mike Rapoport
0 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-13 11:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Fri, Jul 11, 2025 at 06:18:44PM +0200, David Hildenbrand wrote:
> On 09.07.25 15:16, Harry Yoo wrote:
> > Intrdocue and use {pgd,p4d}_pouplate_kernel() in core MM code when
> > populating PGD and P4D entries corresponding to the kernel address
> > space. The main purpose of these helpers is to ensure synchronization of
> > the kernel portion of the top-level page tables whenever such an entry
> > is populated.
> >
> > Until now, the kernel has relied on each architecture to handle
> > synchronization of top-level page tables in an ad-hoc manner.
> > For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for
> > direct mapping and vmemmap mapping changes").
> >
> > However, this approach has proven fragile, as it's easy to forget to
> > perform the necessary synchronization when introducing new changes.
> >
> > To address this, introduce _kernel() varients of the page table
>
> s/varients/variants/
Will fix. Thanks.
> > population helpers that invoke architecture-specific hooks to properly
> > synchronize the page tables.
>
> I was expecting to see the sync be done in common code -- such that it
> cannot be missed :)
You mean something like an arch-independent implementation of
sync_global_pgds()?
That would be a "much more robust" approach ;)
To do that, the kernel would need to maintain a list of page tables that
have kernel portion mapped and perform the sync in the common code.
But determining which page tables to add to the list would be highly
architecture-specific. For example, I think some architectures, unlike
x86, use separate page tables for kernel space (e.g., arm64 TTBR1,
SPARC), and there user page tables should not be affected.
While doing the sync in common code might be a more robust option
in the long term, I'm afraid that making it work correctly across
all architectures would be challenging, due to differences in how each
architecture manages the kernel address space.
> But it's really just rerouting to the arch code where the sync can be done,
> correct?
Yes, that's correct.
Thanks for taking a look!
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-13 11:39 ` Harry Yoo
@ 2025-07-13 17:56 ` Mike Rapoport
2025-07-14 8:10 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Mike Rapoport @ 2025-07-13 17:56 UTC (permalink / raw)
To: Harry Yoo
Cc: David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin,
Arnd Bergmann, Andrew Morton, Dennis Zhou, Tejun Heo,
Christoph Lameter, H . Peter Anvin, Alexander Potapenko,
Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Juergen Gross,
Kevin Brodsky, Muchun Song, Oscar Salvador, Joao Martins,
Lorenzo Stoakes, Jane Chu, Alistair Popple, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Sun, Jul 13, 2025 at 08:39:53PM +0900, Harry Yoo wrote:
> On Fri, Jul 11, 2025 at 06:18:44PM +0200, David Hildenbrand wrote:
> > On 09.07.25 15:16, Harry Yoo wrote:
> > > Intrdocue and use {pgd,p4d}_pouplate_kernel() in core MM code when
> > > populating PGD and P4D entries corresponding to the kernel address
> > > space. The main purpose of these helpers is to ensure synchronization of
> > > the kernel portion of the top-level page tables whenever such an entry
> > > is populated.
> > >
> > > Until now, the kernel has relied on each architecture to handle
> > > synchronization of top-level page tables in an ad-hoc manner.
> > > For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for
> > > direct mapping and vmemmap mapping changes").
> > >
> > > However, this approach has proven fragile, as it's easy to forget to
> > > perform the necessary synchronization when introducing new changes.
> > >
> > > To address this, introduce _kernel() varients of the page table
> >
> > s/varients/variants/
>
> Will fix. Thanks.
>
> > > population helpers that invoke architecture-specific hooks to properly
> > > synchronize the page tables.
> >
> > I was expecting to see the sync be done in common code -- such that it
> > cannot be missed :)
>
> You mean something like an arch-independent implementation of
> sync_global_pgds()?
>
> That would be a "much more robust" approach ;)
>
> To do that, the kernel would need to maintain a list of page tables that
> have kernel portion mapped and perform the sync in the common code.
>
> But determining which page tables to add to the list would be highly
> architecture-specific. For example, I think some architectures use separate
> page tables for kernel space, unlike x86 (e.g., arm64 TTBR1, SPARC) and
> user page tables should not be affected.
sync_global_pgds() can still be implemented per architecture, but it can
be called from the common code.
We already have something like that for vmalloc, which calls
arch_sync_kernel_mappings(). It's implemented only by x86-32 and arm;
other architectures do not define it.
> While doing the sync in common code might be a more robust option
> in the long term, I'm afraid that making it work correctly across
> all architectures would be challenging, due to differences in how each
> architecture manages the kernel address space.
>
> > But it's really just rerouting to the arch code where the sync can be done,
> > correct?
>
> Yes, that's correct.
>
> Thanks for taking a look!
>
> --
> Cheers,
> Harry / Hyeonggon
--
Sincerely yours,
Mike.
* Re: [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-13 17:56 ` Mike Rapoport
@ 2025-07-14 8:10 ` Harry Yoo
2025-07-14 15:32 ` Harry Yoo
0 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-07-14 8:10 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin,
Arnd Bergmann, Andrew Morton, Dennis Zhou, Tejun Heo,
Christoph Lameter, H . Peter Anvin, Alexander Potapenko,
Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Juergen Gross,
Kevin Brodsky, Muchun Song, Oscar Salvador, Joao Martins,
Lorenzo Stoakes, Jane Chu, Alistair Popple, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
stable
On Sun, Jul 13, 2025 at 08:56:10PM +0300, Mike Rapoport wrote:
> On Sun, Jul 13, 2025 at 08:39:53PM +0900, Harry Yoo wrote:
> > On Fri, Jul 11, 2025 at 06:18:44PM +0200, David Hildenbrand wrote:
> > > > population helpers that invoke architecture-specific hooks to properly
> > > > synchronize the page tables.
> > >
> > > I was expecting to see the sync be done in common code -- such that it
> > > cannot be missed :)
> >
> > You mean something like an arch-independent implementation of
> > sync_global_pgds()?
> >
> > That would be a "much more robust" approach ;)
> >
> > To do that, the kernel would need to maintain a list of page tables that
> > have kernel portion mapped and perform the sync in the common code.
> >
> > But determining which page tables to add to the list would be highly
> > architecture-specific. For example, I think some architectures use separate
> > page tables for kernel space, unlike x86 (e.g., arm64 TTBR1, SPARC) and
> > user page tables should not be affected.
>
> sync_global_pgds() can be still implemented per architecture, but it can be
> called from the common code.
A good point, and that can be done!
Actually, that was the initial plan, but I somehow thought you couldn't
determine whether the architecture is using 5-level or 4-level paging
and so couldn't decide whether to call arch_sync_kernel_pagetables().
But looking at how it's done for vmalloc, I think it can be done in a
similar way.
> We already have something like that for vmalloc that calls
> arch_sync_kernel_mappings(). It's implemented only by x86-32 and arm, other
> architectures do not define it.
It is indeed a good example and was helpful.
Thank you for the comment, Mike!
> --
> Sincerely yours,
> Mike.
--
Cheers,
Harry / Hyeonggon
* Re: [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel()
2025-07-14 8:10 ` Harry Yoo
@ 2025-07-14 15:32 ` Harry Yoo
0 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-07-14 15:32 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin,
Arnd Bergmann, Andrew Morton, Dennis Zhou, Tejun Heo,
Christoph Lameter, H . Peter Anvin, Alexander Potapenko,
Andrey Konovalov, Dmitry Vyukov, Vincenzo Frascino, Juergen Gross,
Kevin Brodsky, Muchun Song, Oscar Salvador, Joao Martins,
Lorenzo Stoakes, Jane Chu, Alistair Popple, Gwan-gyeong Mun,
Aneesh Kumar K . V, x86, Joerg Roedel, linux-kernel, linux-arch,
linux-mm, stable
On Mon, Jul 14, 2025 at 05:10:44PM +0900, Harry Yoo wrote:
> On Sun, Jul 13, 2025 at 08:56:10PM +0300, Mike Rapoport wrote:
> > On Sun, Jul 13, 2025 at 08:39:53PM +0900, Harry Yoo wrote:
> > > On Fri, Jul 11, 2025 at 06:18:44PM +0200, David Hildenbrand wrote:
> > > > > population helpers that invoke architecture-specific hooks to properly
> > > > > synchronize the page tables.
> > > >
> > > > I was expecting to see the sync be done in common code -- such that it
> > > > cannot be missed :)
> > >
> > > You mean something like an arch-independent implementation of
> > > sync_global_pgds()?
> > >
> > > That would be a "much more robust" approach ;)
> > >
> > > To do that, the kernel would need to maintain a list of page tables that
> > > have kernel portion mapped and perform the sync in the common code.
> > >
> > > But determining which page tables to add to the list would be highly
> > > architecture-specific. For example, I think some architectures use separate
> > > page tables for kernel space, unlike x86 (e.g., arm64 TTBR1, SPARC) and
> > > user page tables should not be affected.
> >
> > sync_global_pgds() can be still implemented per architecture, but it can be
> > called from the common code.
>
> A good point, and that can be done!
>
> Actually, that was the initial plan and I somehow thought that
> you can't determine if the architecture is using 5-level or 4-level paging
> and decide whether to call arch_sync_kernel_pagetables(). But looking at
> how it's done in vmalloc, I think it can be done in a similar way.
>
> > We already have something like that for vmalloc that calls
> > arch_sync_kernel_mappings(). It's implemented only by x86-32 and arm, other
> > architectures do not define it.
>
> It is indeed a good example and was helpful.
> Thank you for the comment, Mike!
[Adding Joerg to Cc]
Wait, after reading more of the history of page table synchronization
for the vmalloc area, I realized that, at least on x86-64, all PGD
entries for vmalloc are preallocated [1].
But in this case I'm not sure adding/removing memory to/from the
system is performance-critical enough to warrant a similar optimization.
I'll stick with the current approach unless someone argues otherwise.
[1] https://lore.kernel.org/all/20200721095953.6218-2-joro@8bytes.org
Also, vmalloc and other features that use apply_to_page_range() are not
affected by this change, as they have their own ways of synchronizing
kernel mappings. Perhaps that can be unified later, but given that this
series needs to be backported, I'd prefer to fix the bug first and defer
cleanups to a later time.
Thanks!
--
Cheers,
Harry / Hyeonggon
Thread overview: 14+ messages
2025-07-09 13:16 [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 1/3] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
2025-07-11 16:18 ` David Hildenbrand
2025-07-13 11:39 ` Harry Yoo
2025-07-13 17:56 ` Mike Rapoport
2025-07-14 8:10 ` Harry Yoo
2025-07-14 15:32 ` Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 2/3] x86/mm: define p*d_populate_kernel() and top-level page table sync Harry Yoo
2025-07-09 21:13 ` Andrew Morton
2025-07-10 8:27 ` Harry Yoo
2025-07-11 4:02 ` Harry Yoo
2025-07-11 4:16 ` Harry Yoo
2025-07-09 13:16 ` [RFC V1 PATCH mm-hotfixes 3/3] x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant Harry Yoo
2025-07-09 13:24 ` [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables Harry Yoo