* [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: A more robust approach to sync top level kernel page tables
From: Harry Yoo @ 2025-07-09 13:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Andrey Ryabinin, Arnd Bergmann,
	Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter
  Cc: H . Peter Anvin, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov, Vincenzo Frascino, Juergen Gross, Kevin Brodsky,
	Muchun Song, Oscar Salvador, Joao Martins, Lorenzo Stoakes,
	Jane Chu, Alistair Popple, Mike Rapoport, Gwan-gyeong Mun,
	Aneesh Kumar K . V, x86, linux-kernel, linux-arch, linux-mm,
	Harry Yoo

Hi all,

During our internal testing, we started observing intermittent boot
failures when the machine uses 4-level paging and has a large amount
of persistent memory:

  BUG: unable to handle page fault for address: ffffe70000000034
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0 
  Oops: 0002 [#1] SMP NOPTI
  RIP: 0010:__init_single_page+0x9/0x6d
  Call Trace:
   <TASK>
   __init_zone_device_page+0x17/0x5d
   memmap_init_zone_device+0x154/0x1bb
   pagemap_range+0x2e0/0x40f
   memremap_pages+0x10b/0x2f0
   devm_memremap_pages+0x1e/0x60
   dev_dax_probe+0xce/0x2ec [device_dax]
   dax_bus_probe+0x6d/0xc9
   [... snip ...]
   </TASK>

It turns out that the kernel panics while initializing vmemmap
(struct page array) when the vmemmap region spans two PGD entries,
because the new PGD entry is only installed in init_mm.pgd,
but not in the page tables of other tasks.
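
As a quick sanity check of the oops above, the faulting address can be
decoded by hand. The sketch below assumes x86-64 4-level paging, where
one PGD entry covers 2^39 bytes (512 GiB); the helper names are
hypothetical and exist only for this illustration, they are not kernel
APIs.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 4-level paging: each PGD entry maps 2^39 bytes (512 GiB). */
#define PGDIR_SHIFT_4LEVEL	39
#define PTRS_PER_PGD_4LEVEL	512

/* Index of the top-level (PGD) entry covering a virtual address. */
static inline unsigned int pgd_index_4level(uint64_t addr)
{
	return (unsigned int)((addr >> PGDIR_SHIFT_4LEVEL) &
			      (PTRS_PER_PGD_4LEVEL - 1));
}

/* Byte offset of the address inside the range covered by its PGD entry. */
static inline uint64_t offset_in_pgd_entry(uint64_t addr)
{
	return addr & ((UINT64_C(1) << PGDIR_SHIFT_4LEVEL) - 1);
}

/* Decode the faulting address reported in the oops (hypothetical helper). */
void check_oops_address(void)
{
	const uint64_t fault = UINT64_C(0xffffe70000000034);

	/* The write lands in PGD slot 462 ... */
	assert(pgd_index_4level(fault) == 462);
	/* ... only 0x34 bytes past the start of that slot's range. */
	assert(offset_in_pgd_entry(fault) == 0x34);
	/* The byte just below the slot boundary belongs to slot 461. */
	assert(pgd_index_4level(fault - 0x35) == 461);
}
```

The fault is therefore only 0x34 bytes into the range covered by a
different PGD entry than the one just below it, which is consistent
with the "PGD 0" line in the oops: the faulting task's top-level table
has no entry for that slot.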

And looking at __populate_section_memmap():
  if (vmemmap_can_optimize(altmap, pgmap))
          // does not sync top level page tables
          r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
  else
          // sync top level page tables in x86
          r = vmemmap_populate(start, end, nid, altmap);

In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
synchronizes the top-level page table (see commit 9b861528a801
("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
changes")) so that all tasks in the system can see the new vmemmap area.

However, when vmemmap_can_optimize() returns true, the optimized path
skips synchronization of top-level page tables. This is because
vmemmap_populate_compound_pages() is implemented in core MM code, which
does not handle synchronization of the top-level page tables. Instead,
the core MM has historically relied on each architecture to perform this
synchronization manually.

We are not the first to hit a crash caused by unsynchronized top-level
page tables: earlier this year, Gwan-gyeong Mun attempted to address
the issue [1] [2] after hitting a kernel panic when x86 code accessed
the vmemmap area before the corresponding top-level entries were
synced. At that time, the issue was believed to be triggered only when
struct page was enlarged for debugging purposes, and the patch did not
get further updates.

It turns out that the current approach of relying on each arch to handle
the page table sync manually is fragile because 1) it's easy to forget
to sync the top-level page table, and 2) it's also easy to overlook that
the kernel should not access the vmemmap / direct mapping area before
the sync.

To address this, Dave Hansen suggested [3] [4] introducing
{pgd,p4d}_populate_kernel() for updating the kernel portion of the page
tables, allowing each architecture to explicitly perform synchronization
when installing top-level entries. With this approach, we no longer need
to worry about missing the sync step, reducing the risk of future
regressions.
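
To make the idea concrete, here is a minimal userspace model of the
pattern, not the code from this series. Every name in it (toy_mm,
toy_pgd_populate(), toy_arch_sync_kernel_pgd(),
toy_pgd_populate_kernel()) is hypothetical, and the "page tables" are
plain arrays. It only illustrates why folding the sync into the
populate helper removes the step that callers can forget.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PGD	512
#define TOY_NR_TASKS		4

/* Hypothetical stand-in for one top-level page table. */
struct toy_mm {
	uint64_t pgd[TOY_PTRS_PER_PGD];
};

static struct toy_mm toy_init_mm;		/* models init_mm */
static struct toy_mm toy_tasks[TOY_NR_TASKS];	/* models other tasks */

/* Fragile pattern: install the entry in the reference table only. */
void toy_pgd_populate(unsigned int idx, uint64_t entry)
{
	assert(idx < TOY_PTRS_PER_PGD);
	toy_init_mm.pgd[idx] = entry;
}

/* Models an arch hook: copy one top-level entry to every other table. */
void toy_arch_sync_kernel_pgd(unsigned int idx)
{
	assert(idx < TOY_PTRS_PER_PGD);
	for (size_t i = 0; i < TOY_NR_TASKS; i++)
		toy_tasks[i].pgd[idx] = toy_init_mm.pgd[idx];
}

/* Populate and sync in one call, so the sync cannot be skipped. */
void toy_pgd_populate_kernel(unsigned int idx, uint64_t entry)
{
	toy_pgd_populate(idx, entry);
	toy_arch_sync_kernel_pgd(idx);
}

/* True if every task's table matches the reference table at idx. */
bool toy_pgd_in_sync(unsigned int idx)
{
	assert(idx < TOY_PTRS_PER_PGD);
	for (size_t i = 0; i < TOY_NR_TASKS; i++)
		if (toy_tasks[i].pgd[idx] != toy_init_mm.pgd[idx])
			return false;
	return true;
}
```

In this model, a caller that uses toy_pgd_populate() alone leaves the
other tables stale until someone remembers to call the sync hook, which
mirrors the optimized vmemmap path described above. A caller that uses
toy_pgd_populate_kernel() has no such window.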
 
This patch series implements Dave Hansen's suggestion and hence carries
a Suggested-by: Dave Hansen tag.

This is an RFC primarily because it involves a non-trivial change and
I would like to fix it in a way that aligns with community consensus.

Cc stable because the lack of this series opens the door to intermittent
boot failures.

[1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com
[2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com
[3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com
[4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com 

Harry Yoo (3):
  mm: introduce and use {pgd,p4d}_populate_kernel()
  x86/mm: define p*d_populate_kernel() and top level page table sync
  x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant

 arch/x86/include/asm/pgalloc.h |  41 +++++++++
 arch/x86/mm/init_64.c          | 147 ++++++++++++++++-----------------
 arch/x86/mm/kasan_init_64.c    |   8 +-
 include/asm-generic/pgalloc.h  |   4 +
 include/linux/pgalloc.h        |   0
 mm/kasan/init.c                |  10 +--
 mm/percpu.c                    |   4 +-
 mm/sparse-vmemmap.c            |   4 +-
 8 files changed, 130 insertions(+), 88 deletions(-)
 create mode 100644 include/linux/pgalloc.h

-- 
2.43.0


