linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Andrey Ryabinin <ryabinin.a.a@gmail.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@gentwo.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>,
	Alexander Potapenko <glider@google.com>,
	Andrey Konovalov <andreyknvl@gmail.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	Vincenzo Frascino <vincenzo.frascino@arm.com>,
	Juergen Gross <jgross@suse.com>,
	Kevin Brodsky <kevin.brodsky@arm.com>,
	Muchun Song <muchun.song@linux.dev>,
	Oscar Salvador <osalvador@suse.de>,
	Joao Martins <joao.m.martins@oracle.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Jane Chu <jane.chu@oracle.com>,
	Alistair Popple <apopple@nvidia.com>,
	Mike Rapoport <rppt@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>,
	Joerg Roedel <joro@8bytes.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 mm-hotfixes 0/5] mm, arch: a more robust approach to sync top level kernel page tables
Date: Mon, 21 Jul 2025 08:57:19 +0900	[thread overview]
Message-ID: <aH2CXyyrYbgtUXJB@hyeyoo> (raw)
In-Reply-To: <20250720234203.9126-1-harry.yoo@oracle.com>

On Mon, Jul 21, 2025 at 08:41:58AM +0900, Harry Yoo wrote:
> RFC v1: https://lore.kernel.org/linux-mm/20250709131657.5660-1-harry.yoo@oracle.com
> 
> RFC v1 -> v2:
> - Dropped RFC tag.
> - Exposed page table sync code to common code (Mike Rapoport).
> - Used only one Fixes: tag in patch 3 instead of two,
>   to avoid confusion (Andrew Morton)
> - Reused existing ARCH_PAGE_TABLE_SYNC_MASK and
>   arch_sync_kernel_mappings() facility (currently used by vmalloc and
>   ioremap) forpage table sync instead of introducing a new interface.
> 
> A quick question: Technically, patch 4 and 5 don't necessarily need to be
> backported. Does it make sense to backport only patch 1-3?
>
> # The problem: It is easy to miss/overlook page table synchronization
> 
> Hi all,

Looks like I forgot to Cc: Uladzislau and Joerg.. adding them to Cc.

-- 
Cheers,
Harry / Hyeonggon
 
> During our internal testing, we started observing intermittent boot
> failures when the machine uses 4-level paging and has a large amount
> of persistent memory:
> 
>   BUG: unable to handle page fault for address: ffffe70000000034
>   #PF: supervisor write access in kernel mode
>   #PF: error_code(0x0002) - not-present page
>   PGD 0 P4D 0 
>   Oops: 0002 [#1] SMP NOPTI
>   RIP: 0010:__init_single_page+0x9/0x6d
>   Call Trace:
>    <TASK>
>    __init_zone_device_page+0x17/0x5d
>    memmap_init_zone_device+0x154/0x1bb
>    pagemap_range+0x2e0/0x40f
>    memremap_pages+0x10b/0x2f0
>    devm_memremap_pages+0x1e/0x60
>    dev_dax_probe+0xce/0x2ec [device_dax]
>    dax_bus_probe+0x6d/0xc9
>    [... snip ...]
>    </TASK>
> 
> It turns out that the kernel panics while initializing vmemmap
> (struct page array) when the vmemmap region spans two PGD entries,
> because the new PGD entry is only installed in init_mm.pgd,
> but not in the page tables of other tasks.
> 
> And looking at __populate_section_memmap():
>   if (vmemmap_can_optimize(altmap, pgmap))                                
>           // does not sync top level page tables
>           r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>   else                                                                    
>           // sync top level page tables in x86
>           r = vmemmap_populate(start, end, nid, altmap);
> 
> In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
> synchronizes the top level page table (See commit 9b861528a801
> ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping
> changes")) so that all tasks in the system can see the new vmemmap area.
> 
> However, when vmemmap_can_optimize() returns true, the optimized path
> skips synchronization of top-level page tables. This is because
> vmemmap_populate_compound_pages() is implemented in core MM code, which
> does not handle synchronization of the top-level page tables. Instead,
> the core MM has historically relied on each architecture to perform this
> synchronization manually.
> 
> We're not the first party to encounter a crash caused by not-sync'd
> top level page tables: earlier this year, Gwan-gyeong Mun attempted to
> address the issue [1] [2] after hitting a kernel panic when x86 code
> accessed the vmemmap area before the corresponding top-level entries
> were synced. At that time, the issue was believed to be triggered
> only when struct page was enlarged for debugging purposes, and the patch
> did not get further updates.
> 
> It turns out that current approach of relying on each arch to handle
> the page table sync manually is fragile because 1) it's easy to forget
> to sync the top level page table, and 2) it's also easy to overlook that
> the kernel should not access the vmemmap and direct mapping areas before
> the sync.
> 
> # The solution: Make page table sync more code robust 
> 
> To address this, Dave Hansen suggested [3] [4] introducing
> {pgd,p4d}_populate_kernel() for updating kernel portion
> of the page tables and allow each architecture to explicitly perform
> synchronization when installing top-level entries. With this approach,
> we no longer need to worry about missing the sync step, reducing the risk
> of future regressions.
> 
> The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK,
> PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
> vmalloc and ioremap to synchronize page tables.
> 
> pgd_populate_kernel() looks like this:
>   #define pgd_populate_kernel(addr, pgd, p4d)                    \               
>   do {                                                           \               
>          pgd_populate(&init_mm, pgd, p4d);                       \               
>          if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)     \               
>                  arch_sync_kernel_mappings(addr, addr);          \               
>   } while (0) 
> 
> It is worth noting that vmalloc() and apply_to_range() carefully
> synchronizes page tables by calling p*d_alloc_track() and
> arch_sync_kernel_mappings(), and thus they are not affected by
> this patch series.
> 
> This patch series was hugely inspired by Dave Hansen's suggestion and
> hence added Suggested-by: Dave Hansen.
> 
> Cc stable because lack of this series opens the door to intermittent
> boot failures.
> 
> [1] https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com
> [2] https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com
> [3] https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com
> [4] https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com 
> 
> Harry Yoo (5):
>   mm: move page table sync declarations to asm/pgalloc.h
>   mm: introduce and use {pgd,p4d}_populate_kernel()
>   x86/mm: define ARCH_PAGE_TABLE_SYNC_MASK and
>     arch_sync_kernel_mappings()
>   x86/mm: convert p*d_populate{,_init} to _kernel variants
>   x86/mm: drop unnecessary calls to sync_global_pgds() and fold into its
>     sole user
> 
>  arch/x86/include/asm/pgalloc.h | 22 ++++++++++++++++++++
>  arch/x86/mm/init_64.c          | 37 +++++++++++++++++++---------------
>  arch/x86/mm/kasan_init_64.c    |  8 ++++----
>  include/asm-generic/pgalloc.h  | 30 +++++++++++++++++++++++++++
>  include/linux/vmalloc.h        | 16 ---------------
>  mm/kasan/init.c                | 10 ++++-----
>  mm/percpu.c                    |  4 ++--
>  mm/sparse-vmemmap.c            |  4 ++--
>  mm/vmalloc.c                   |  1 +
>  9 files changed, 87 insertions(+), 45 deletions(-)
> 
> -- 
> 2.43.0
>

  parent reply	other threads:[~2025-07-20 23:58 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-20 23:41 [PATCH v2 mm-hotfixes 0/5] mm, arch: a more robust approach to sync top level kernel page tables Harry Yoo
2025-07-20 23:41 ` [PATCH v2 mm-hotfixes 1/5] mm: move page table sync declarations to asm/pgalloc.h Harry Yoo
2025-07-20 23:42 ` [PATCH v2 mm-hotfixes 2/5] mm: introduce and use {pgd,p4d}_populate_kernel() Harry Yoo
2025-07-20 23:42 ` [PATCH v2 mm-hotfixes 3/5] x86/mm: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() Harry Yoo
2025-07-20 23:42 ` [PATCH v2 mm-hotfixes 4/5] x86/mm: convert p*d_populate{,_init} to _kernel variants Harry Yoo
2025-07-20 23:42 ` [PATCH v2 mm-hotfixes 5/5] x86/mm: drop unnecessary calls to sync_global_pgds() and fold into its sole user Harry Yoo
2025-07-20 23:57 ` Harry Yoo [this message]
2025-07-21 11:46   ` [PATCH v2 mm-hotfixes 0/5] mm, arch: a more robust approach to sync top level kernel page tables Harry Yoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aH2CXyyrYbgtUXJB@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreyknvl@gmail.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=apopple@nvidia.com \
    --cc=arnd@arndb.de \
    --cc=bp@alien8.de \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=dennis@kernel.org \
    --cc=dvyukov@google.com \
    --cc=glider@google.com \
    --cc=gwan-gyeong.mun@intel.com \
    --cc=hpa@zytor.com \
    --cc=jane.chu@oracle.com \
    --cc=jgross@suse.com \
    --cc=joao.m.martins@oracle.com \
    --cc=joro@8bytes.org \
    --cc=kevin.brodsky@arm.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=peterz@infradead.org \
    --cc=rppt@kernel.org \
    --cc=ryabinin.a.a@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=urezki@gmail.com \
    --cc=vincenzo.frascino@arm.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).