public inbox for linux-mm@kvack.org
From: Brendan Jackman <jackmanb@google.com>
To: Brendan Jackman <jackmanb@google.com>,
	Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	 Vlastimil Babka <vbabka@kernel.org>, Wei Xu <weixugc@google.com>,
	 Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>,
	Lorenzo Stoakes <ljs@kernel.org>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<x86@kernel.org>,  <rppt@kernel.org>,
	Sumit Garg <sumit.garg@oss.qualcomm.com>, <derkling@google.com>,
	 <reijiw@google.com>, Will Deacon <will@kernel.org>,
	<rientjes@google.com>,
	 "Kalyazin, Nikita" <kalyazin@amazon.co.uk>,
	<patrick.roy@linux.dev>,
	 "Itazuri, Takahiro" <itazur@amazon.co.uk>,
	Andy Lutomirski <luto@kernel.org>,
	 David Kaplan <david.kaplan@amd.com>,
	Thomas Gleixner <tglx@kernel.org>, Yosry Ahmed <yosry@kernel.org>
Subject: Re: [PATCH v2 02/22] x86/mm: Generalize LDT remap into "mm-local region"
Date: Wed, 25 Mar 2026 14:23:37 +0000	[thread overview]
Message-ID: <DHBXJD585D49.2FLK9J7LOYOB9@google.com> (raw)
In-Reply-To: <20260320-page_alloc-unmapped-v2-2-28bf1bd54f41@google.com>

Summarizing the Sashiko review [0] here so all the comments are in one place...

[0] https://sashiko.dev/#/patchset/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41%40google.com

On Fri Mar 20, 2026 at 6:23 PM UTC, Brendan Jackman wrote:
> Various security features benefit from having process-local address
> mappings. Examples include no-direct-map guest_memfd [2] and significant
> optimizations for ASI [1].
>
> As pointed out by Andy in [0], x86 already has a PGD entry that is local
> to the mm, which is used for the LDT.
>
> So, simply redefine that entry's region as "the mm-local region" and
> then redefine the LDT region as a sub-region of that.
>
> With the currently-envisaged usecases, there will be many situations
> where almost no processes have any need for the mm-local region.
> Therefore, avoid its overhead (memory cost of pagetables, alloc/free
> overhead during fork/exit) for processes that don't use it by requiring
> its users to explicitly initialize it via the new mm_local_* API.
>
> This means that the LDT remap code can be simplified:
>
> 1. map_ldt_struct_to_user() and free_ldt_pgtables() are no longer
>    required as the mm_local core code handles that automatically.
>
> 2. The sanity-check logic is unified: in both cases just walk the
>    pagetables via a generic mechanism. This slightly relaxes the
>    sanity-checking since lookup_address_in_pgd() is more flexible than
>    pgd_to_pmd_walk(), but this seems to be worth it for the simplified
>    code.
>
> On 64-bit, the mm-local region gets a whole PGD. On 32-bit, it just
> gets one PMD, i.e. it is completely consumed by the LDT remap - no
> investigation has been done into whether it's feasible to expand the
> region on 32-bit. Most likely there is no strong usecase for that
> anyway.
>
> In both cases, to reconcile the need for on-demand mm initialisation
> with the desire to transparently propagate mappings to userspace under
> KPTI, the user and kernel
> pagetables are shared at the highest level possible. For PAE that means
> the PTE table is shared and for 64-bit the P4D/PUD. This is implemented
> by pre-allocating the first shared table when the mm-local region is
> first initialised.
>
> The PAE implementation of mm_local_map_to_user() does not allocate
> pagetables, it assumes the PMD has been preallocated. To make that
> assumption safer, expose PREALLOCATED_PMDs in the arch headers so that
> mm_local_map_to_user() can have a BUILD_BUG_ON().
>
> [0] https://lore.kernel.org/linux-mm/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@mail.gmail.com/
> [1] https://linuxasi.dev/
> [2] https://lore.kernel.org/all/20250924151101.2225820-1-patrick.roy@campus.lmu.de
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
>  Documentation/arch/x86/x86_64/mm.rst    |   4 +-
>  arch/x86/Kconfig                        |   2 +
>  arch/x86/include/asm/mmu_context.h      | 119 ++++++++++++++++++++++++++++-
>  arch/x86/include/asm/page.h             |  32 ++++++++
>  arch/x86/include/asm/pgtable_32_areas.h |   9 ++-
>  arch/x86/include/asm/pgtable_64_types.h |  12 ++-
>  arch/x86/kernel/ldt.c                   | 130 +++++---------------------------
>  arch/x86/mm/pgtable.c                   |  32 +-------
>  include/linux/mm.h                      |  13 ++++
>  include/linux/mm_types.h                |   2 +
>  kernel/fork.c                           |   1 +
>  mm/Kconfig                              |  11 +++
>  12 files changed, 217 insertions(+), 150 deletions(-)
>
> diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
> index a6cf05d51bd8c..fa2bb7bab6a42 100644
> --- a/Documentation/arch/x86/x86_64/mm.rst
> +++ b/Documentation/arch/x86/x86_64/mm.rst
> @@ -53,7 +53,7 @@ Complete virtual memory map with 4-level page tables
>    ____________________________________________________________|___________________________________________________________
>                      |            |                  |         |
>     ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> -   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
> +   ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | MM-local kernel data. Includes LDT remap for PTI
>     ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
>     ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
>     ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
> @@ -123,7 +123,7 @@ Complete virtual memory map with 5-level page tables
>    ____________________________________________________________|___________________________________________________________
>                      |            |                  |         |
>     ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
> -   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
> +   ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | MM-local kernel data. Includes LDT remap for PTI
>     ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
>     ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
>     ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 8038b26ae99e0..d7073b6077c62 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -133,6 +133,7 @@ config X86
>  	select ARCH_SUPPORTS_RT
>  	select ARCH_SUPPORTS_AUTOFDO_CLANG
>  	select ARCH_SUPPORTS_PROPELLER_CLANG    if X86_64
> +	select ARCH_SUPPORTS_MM_LOCAL_REGION	if X86_64 || X86_PAE
>  	select ARCH_USE_BUILTIN_BSWAP
>  	select ARCH_USE_CMPXCHG_LOCKREF		if X86_CX8
>  	select ARCH_USE_MEMTEST
> @@ -2323,6 +2324,7 @@ config CMDLINE_OVERRIDE
>  config MODIFY_LDT_SYSCALL
>  	bool "Enable the LDT (local descriptor table)" if EXPERT
>  	default y
> +	select MM_LOCAL_REGION if MITIGATION_PAGE_TABLE_ISOLATION || X86_PAE
>  	help
>  	  Linux can allow user programs to install a per-process x86
>  	  Local Descriptor Table (LDT) using the modify_ldt(2) system
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index ef5b507de34e2..14f75d1d7e28f 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -8,8 +8,10 @@
>  
>  #include <trace/events/tlb.h>
>  
> +#include <asm/tlb.h>
>  #include <asm/tlbflush.h>
>  #include <asm/paravirt.h>
> +#include <asm/pgalloc.h>
>  #include <asm/debugreg.h>
>  #include <asm/gsseg.h>
>  #include <asm/desc.h>
> @@ -59,7 +61,6 @@ static inline void init_new_context_ldt(struct mm_struct *mm)
>  }
>  int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
>  void destroy_context_ldt(struct mm_struct *mm);
> -void ldt_arch_exit_mmap(struct mm_struct *mm);
>  #else	/* CONFIG_MODIFY_LDT_SYSCALL */
>  static inline void init_new_context_ldt(struct mm_struct *mm) { }
>  static inline int ldt_dup_context(struct mm_struct *oldmm,
> @@ -68,7 +69,6 @@ static inline int ldt_dup_context(struct mm_struct *oldmm,
>  	return 0;
>  }
>  static inline void destroy_context_ldt(struct mm_struct *mm) { }
> -static inline void ldt_arch_exit_mmap(struct mm_struct *mm) { }
>  #endif
>  
>  #ifdef CONFIG_MODIFY_LDT_SYSCALL
> @@ -223,10 +223,123 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
>  	return ldt_dup_context(oldmm, mm);
>  }
>  
> +#ifdef CONFIG_MM_LOCAL_REGION
> +static inline void mm_local_region_free(struct mm_struct *mm)
> +{
> +	if (mm_local_region_used(mm)) {
> +		struct mmu_gather tlb;
> +		unsigned long start = MM_LOCAL_BASE_ADDR;
> +		unsigned long end = MM_LOCAL_END_ADDR;
> +
> +		/*
> +		 * Although free_pgd_range() is intended for freeing user
> +		 * page-tables, it also works out for kernel mappings on x86.
> +		 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> +		 * range-tracking logic in __tlb_adjust_range().
> +		 */
> +		tlb_gather_mmu_fullmm(&tlb, mm);
> +		free_pgd_range(&tlb, start, end, start, end);
> +		tlb_finish_mmu(&tlb);
> +
> +		mm_flags_clear(MMF_LOCAL_REGION_USED, mm);
> +	}
> +}
> +
> +#if defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) && defined(CONFIG_X86_PAE)
> +static inline pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
> +{
> +	p4d_t *p4d;
> +	pud_t *pud;
> +
> +	if (pgd->pgd == 0)
> +		return NULL;
> +
> +	p4d = p4d_offset(pgd, va);
> +	if (p4d_none(*p4d))
> +		return NULL;
> +
> +	pud = pud_offset(p4d, va);
> +	if (pud_none(*pud))
> +		return NULL;
> +
> +	return pmd_offset(pud, va);
> +}
> +
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	pgd_t *k_pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
> +	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> +	pmd_t *k_pmd, *u_pmd;
> +	int err;
> +
> +	BUILD_BUG_ON(!PREALLOCATED_PMDS);
> +
> +	k_pmd = pgd_to_pmd_walk(k_pgd, MM_LOCAL_BASE_ADDR);
> +	u_pmd = pgd_to_pmd_walk(u_pgd, MM_LOCAL_BASE_ADDR);
> +
> +	BUILD_BUG_ON(MM_LOCAL_END_ADDR - MM_LOCAL_BASE_ADDR > PMD_SIZE);
> +
> +	/* Preallocate the PTE table so it can be shared. */
> +	err = pte_alloc(mm, k_pmd);
> +	if (err)
> +		return err;
> +
> +	/* Point the userspace PMD at the same PTE as the kernel PMD. */
> +	set_pmd(u_pmd, *k_pmd);
> +	return 0;
> +}
> +#elif defined(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION)
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	pgd_t *pgd;
> +	int err;
> +
> +	err = preallocate_sub_pgd(mm, MM_LOCAL_BASE_ADDR);
> +	if (err)
> +		return err;
> +
> +	pgd = pgd_offset(mm, MM_LOCAL_BASE_ADDR);
> +	set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> +	return 0;
> +}
> +#else
> +static inline int mm_local_map_to_user(struct mm_struct *mm)
> +{
> +	WARN_ONCE(1, "mm_local_map_to_user() not implemented");
> +	return -EINVAL;
> +}
> +#endif
> +
> +/*
> + * Do initial setup of the mm-local region. Call from process context.
> + *
> + * Under PTI, userspace shares the pagetables for the mm-local region with the
> + * kernel (if you map stuff here, it's immediately mapped into userspace too),
> + * as was already the case for the LDT remap. This assumes nothing gets mapped
> + * in here that needs to be protected from Meltdown-type attacks by the
> + * current process.
> + */
> +static inline int mm_local_region_init(struct mm_struct *mm)
> +{
> +	int err;
> +
> +	if (boot_cpu_has(X86_FEATURE_PTI)) {
> +		err = mm_local_map_to_user(mm);
> +		if (err)
> +			return err;
> +	}
> +
> +	mm_flags_set(MMF_LOCAL_REGION_USED, mm);
> +
> +	return 0;
> +}
> +
> +#else
> +static inline void mm_local_region_free(struct mm_struct *mm) { }
> +#endif /* CONFIG_MM_LOCAL_REGION */
> +
>  static inline void arch_exit_mmap(struct mm_struct *mm)
>  {
>  	paravirt_arch_exit_mmap(mm);
> -	ldt_arch_exit_mmap(mm);
> +	mm_local_region_free(mm);
>  }
>  
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 416dc88e35c15..4de4715c3b40f 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -78,6 +78,38 @@ static __always_inline u64 __is_canonical_address(u64 vaddr, u8 vaddr_bits)
>  	return __canonical_address(vaddr, vaddr_bits) == vaddr;
>  }
>  
> +#ifdef CONFIG_X86_PAE
> +
> +/*
> + * In PAE mode, we need to do a cr3 reload (=tlb flush) when
> + * updating the top-level pagetable entries to guarantee the
> + * processor notices the update.  Since this is expensive, and
> + * all 4 top-level entries are used almost immediately in a
> + * new process's life, we just pre-populate them here.
> + */
> +#define PREALLOCATED_PMDS	PTRS_PER_PGD
> +/*
> + * "USER_PMDS" are the PMDs for the user copy of the page tables when
> + * PTI is enabled. They do not exist when PTI is disabled.  Note that
> + * this is distinct from the user _portion_ of the kernel page tables
> + * which always exists.
> + *
> + * We allocate separate PMDs for the kernel part of the user page-table
> + * when PTI is enabled. We need them to map the per-process LDT into the
> + * user-space page-table.
> + */
> +#define PREALLOCATED_USER_PMDS (boot_cpu_has(X86_FEATURE_PTI) ? KERNEL_PGD_PTRS : 0)
> +#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
> +
> +#else  /* !CONFIG_X86_PAE */
> +
> +/* No need to prepopulate any pagetable entries in non-PAE modes. */
> +#define PREALLOCATED_PMDS	0
> +#define PREALLOCATED_USER_PMDS	0
> +#define MAX_PREALLOCATED_USER_PMDS 0
> +
> +#endif	/* CONFIG_X86_PAE */
> +
>  #endif	/* __ASSEMBLER__ */
>  
>  #include <asm-generic/memory_model.h>
> diff --git a/arch/x86/include/asm/pgtable_32_areas.h b/arch/x86/include/asm/pgtable_32_areas.h
> index 921148b429676..7fccb887f8b33 100644
> --- a/arch/x86/include/asm/pgtable_32_areas.h
> +++ b/arch/x86/include/asm/pgtable_32_areas.h
> @@ -30,9 +30,14 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
>  #define CPU_ENTRY_AREA_BASE	\
>  	((FIXADDR_TOT_START - PAGE_SIZE*(CPU_ENTRY_AREA_PAGES+1)) & PMD_MASK)
>  
> -#define LDT_BASE_ADDR		\
> -	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
> +/*
> + * On 32-bit the mm-local region is currently completely consumed by the LDT
> + * remap.
> + */
> +#define MM_LOCAL_BASE_ADDR	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
> +#define MM_LOCAL_END_ADDR	(MM_LOCAL_BASE_ADDR + PMD_SIZE)
>  
> +#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
>  #define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
>  
>  #define PKMAP_BASE		\
> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> index 7eb61ef6a185f..1181565966405 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -5,8 +5,11 @@
>  #include <asm/sparsemem.h>
>  
>  #ifndef __ASSEMBLER__
> +#include <linux/build_bug.h>
>  #include <linux/types.h>
>  #include <asm/kaslr.h>
> +#include <asm/page_types.h>
> +#include <uapi/asm/ldt.h>
>  
>  /*
>   * These are used to make use of C type-checking..
> @@ -100,9 +103,12 @@ extern unsigned int ptrs_per_p4d;
>  #define GUARD_HOLE_BASE_ADDR	(GUARD_HOLE_PGD_ENTRY << PGDIR_SHIFT)
>  #define GUARD_HOLE_END_ADDR	(GUARD_HOLE_BASE_ADDR + GUARD_HOLE_SIZE)
>  
> -#define LDT_PGD_ENTRY		-240UL
> -#define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
> -#define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
> +#define MM_LOCAL_PGD_ENTRY	-240UL
> +#define MM_LOCAL_BASE_ADDR	(MM_LOCAL_PGD_ENTRY << PGDIR_SHIFT)
> +#define MM_LOCAL_END_ADDR	((MM_LOCAL_PGD_ENTRY + 1) << PGDIR_SHIFT)
> +
> +#define LDT_BASE_ADDR		MM_LOCAL_BASE_ADDR
> +#define LDT_END_ADDR		(LDT_BASE_ADDR + PMD_SIZE)
>  
>  #define __VMALLOC_BASE_L4	0xffffc90000000000UL
>  #define __VMALLOC_BASE_L5 	0xffa0000000000000UL
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index 40c5bf97dd5cc..fb2a1914539f8 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -31,6 +31,8 @@
>  
>  #include <xen/xen.h>
>  
> +/* LDTs are double-buffered, the buffers are called slots. */
> +#define LDT_NUM_SLOTS		2
>  /* This is a multiple of PAGE_SIZE. */
>  #define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
>  
> @@ -186,100 +188,36 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
>  
>  #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
>  
> -static void do_sanity_check(struct mm_struct *mm,
> -			    bool had_kernel_mapping,
> -			    bool had_user_mapping)
> +static void sanity_check_ldt_mapping(struct mm_struct *mm)
>  {
> +	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> +	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> +	unsigned int k_level, u_level;
> +	bool had_kernel, had_user;
> +
> +	had_kernel = lookup_address_in_pgd(k_pgd, LDT_BASE_ADDR, &k_level);
> +	had_user   = lookup_address_in_pgd(u_pgd, LDT_BASE_ADDR, &u_level);
> +
>  	if (mm->context.ldt) {
>  		/*
>  		 * We already had an LDT.  The top-level entry should already
>  		 * have been allocated and synchronized with the usermode
>  		 * tables.
>  		 */
> -		WARN_ON(!had_kernel_mapping);
> +		WARN_ON(!had_kernel);
>  		if (boot_cpu_has(X86_FEATURE_PTI))
> -			WARN_ON(!had_user_mapping);
> +			WARN_ON(!had_user);
>  	} else {
>  		/*
>  		 * This is the first time we're mapping an LDT for this process.
>  		 * Sync the pgd to the usermode tables.
>  		 */
> -		WARN_ON(had_kernel_mapping);
> +		WARN_ON(had_kernel);
>  		if (boot_cpu_has(X86_FEATURE_PTI))
> -			WARN_ON(had_user_mapping);
> +			WARN_ON(had_user);

But under PAE the PTE table is preallocated. lookup_address_in_pgd()
returns NULL if the address is unmapped at a higher level, but for a 4K
mapping specifically it returns a non-NULL pointer to a non-present PTE.

This WARNs immediately when I run the selftests, so I suspect I broke
this and then forgot to retest with PTI enabled.
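To make the point concrete, here's a tiny userspace model of the
asymmetry (illustrative names, not the real kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the lookup behaviour: lookup() returns NULL when a higher
 * level is missing, but a non-NULL pointer to a not-present entry once
 * the PTE table exists - which is exactly the PAE-preallocation case.
 */
#define PTES_PER_TABLE	4
#define PRESENT		0x1UL

typedef uint64_t pte_t;

static pte_t *pte_table;	/* NULL == hole at a higher level */

static pte_t *lookup(unsigned long idx)
{
	if (!pte_table)
		return NULL;
	/* Table exists: hand back the slot, present or not. */
	return &pte_table[idx % PTES_PER_TABLE];
}
```

So the check would presumably need to test presence of the entry at the
returned level rather than only the NULL-ness of the returned pointer.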

>  	}
>  }
>  
> -#ifdef CONFIG_X86_PAE
> -
> -static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
> -{
> -	p4d_t *p4d;
> -	pud_t *pud;
> -
> -	if (pgd->pgd == 0)
> -		return NULL;
> -
> -	p4d = p4d_offset(pgd, va);
> -	if (p4d_none(*p4d))
> -		return NULL;
> -
> -	pud = pud_offset(p4d, va);
> -	if (pud_none(*pud))
> -		return NULL;
> -
> -	return pmd_offset(pud, va);
> -}
> -
> -static void map_ldt_struct_to_user(struct mm_struct *mm)
> -{
> -	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> -	pmd_t *k_pmd, *u_pmd;
> -
> -	k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
> -	u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
> -
> -	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> -		set_pmd(u_pmd, *k_pmd);
> -}
> -
> -static void sanity_check_ldt_mapping(struct mm_struct *mm)
> -{
> -	pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
> -	bool had_kernel, had_user;
> -	pmd_t *k_pmd, *u_pmd;
> -
> -	k_pmd      = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
> -	u_pmd      = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
> -	had_kernel = (k_pmd->pmd != 0);
> -	had_user   = (u_pmd->pmd != 0);
> -
> -	do_sanity_check(mm, had_kernel, had_user);
> -}
> -
> -#else /* !CONFIG_X86_PAE */
> -
> -static void map_ldt_struct_to_user(struct mm_struct *mm)
> -{
> -	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -
> -	if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
> -		set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> -}
> -
> -static void sanity_check_ldt_mapping(struct mm_struct *mm)
> -{
> -	pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
> -	bool had_kernel = (pgd->pgd != 0);
> -	bool had_user   = (kernel_to_user_pgdp(pgd)->pgd != 0);
> -
> -	do_sanity_check(mm, had_kernel, had_user);
> -}
> -
> -#endif /* CONFIG_X86_PAE */
> -
>  /*
>   * If PTI is enabled, this maps the LDT into the kernelmode and
>   * usermode tables for the given mm.
> @@ -295,6 +233,8 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>  	if (!boot_cpu_has(X86_FEATURE_PTI))
>  		return 0;
>  
> +	mm_local_region_init(mm);

Need to handle errors here - mm_local_region_init() can fail.

Sashiko also thinks there's a path where we allocate a pagetable in
mm_local_region_init(), then fail without setting MMF_LOCAL_REGION_USED,
and never free that pagetable. I can't see the path it's talking about,
though.
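For the error handling, the shape would presumably be something like
the following - modelled here in userspace with stub names, since the
real change is just propagating the return value inside
map_ldt_struct():

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Userspace sketch of the suggested error propagation. The *_model
 * names are mine, not the kernel's.
 */
static bool init_should_fail;	/* stands in for a failed allocation */
static bool region_used;	/* stands in for MMF_LOCAL_REGION_USED */

static int mm_local_region_init_model(void)
{
	if (init_should_fail)
		return -ENOMEM;
	region_used = true;	/* flag set only on success */
	return 0;
}

/* map_ldt_struct() would bail out before touching the LDT slot: */
static int map_ldt_struct_model(void)
{
	int err = mm_local_region_init_model();

	if (err)
		return err;
	/* ...the existing mapping logic runs only on success... */
	return 0;
}
```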

> +
>  	/*
>  	 * Any given ldt_struct should have map_ldt_struct() called at most
>  	 * once.
> @@ -339,9 +279,6 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
>  		pte_unmap_unlock(ptep, ptl);
>  	}
>  
> -	/* Propagate LDT mapping to the user page-table */
> -	map_ldt_struct_to_user(mm);
> -
>  	ldt->slot = slot;
>  	return 0;
>  }
> @@ -390,28 +327,6 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
>  }
>  #endif /* CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
>  
> -static void free_ldt_pgtables(struct mm_struct *mm)
> -{
> -#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
> -	struct mmu_gather tlb;
> -	unsigned long start = LDT_BASE_ADDR;
> -	unsigned long end = LDT_END_ADDR;
> -
> -	if (!boot_cpu_has(X86_FEATURE_PTI))
> -		return;
> -
> -	/*
> -	 * Although free_pgd_range() is intended for freeing user
> -	 * page-tables, it also works out for kernel mappings on x86.
> -	 * We use tlb_gather_mmu_fullmm() to avoid confusing the
> -	 * range-tracking logic in __tlb_adjust_range().
> -	 */
> -	tlb_gather_mmu_fullmm(&tlb, mm);
> -	free_pgd_range(&tlb, start, end, start, end);
> -	tlb_finish_mmu(&tlb);
> -#endif
> -}
> -
>  /* After calling this, the LDT is immutable. */
>  static void finalize_ldt_struct(struct ldt_struct *ldt)
>  {
> @@ -472,7 +387,6 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
>  
>  	retval = map_ldt_struct(mm, new_ldt, 0);
>  	if (retval) {
> -		free_ldt_pgtables(mm);
>  		free_ldt_struct(new_ldt);
>  		goto out_unlock;
>  	}
> @@ -494,11 +408,6 @@ void destroy_context_ldt(struct mm_struct *mm)
>  	mm->context.ldt = NULL;
>  }
>  
> -void ldt_arch_exit_mmap(struct mm_struct *mm)
> -{
> -	free_ldt_pgtables(mm);
> -}
> -
>  static int read_ldt(void __user *ptr, unsigned long bytecount)
>  {
>  	struct mm_struct *mm = current->mm;
> @@ -645,10 +554,9 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  		/*
>  		 * This only can fail for the first LDT setup. If an LDT is
>  		 * already installed then the PTE page is already
> -		 * populated. Mop up a half populated page table.
> +		 * populated.
>  		 */
> -		if (!WARN_ON_ONCE(old_ldt))
> -			free_ldt_pgtables(mm);
> +		WARN_ON_ONCE(!old_ldt);

That should be WARN_ON_ONCE(old_ldt): per the comment, the failure can
only happen on the first LDT setup, i.e. when old_ldt is NULL.
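Spelling out the inversion as a hypothetical userspace model (purely
illustrative, not kernel code): failure is only possible on the first
setup, so a failure seen with old_ldt set is the anomaly worth warning
about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ldt_struct;	/* opaque: we only care about NULL-ness */

/*
 * In the model, an installed LDT (old_ldt != NULL) means the PTE page
 * is already populated, so the map step cannot fail.
 */
static bool map_can_fail(const struct ldt_struct *old_ldt)
{
	return old_ldt == NULL;
}

/* The check that should fire in the error path, i.e. WARN_ON_ONCE(old_ldt): */
static bool should_warn(const struct ldt_struct *old_ldt)
{
	return old_ldt != NULL;
}
```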

>  		free_ldt_struct(new_ldt);
>  		goto out_unlock;
>  	}
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 2e5ecfdce73c3..e4132696c9ef2 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -111,29 +111,6 @@ static void pgd_dtor(pgd_t *pgd)
>   */
>  
>  #ifdef CONFIG_X86_PAE
> -/*
> - * In PAE mode, we need to do a cr3 reload (=tlb flush) when
> - * updating the top-level pagetable entries to guarantee the
> - * processor notices the update.  Since this is expensive, and
> - * all 4 top-level entries are used almost immediately in a
> - * new process's life, we just pre-populate them here.
> - */
> -#define PREALLOCATED_PMDS	PTRS_PER_PGD
> -
> -/*
> - * "USER_PMDS" are the PMDs for the user copy of the page tables when
> - * PTI is enabled. They do not exist when PTI is disabled.  Note that
> - * this is distinct from the user _portion_ of the kernel page tables
> - * which always exists.
> - *
> - * We allocate separate PMDs for the kernel part of the user page-table
> - * when PTI is enabled. We need them to map the per-process LDT into the
> - * user-space page-table.
> - */
> -#define PREALLOCATED_USER_PMDS	 (boot_cpu_has(X86_FEATURE_PTI) ? \
> -					KERNEL_PGD_PTRS : 0)
> -#define MAX_PREALLOCATED_USER_PMDS KERNEL_PGD_PTRS
> -
>  void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
>  {
>  	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
> @@ -150,12 +127,6 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
>  	 */
>  	flush_tlb_mm(mm);
>  }
> -#else  /* !CONFIG_X86_PAE */
> -
> -/* No need to prepopulate any pagetable entries in non-PAE modes. */
> -#define PREALLOCATED_PMDS	0
> -#define PREALLOCATED_USER_PMDS	 0
> -#define MAX_PREALLOCATED_USER_PMDS 0
>  #endif	/* CONFIG_X86_PAE */
>  
>  static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
> @@ -375,6 +346,9 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
>  
>  void pgd_free(struct mm_struct *mm, pgd_t *pgd)
>  {
> +	/* Should be cleaned up in mmap exit path. */
> +	VM_WARN_ON_ONCE(mm_local_region_used(mm));
> +
>  	pgd_mop_up_pmds(mm, pgd);
>  	pgd_dtor(pgd);
>  	paravirt_pgd_free(mm, pgd);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 70747b53c7da9..413dc707cff9b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -906,6 +906,19 @@ static inline void mm_flags_clear_all(struct mm_struct *mm)
>  	bitmap_zero(ACCESS_PRIVATE(&mm->flags, __mm_flags), NUM_MM_FLAG_BITS);
>  }
>  
> +#ifdef CONFIG_MM_LOCAL_REGION
> +static inline bool mm_local_region_used(struct mm_struct *mm)
> +{
> +	return mm_flags_test(MMF_LOCAL_REGION_USED, mm);
> +}
> +#else
> +static inline bool mm_local_region_used(struct mm_struct *mm)
> +{
> +	VM_WARN_ON_ONCE(mm_flags_test(MMF_LOCAL_REGION_USED, mm));
> +	return false;
> +}
> +#endif
> +
>  extern const struct vm_operations_struct vma_dummy_vm_ops;
>  
>  static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cee934c6e78ec..0ca7cb7da918f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1944,6 +1944,8 @@ enum {
>  
>  #define MMF_USER_HWCAP		32	/* user-defined HWCAPs */
>  
> +#define MMF_LOCAL_REGION_USED	33
> +
>  #define MMF_INIT_LEGACY_MASK	(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
>  				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\
>  				 MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 68cf0109dde3c..ff075c74333fe 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1153,6 +1153,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  fail_nocontext:
>  	mm_free_id(mm);
>  fail_noid:
> +	WARN_ON_ONCE(mm_local_region_used(mm));
>  	mm_free_pgd(mm);
>  fail_nopgd:
>  	futex_hash_free(mm);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebd8ea353687e..2813059df9c1c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1319,6 +1319,10 @@ config SECRETMEM
>  	default y
>  	bool "Enable memfd_secret() system call" if EXPERT
>  	depends on ARCH_HAS_SET_DIRECT_MAP
> +	# Soft dependency, for optimisation.
> +	imply MM_LOCAL_REGION
> +	imply MERMAP
> +	imply PAGE_ALLOC_UNMAPPED
>  	help
>  	  Enable the memfd_secret() system call with the ability to create
>  	  memory areas visible only in the context of the owning process and
> @@ -1471,6 +1475,13 @@ config LAZY_MMU_MODE_KUNIT_TEST
>  
>  	  If unsure, say N.
>  
> +config ARCH_SUPPORTS_MM_LOCAL_REGION
> +	def_bool n
> +
> +config MM_LOCAL_REGION
> +	bool
> +	depends on ARCH_SUPPORTS_MM_LOCAL_REGION
> +
>  source "mm/damon/Kconfig"
>  
>  endmenu



