* [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
@ 2025-01-23 17:24 Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 1/8] x86/mm: Always allocate a whole page for PAE PGDs Dave Hansen
` (9 more replies)
0 siblings, 10 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
tl;dr: 32-bit PAE page table handling is a bit different when PTI
is on and off. Making the handling uniform removes a good amount
of code at the cost of not sharing kernel PMDs. The downside of
this simplification is bloating non-PTI PAE kernels by ~2 pages
per process.
Anyone who cares about security on 32-bit is running with PTI and
PAE because PAE has the No-eXecute page table bit. They are already
paying the 2-page penalty. Anyone who cares more about memory
footprint than security is probably already running a !PAE kernel
and will not be affected by this.
--
There are two 32-bit x86 hardware page table formats. A 2-level one
with 32-bit pte_t's and a 3-level one with 64-bit pte_t's called PAE.
But the PAE one is wonky. It effectively loses a bit of addressing
radix per level since its PTEs are twice as large. It makes up for
that by adding the third level, but with only 4 entries in the level.
This leads to all kinds of fun because this level only needs 32 bytes
instead of a whole page. Also, since it has only 4 entries in the top
level, the hardware just always caches the entire thing aggressively.
Modifying a PAE pgd_t ends up needing different rules than the
other x86 paging modes and probably every other architecture too.
PAE support got even weirder when Xen came along. Xen wants to trap
into the hypervisor on page table writes and so it protects the guest
page tables with paging protections. It can't protect a 32 byte
object with paging protections so it bloats the 32-byte object out
to a page. Xen also didn't support sharing kernel PMD pages. This
is mostly moot now because support for running as a 32-bit Xen guest
was ripped out, but there are still remnants around.
PAE also interacts with PTI in fun and exciting ways. Since pgd
updates are so fraught, the PTI PAE implementation just chose to
avoid pgd updates by preallocating all the PMDs up front since
there are only 4 instead of 512 or 1024 in the other x86 paging
modes.
Make PAE less weird:
* Always allocate a page for PAE PGDs. This brings them in line
with the other 2 paging modes. It was done for Xen and for
PTI already and nobody screamed, so just do it everywhere.
* Never share kernel PMD pages. This brings PAE in line with
32-bit !PAE and 64-bit.
* Always preallocate all PAE PMD pages. This basically makes
all PAE kernels behave like PTI ones. It might waste a page
of memory, but all 4 pages probably get allocated in the common
case anyway.
--
include/asm/pgtable-2level_types.h | 2
include/asm/pgtable-3level_types.h | 4 -
include/asm/pgtable_64_types.h | 2
mm/pat/set_memory.c | 2
mm/pgtable.c | 104 +++++--------------------------------
5 files changed, 18 insertions(+), 96 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC][PATCH 1/8] x86/mm: Always allocate a whole page for PAE PGDs
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 2/8] x86/mm: Always "broadcast" PMD setting operations Dave Hansen
` (8 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
A hardware PAE PGD is only 32 bytes. A PGD is PAGE_SIZE in the other
paging modes. But for reasons*, the kernel _sometimes_ allocates a
whole page even though it only ever uses 32 bytes.
Make PAE less weird. Just allocate a page like the other paging modes.
This was already being done for PTI (and Xen in the past) and nobody
screamed that loudly about it so it can't be that bad.
* The original reason for PAGE_SIZE allocations for the PAE PGDs was
Xen's need to detect page table writes. But 32-bit PTI forced it too
for reasons I'm unclear about.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pgtable.c | 63 +++---------------------------------------------
1 file changed, 4 insertions(+), 59 deletions(-)
diff -puN arch/x86/mm/pgtable.c~no-pae-kmem_cache arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~no-pae-kmem_cache 2025-01-23 09:20:50.202330636 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:50.206330781 -0800
@@ -357,69 +357,15 @@ static void pgd_prepopulate_user_pmd(str
{
}
#endif
-/*
- * Xen paravirt assumes pgd table should be in one page. 64 bit kernel also
- * assumes that pgd should be in one page.
- *
- * But kernel with PAE paging that is not running as a Xen domain
- * only needs to allocate 32 bytes for pgd instead of one page.
- */
-#ifdef CONFIG_X86_PAE
-
-#include <linux/slab.h>
-
-#define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t))
-#define PGD_ALIGN 32
-
-static struct kmem_cache *pgd_cache;
-
-void __init pgtable_cache_init(void)
-{
- /*
- * When PAE kernel is running as a Xen domain, it does not use
- * shared kernel pmd. And this requires a whole page for pgd.
- */
- if (!SHARED_KERNEL_PMD)
- return;
-
- /*
- * when PAE kernel is not running as a Xen domain, it uses
- * shared kernel pmd. Shared kernel pmd does not require a whole
- * page for pgd. We are able to just allocate a 32-byte for pgd.
- * During boot time, we create a 32-byte slab for pgd table allocation.
- */
- pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_ALIGN,
- SLAB_PANIC, NULL);
-}
static inline pgd_t *_pgd_alloc(void)
{
/*
- * If no SHARED_KERNEL_PMD, PAE kernel is running as a Xen domain.
- * We allocate one page for pgd.
+ * PTI and Xen need a whole page for the PAE PGD
+ * even though the hardware only needs 32 bytes.
+ *
+ * For simplicity, allocate a page for all users.
*/
- if (!SHARED_KERNEL_PMD)
- return (pgd_t *)__get_free_pages(GFP_PGTABLE_USER,
- PGD_ALLOCATION_ORDER);
-
- /*
- * Now PAE kernel is not running as a Xen domain. We can allocate
- * a 32-byte slab for pgd to save memory space.
- */
- return kmem_cache_alloc(pgd_cache, GFP_PGTABLE_USER);
-}
-
-static inline void _pgd_free(pgd_t *pgd)
-{
- if (!SHARED_KERNEL_PMD)
- free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
- else
- kmem_cache_free(pgd_cache, pgd);
-}
-#else
-
-static inline pgd_t *_pgd_alloc(void)
-{
return (pgd_t *)__get_free_pages(GFP_PGTABLE_USER,
PGD_ALLOCATION_ORDER);
}
@@ -428,7 +374,6 @@ static inline void _pgd_free(pgd_t *pgd)
{
free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
}
-#endif /* CONFIG_X86_PAE */
pgd_t *pgd_alloc(struct mm_struct *mm)
{
_
* [RFC][PATCH 2/8] x86/mm: Always "broadcast" PMD setting operations
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 1/8] x86/mm: Always allocate a whole page for PAE PGDs Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 3/8] x86/mm: Always tell core mm to sync kernel mappings Dave Hansen
` (7 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Kernel PMDs can either be shared across processes or private to a
process. On 64-bit, they are always shared. 32-bit non-PAE hardware
does not have PMDs, but the kernel logically squishes them into the
PGD and treats them as private. Here are the four cases:
64-bit: Shared
32-bit: non-PAE: Private
32-bit: PAE+ PTI: Private
32-bit: PAE+noPTI: Shared
Note that 32-bit is all "Private" except for PAE+noPTI being an
oddball. The 32-bit+PAE+noPTI case will be made like the rest of
32-bit shortly.
But until that can be done, temporarily treat the 32-bit+PAE+noPTI
case as Private. This will do unnecessary walks across pgd_list and
unnecessary PTE setting but should be otherwise harmless.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pat/set_memory.c | 2 +-
b/arch/x86/mm/pgtable.c | 11 +++--------
2 files changed, 4 insertions(+), 9 deletions(-)
diff -puN arch/x86/mm/pat/set_memory.c~always-sync-kernel-mapping-updates arch/x86/mm/pat/set_memory.c
--- a/arch/x86/mm/pat/set_memory.c~always-sync-kernel-mapping-updates 2025-01-23 09:20:50.674347666 -0800
+++ b/arch/x86/mm/pat/set_memory.c 2025-01-23 09:20:50.678347810 -0800
@@ -840,7 +840,7 @@ static void __set_pmd_pte(pte_t *kpte, u
/* change init_mm */
set_pte_atomic(kpte, pte);
#ifdef CONFIG_X86_32
- if (!SHARED_KERNEL_PMD) {
+ {
struct page *page;
list_for_each_entry(page, &pgd_list, lru) {
diff -puN arch/x86/mm/pgtable.c~always-sync-kernel-mapping-updates arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~always-sync-kernel-mapping-updates 2025-01-23 09:20:50.674347666 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:50.678347810 -0800
@@ -136,18 +136,13 @@ static void pgd_ctor(struct mm_struct *m
KERNEL_PGD_PTRS);
}
- /* list required to sync kernel mapping updates */
- if (!SHARED_KERNEL_PMD) {
- pgd_set_mm(pgd, mm);
- pgd_list_add(pgd);
- }
+ /* List used to sync kernel mapping updates */
+ pgd_set_mm(pgd, mm);
+ pgd_list_add(pgd);
}
static void pgd_dtor(pgd_t *pgd)
{
- if (SHARED_KERNEL_PMD)
- return;
-
spin_lock(&pgd_lock);
pgd_list_del(pgd);
spin_unlock(&pgd_lock);
_
* [RFC][PATCH 3/8] x86/mm: Always tell core mm to sync kernel mappings
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 1/8] x86/mm: Always allocate a whole page for PAE PGDs Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 2/8] x86/mm: Always "broadcast" PMD setting operations Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 4/8] x86/mm: Simplify PAE PGD sharing macros Dave Hansen
` (6 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Each mm_struct has its own copy of the page tables. When core mm code
makes changes to a copy of the page tables those changes sometimes
need to be synchronized with other mms' copies of the page tables. But
when this synchronization actually needs to happen is highly
architecture and configuration specific.
In cases where kernel PMDs are shared across processes
(SHARED_KERNEL_PMD) the core mm does not itself need to do that
synchronization for kernel PMD changes. The x86 code communicates
this by leaving the PGTBL_PMD_MODIFIED bit clear in those
configs to avoid expensive synchronization.
The kernel is moving toward never sharing kernel PMDs on 32-bit.
Prepare for that and make 32-bit PAE always set PGTBL_PMD_MODIFIED,
even if there is no modification to synchronize. This obviously adds
some synchronization overhead in cases where the kernel page tables
are being changed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/include/asm/pgtable-3level_types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN arch/x86/include/asm/pgtable-3level_types.h~always-set-ARCH_PAGE_TABLE_SYNC_MASK arch/x86/include/asm/pgtable-3level_types.h
--- a/arch/x86/include/asm/pgtable-3level_types.h~always-set-ARCH_PAGE_TABLE_SYNC_MASK 2025-01-23 09:20:51.158365128 -0800
+++ b/arch/x86/include/asm/pgtable-3level_types.h 2025-01-23 09:20:51.162365272 -0800
@@ -29,7 +29,7 @@ typedef union {
#define SHARED_KERNEL_PMD (!static_cpu_has(X86_FEATURE_PTI))
-#define ARCH_PAGE_TABLE_SYNC_MASK (SHARED_KERNEL_PMD ? 0 : PGTBL_PMD_MODIFIED)
+#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
_
* [RFC][PATCH 4/8] x86/mm: Simplify PAE PGD sharing macros
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (2 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 3/8] x86/mm: Always tell core mm to sync kernel mappings Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 5/8] x86/mm: Fix up comments around PMD preallocation Dave Hansen
` (5 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
There are a few too many levels of abstraction here.
First, just expand the PREALLOCATED_PMDS macro in place to make it
clear that it is only conditional on PTI.
Second, MAX_PREALLOCATED_PMDS is only used in one spot for an
on-stack allocation. It has a *maximum* value of 4. Do not bother
with the macro MAX() magic. Just set it to 4.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pgtable.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff -puN arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS 2025-01-23 09:20:51.626382013 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:51.630382157 -0800
@@ -107,12 +107,6 @@ static inline void pgd_list_del(pgd_t *p
list_del(&ptdesc->pt_list);
}
-#define UNSHARED_PTRS_PER_PGD \
- (SHARED_KERNEL_PMD ? KERNEL_PGD_BOUNDARY : PTRS_PER_PGD)
-#define MAX_UNSHARED_PTRS_PER_PGD \
- MAX_T(size_t, KERNEL_PGD_BOUNDARY, PTRS_PER_PGD)
-
-
static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
{
virt_to_ptdesc(pgd)->pt_mm = mm;
@@ -171,8 +165,9 @@ static void pgd_dtor(pgd_t *pgd)
* not shared between pagetables (!SHARED_KERNEL_PMDS), we allocate
* and initialize the kernel pmds here.
*/
-#define PREALLOCATED_PMDS UNSHARED_PTRS_PER_PGD
-#define MAX_PREALLOCATED_PMDS MAX_UNSHARED_PTRS_PER_PGD
+#define PREALLOCATED_PMDS (static_cpu_has(X86_FEATURE_PTI) ? \
+ PTRS_PER_PGD : KERNEL_PGD_BOUNDARY)
+#define MAX_PREALLOCATED_PMDS PTRS_PER_PGD
/*
* We allocate separate PMDs for the kernel part of the user page-table
_
* [RFC][PATCH 5/8] x86/mm: Fix up comments around PMD preallocation
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (3 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 4/8] x86/mm: Simplify PAE PGD sharing macros Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 6/8] x86/mm: Preallocate all PAE page tables Dave Hansen
` (4 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
The "paravirt environment" is no longer in the tree. Axe that part of the
comment. Also add a blurb to remind readers that "USER_PMDS" refers to
the PTI user *copy* of the page tables, not the user *portion*.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pgtable.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff -puN arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-2 arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-2 2025-01-23 09:20:52.098399043 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:52.102399187 -0800
@@ -160,16 +160,17 @@ static void pgd_dtor(pgd_t *pgd)
* processor notices the update. Since this is expensive, and
* all 4 top-level entries are used almost immediately in a
* new process's life, we just pre-populate them here.
- *
- * Also, if we're in a paravirt environment where the kernel pmd is
- * not shared between pagetables (!SHARED_KERNEL_PMDS), we allocate
- * and initialize the kernel pmds here.
*/
#define PREALLOCATED_PMDS (static_cpu_has(X86_FEATURE_PTI) ? \
PTRS_PER_PGD : KERNEL_PGD_BOUNDARY)
#define MAX_PREALLOCATED_PMDS PTRS_PER_PGD
/*
+ * "USER_PMDS" are the PMDs for the user copy of the page tables when
+ * PTI is enabled. They do not exist when PTI is disabled. Note that
+ * this is distinct from the user _portion_ of the kernel page tables
+ * which always exists.
+ *
* We allocate separate PMDs for the kernel part of the user page-table
* when PTI is enabled. We need them to map the per-process LDT into the
* user-space page-table.
_
* [RFC][PATCH 6/8] x86/mm: Preallocate all PAE page tables
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (4 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 5/8] x86/mm: Fix up comments around PMD preallocation Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 7/8] x86/mm: Remove duplicated PMD preallocation macro Dave Hansen
` (3 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
Finally, move away from having PAE kernels share any PMDs across
processes.
This was already the default on PTI kernels which are the common
case.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pgtable.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff -puN arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-3 arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-3 2025-01-23 09:20:52.562415784 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:52.566415928 -0800
@@ -119,16 +119,11 @@ struct mm_struct *pgd_page_get_mm(struct
static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
- /* If the pgd points to a shared pagetable level (either the
- ptes in non-PAE, or shared PMD in PAE), then just copy the
- references from swapper_pg_dir. */
- if (CONFIG_PGTABLE_LEVELS == 2 ||
- (CONFIG_PGTABLE_LEVELS == 3 && SHARED_KERNEL_PMD) ||
- CONFIG_PGTABLE_LEVELS >= 4) {
+ /* PAE preallocates all its PMDs. No cloning needed. */
+ if (!IS_ENABLED(CONFIG_X86_PAE))
clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
swapper_pg_dir + KERNEL_PGD_BOUNDARY,
KERNEL_PGD_PTRS);
- }
/* List used to sync kernel mapping updates */
pgd_set_mm(pgd, mm);
@@ -161,8 +156,7 @@ static void pgd_dtor(pgd_t *pgd)
* all 4 top-level entries are used almost immediately in a
* new process's life, we just pre-populate them here.
*/
-#define PREALLOCATED_PMDS (static_cpu_has(X86_FEATURE_PTI) ? \
- PTRS_PER_PGD : KERNEL_PGD_BOUNDARY)
+#define PREALLOCATED_PMDS PTRS_PER_PGD
#define MAX_PREALLOCATED_PMDS PTRS_PER_PGD
/*
_
* [RFC][PATCH 7/8] x86/mm: Remove duplicated PMD preallocation macro
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (5 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 6/8] x86/mm: Preallocate all PAE page tables Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 8/8] x86/mm: Remove now unused SHARED_KERNEL_PMD Dave Hansen
` (2 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
MAX_PREALLOCATED_PMDS and PREALLOCATED_PMDS are now identical. Just
use PREALLOCATED_PMDS and remove "MAX".
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/mm/pgtable.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff -puN arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-4 arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~simplify-PREALLOCATED_PMDS-4 2025-01-23 09:20:53.030432670 -0800
+++ b/arch/x86/mm/pgtable.c 2025-01-23 09:20:53.034432814 -0800
@@ -157,7 +157,6 @@ static void pgd_dtor(pgd_t *pgd)
* new process's life, we just pre-populate them here.
*/
#define PREALLOCATED_PMDS PTRS_PER_PGD
-#define MAX_PREALLOCATED_PMDS PTRS_PER_PGD
/*
* "USER_PMDS" are the PMDs for the user copy of the page tables when
@@ -193,7 +192,6 @@ void pud_populate(struct mm_struct *mm,
/* No need to prepopulate any pagetable entries in non-PAE modes. */
#define PREALLOCATED_PMDS 0
-#define MAX_PREALLOCATED_PMDS 0
#define PREALLOCATED_USER_PMDS 0
#define MAX_PREALLOCATED_USER_PMDS 0
#endif /* CONFIG_X86_PAE */
@@ -364,7 +362,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd;
pmd_t *u_pmds[MAX_PREALLOCATED_USER_PMDS];
- pmd_t *pmds[MAX_PREALLOCATED_PMDS];
+ pmd_t *pmds[PREALLOCATED_PMDS];
pgd = _pgd_alloc();
_
* [RFC][PATCH 8/8] x86/mm: Remove now unused SHARED_KERNEL_PMD
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (6 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 7/8] x86/mm: Remove duplicated PMD preallocation macro Dave Hansen
@ 2025-01-23 17:24 ` Dave Hansen
2025-01-23 21:49 ` [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Peter Zijlstra
2025-02-24 18:55 ` Ingo Molnar
9 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 17:24 UTC (permalink / raw)
To: linux-kernel
Cc: x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross, Dave Hansen
From: Dave Hansen <dave.hansen@linux.intel.com>
All the users of SHARED_KERNEL_PMD are gone. Zap it.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---
b/arch/x86/include/asm/pgtable-2level_types.h | 2 --
b/arch/x86/include/asm/pgtable-3level_types.h | 2 --
b/arch/x86/include/asm/pgtable_64_types.h | 2 --
3 files changed, 6 deletions(-)
diff -puN arch/x86/include/asm/pgtable-2level_types.h~zap-SHARED_KERNEL_PMD arch/x86/include/asm/pgtable-2level_types.h
--- a/arch/x86/include/asm/pgtable-2level_types.h~zap-SHARED_KERNEL_PMD 2025-01-23 09:20:53.502449700 -0800
+++ b/arch/x86/include/asm/pgtable-2level_types.h 2025-01-23 09:20:53.506449845 -0800
@@ -18,8 +18,6 @@ typedef union {
} pte_t;
#endif /* !__ASSEMBLY__ */
-#define SHARED_KERNEL_PMD 0
-
#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED
/*
diff -puN arch/x86/include/asm/pgtable-3level_types.h~zap-SHARED_KERNEL_PMD arch/x86/include/asm/pgtable-3level_types.h
--- a/arch/x86/include/asm/pgtable-3level_types.h~zap-SHARED_KERNEL_PMD 2025-01-23 09:20:53.502449700 -0800
+++ b/arch/x86/include/asm/pgtable-3level_types.h 2025-01-23 09:20:53.506449845 -0800
@@ -27,8 +27,6 @@ typedef union {
} pmd_t;
#endif /* !__ASSEMBLY__ */
-#define SHARED_KERNEL_PMD (!static_cpu_has(X86_FEATURE_PTI))
-
#define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PMD_MODIFIED
/*
diff -puN arch/x86/include/asm/pgtable_64_types.h~zap-SHARED_KERNEL_PMD arch/x86/include/asm/pgtable_64_types.h
--- a/arch/x86/include/asm/pgtable_64_types.h~zap-SHARED_KERNEL_PMD 2025-01-23 09:20:53.506449845 -0800
+++ b/arch/x86/include/asm/pgtable_64_types.h 2025-01-23 09:20:53.506449845 -0800
@@ -46,8 +46,6 @@ extern unsigned int ptrs_per_p4d;
#endif /* !__ASSEMBLY__ */
-#define SHARED_KERNEL_PMD 0
-
#ifdef CONFIG_X86_5LEVEL
/*
_
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (7 preceding siblings ...)
2025-01-23 17:24 ` [RFC][PATCH 8/8] x86/mm: Remove now unused SHARED_KERNEL_PMD Dave Hansen
@ 2025-01-23 21:49 ` Peter Zijlstra
2025-01-23 23:06 ` Dave Hansen
2025-02-24 18:55 ` Ingo Molnar
9 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-01-23 21:49 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, x86, tglx, bp, joro, luto, kirill.shutemov,
rick.p.edgecombe, jgross
On Thu, Jan 23, 2025 at 09:24:28AM -0800, Dave Hansen wrote:
> tl;dr: 32-bit PAE page table handling is a bit different when PTI
> is on and off. Making the handling uniform removes a good amount
> of code at the cost of not sharing kernel PMDs. The downside of
> this simplification is bloating non-PTI PAE kernels by ~2 pages
> per process.
>
> Anyone who cares about security on 32-bit is running with PTI and
> PAE because PAE has the No-eXecute page table bit. They are already
> paying the 2-page penalty. Anyone who cares more about memory
> footprint than security is probably already running a !PAE kernel
> and will not be affected by this.
The reality is that many of the mitigations we have are 64bit only.
32bit is known insecure. There is absolutely no point in using PTI on
32bit at all.
Can't we just rip it out?
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-23 21:49 ` [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Peter Zijlstra
@ 2025-01-23 23:06 ` Dave Hansen
2025-01-24 7:58 ` Joerg Roedel
2025-01-24 8:52 ` Peter Zijlstra
0 siblings, 2 replies; 16+ messages in thread
From: Dave Hansen @ 2025-01-23 23:06 UTC (permalink / raw)
To: Peter Zijlstra, Dave Hansen
Cc: linux-kernel, x86, tglx, bp, joro, luto, kirill.shutemov,
rick.p.edgecombe, jgross
On 1/23/25 13:49, Peter Zijlstra wrote:
> On Thu, Jan 23, 2025 at 09:24:28AM -0800, Dave Hansen wrote:
>> tl;dr: 32-bit PAE page table handling is a bit different when PTI
>> is on and off. Making the handling uniform removes a good amount
>> of code at the cost of not sharing kernel PMDs. The downside of
>> this simplification is bloating non-PTI PAE kernels by ~2 pages
>> per process.
>>
>> Anyone who cares about security on 32-bit is running with PTI and
>> PAE because PAE has the No-eXecute page table bit. They are already
>> paying the 2-page penalty. Anyone who cares more about memory
>> footprint than security is probably already running a !PAE kernel
>> and will not be affected by this.
>
> The reality is that many of the mitigations we have are 64bit only.
> 32bit is known insecure. There is absolutely no point in using PTI on
> 32bit at all.
>
> Can't we just rip it out?
32-bit+PTI or 32-bit in general? ;)
I'm curious what Joerg and the other folks that worked on 32-bit PTI
think about it in retrospect. The 32 vs. 64-bit security gap was
probably modest in 2018 and it can only have grown since then.
I definitely haven't seen a lot of 32-bit PTI bug reports.
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-23 23:06 ` Dave Hansen
@ 2025-01-24 7:58 ` Joerg Roedel
2025-01-24 19:12 ` Dave Hansen
2025-01-24 8:52 ` Peter Zijlstra
1 sibling, 1 reply; 16+ messages in thread
From: Joerg Roedel @ 2025-01-24 7:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Peter Zijlstra, Dave Hansen, linux-kernel, x86, tglx, bp, luto,
kirill.shutemov, rick.p.edgecombe, jgross
On Thu, Jan 23, 2025 at 03:06:26PM -0800, Dave Hansen wrote:
> 32-bit+PTI or 32-bit in general? ;)
+1 for removing x86-32 bit support altogether.
> I'm curious what Joerg and the other folks that worked on 32-bit PTI
> think about it in retrospect. The 32 vs. 64-bit security gap was
> probably modest in 2018 and it can only have grown since then.
I think the decision to keep and maintain 32-bit support only makes
sense if it can be kept on par with x86-64 security-wise; otherwise we
are lying to our users about the 'supported' part. Back in the day when
I did the 32-bit PTI support it made sense, but that was 7 years ago.
When was the last 32-bit only x86 CPU sold?
> I definitely haven't seen a lot of 32-bit PTI bug reports.
That's because the 32-bit PTI support is a well crafted piece of beauty,
which was merged with almost no bugs ;-)
Regards,
Joerg
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-23 23:06 ` Dave Hansen
2025-01-24 7:58 ` Joerg Roedel
@ 2025-01-24 8:52 ` Peter Zijlstra
1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-01-24 8:52 UTC (permalink / raw)
To: Dave Hansen
Cc: Dave Hansen, linux-kernel, x86, tglx, bp, joro, luto,
kirill.shutemov, rick.p.edgecombe, jgross
On Thu, Jan 23, 2025 at 03:06:26PM -0800, Dave Hansen wrote:
> On 1/23/25 13:49, Peter Zijlstra wrote:
> > Can't we just rip it out?
>
> 32-bit+PTI or 32-bit in general? ;)
Yes :-)
> I'm curious what Joerg and the other folks that worked on 32-bit PTI
> think about it in retrospect. The 32 vs. 64-bit security gap was
> probably modest in 2018 and it can only have grown since then.
>
> I definitely haven't seen a lot of 32-bit PTI bug reports.
3db03fb4995e x86/mm: Fix pti_clone_entry_text() for i386
41e71dbb0e0a x86/mm: Fix pti_clone_pgtable() alignment assumption
Those cost me a few gray hairs :-)
Anyway, 32bit PTI is 'solid'; it's just all the other speculation
mitigations that we've added to x86_64 only since then.
Even the retpoline crap on i386 is still vulnerable to the whole
funnel thing: it has the OG retpoline, but remains open to more
modern attacks that abuse the fact that all indirect jumps come
from only a single location.
So yes, we patched a few (early) holes on i386, but nobody should be
thinking i386 is 'secure' from all this speculation nonsense.
What's the point of having a few holes patched, if you're still bleeding
from a dozen others :/
So if we keep i386 around, it might just make sense to rip out all
speculation mitigations -- no point pretending.
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-24 7:58 ` Joerg Roedel
@ 2025-01-24 19:12 ` Dave Hansen
2025-01-28 8:13 ` Joerg Roedel
0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2025-01-24 19:12 UTC (permalink / raw)
To: Joerg Roedel
Cc: Peter Zijlstra, Dave Hansen, linux-kernel, x86, tglx, bp, luto,
kirill.shutemov, rick.p.edgecombe, jgross
On 1/23/25 23:58, Joerg Roedel wrote:
> On Thu, Jan 23, 2025 at 03:06:26PM -0800, Dave Hansen wrote:
>> 32-bit+PTI or 32-bit in general? ;)
>
> +1 for removing x86-32 bit support alltogether.
>
>> I'm curious what Joerg and the other folks that worked on 32-bit PTI
>> think about it in retrospect. The 32 vs. 64-bit security gap was
>> probably modest in 2018 and it can only have grown since then.
>
> I think the decision to keep and maintain 32-bit support only makes
> sense if it can be kept on-par with x86-64 security-wise, otherwise we
> are lying to our users about the 'supported' part. Back in the day when
> I did the 32-bit PTI support it made sense, but that was 7 years ago.
>
> When was the last 32-bit only x86 CPU sold?
Probably INTEL_QUARK_X1000, but it was mostly a toy. It's family 5, so
this applies:
static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
...
VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
Here's one that was released in 2015, probably just for embedded use:
> https://www.intel.com/content/www/us/en/products/sku/91947/intel-quark-microcontroller-d2000/specifications.html
There were some Atoms in 2008 that seem to have had the 64-bit support
fused off. They were probably the last normal CPU that someone would
have in a PC that didn't have 64-bit support.
> https://www.intel.com/content/www/us/en/ark/products/codename/24976/products-formerly-silverthorne.html
> https://www.intel.com/content/www/us/en/products/sku/36331/intel-atom-processor-n270-512k-cache-1-60-ghz-533-mhz-fsb/specifications.html
But these are also NO_SPECULATION, so don't need most of the mitigations.
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-24 19:12 ` Dave Hansen
@ 2025-01-28 8:13 ` Joerg Roedel
0 siblings, 0 replies; 16+ messages in thread
From: Joerg Roedel @ 2025-01-28 8:13 UTC (permalink / raw)
To: Dave Hansen
Cc: Peter Zijlstra, Dave Hansen, linux-kernel, x86, tglx, bp, luto,
kirill.shutemov, rick.p.edgecombe, jgross
On Fri, Jan 24, 2025 at 11:12:45AM -0800, Dave Hansen wrote:
> Probably INTEL_QUARK_X1000, but it was mostly a toy. It's family 5, so
> this applies:
>
> static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
> ...
> VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
>
> Here's one that was released in 2015, probably just for embedded use:
>
> > https://www.intel.com/content/www/us/en/products/sku/91947/intel-quark-microcontroller-d2000/specifications.html
>
> There were some Atoms in 2008 that seem to have had the 64-bit support
> fused off. They were probably the last normal CPU that someone would
> have in a PC that didn't have 64-bit support.
>
> > https://www.intel.com/content/www/us/en/ark/products/codename/24976/products-formerly-silverthorne.html
> > https://www.intel.com/content/www/us/en/products/sku/36331/intel-atom-processor-n270-512k-cache-1-60-ghz-533-mhz-fsb/specifications.html
>
> But these are also NO_SPECULATION, so don't need most of the mitigations.
So the last 32-bit-only CPUs were released roughly 10 years ago. Most of
these systems are likely out of service already, and those still in
service are unlikely to require a new kernel.
Imho it is time to deprecate 32-bit x86 support and set a hard removal
date of, say, end of 2026 or so. It can also be timed to happen after a
long-term stable kernel is released, to give the few people who still
want support some more time. This would send a clear message to
hardware vendors not to come up with any more of these parts.
Btw, I've just built an x86-32 defconfig kernel from the latest tree;
the text section alone is more than 14 MiB, and the vmlinux image is
more than 29 MiB. This is not really suitable for embedded environments.
Regards,
Joerg
* Re: [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
` (8 preceding siblings ...)
2025-01-23 21:49 ` [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Peter Zijlstra
@ 2025-02-24 18:55 ` Ingo Molnar
9 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-02-24 18:55 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, x86, tglx, bp, joro, luto, peterz, kirill.shutemov,
rick.p.edgecombe, jgross
* Dave Hansen <dave.hansen@linux.intel.com> wrote:
> tl;dr: 32-bit PAE page table handling is a bit different when PTI
> is on and off. Making the handling uniform removes a good amount
> of code at the cost of not sharing kernel PMDs. The downside of
> this simplification is bloating non-PTI PAE kernels by ~2 pages
> per process.
>
> Anyone who cares about security on 32-bit is running with PTI and
> PAE because PAE has the No-eXecute page table bit. They are already
> paying the 2-page penalty. Anyone who cares more about memory
> footprint than security is probably already running a !PAE kernel
> and will not be affected by this.
>
> --
>
> There are two 32-bit x86 hardware page table formats. A 2-level one
> with 32-bit pte_t's and a 3-level one with 64-bit pte_t's called PAE.
> But the PAE one is wonky. It effectively loses a bit of addressing
> radix per level since its PTEs are twice as large. It makes up for
> that by adding the third level, but with only 4 entries in the level.
>
> This leads to all kinds of fun because this level only needs 32 bytes
> instead of a whole page. Also, since it has only 4 entries in the top
> level, the hardware just always caches the entire thing aggressively.
> Modifying a PAE pgd_t ends up needing different rules than the
> other x86 paging modes and probably every other architecture too.
>
> PAE support got even weirder when Xen came along. Xen wants to trap
> into the hypervisor on page table writes and so it protects the guest
> page tables with paging protections. It can't protect a 32 byte
> object with paging protections so it bloats the 32-byte object out
> to a page. Xen also didn't support sharing kernel PMD pages. This
> is mostly moot now because the Xen support running as a 32-bit guest
> was ripped out, but there are still remnants around.
>
> PAE also interacts with PTI in fun and exciting ways. Since pgd
> updates are so fraught, the PTI PAE implementation just chose to
> avoid pgd updates by preallocating all the PMDs up front since
> there are only 4 instead of 512 or 1024 in the other x86 paging
> modes.
>
> Make PAE less weird:
> * Always allocate a page for PAE PGDs. This brings them in line
> with the other 2 paging modes. It was done for Xen and for
> PTI already and nobody screamed, so just do it everywhere.
> * Never share kernel PMD pages. This brings PAE in line with
> 32-bit !PAE and 64-bit.
> * Always preallocate all PAE PMD pages. This basically makes
> all PAE kernels behave like PTI ones. It might waste a page
> of memory, but all 4 pages probably get allocated in the common
> case anyway.
>
> --
>
> include/asm/pgtable-2level_types.h | 2
> include/asm/pgtable-3level_types.h | 4 -
> include/asm/pgtable_64_types.h | 2
> mm/pat/set_memory.c | 2
> mm/pgtable.c | 104 +++++--------------------------------
> 5 files changed, 18 insertions(+), 96 deletions(-)
The diffstat alone is pretty nice, so I'd suggest we pursue this series
even if continued work on 32-bit kernel features is being questioned.
As long as the code exists and isn't explicitly marked as obsolete,
such changes are legitimate.
Thanks,
Ingo
end of thread, other threads:[~2025-02-24 18:55 UTC | newest]
Thread overview: 16+ messages
2025-01-23 17:24 [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 1/8] x86/mm: Always allocate a whole page for PAE PGDs Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 2/8] x86/mm: Always "broadcast" PMD setting operations Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 3/8] x86/mm: Always tell core mm to sync kernel mappings Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 4/8] x86/mm: Simplify PAE PGD sharing macros Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 5/8] x86/mm: Fix up comments around PMD preallocation Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 6/8] x86/mm: Preallocate all PAE page tables Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 7/8] x86/mm: Remove duplicated PMD preallocation macro Dave Hansen
2025-01-23 17:24 ` [RFC][PATCH 8/8] x86/mm: Remove now unused SHARED_KERNEL_PMD Dave Hansen
2025-01-23 21:49 ` [RFC][PATCH 0/8] x86/mm: Simplify PAE page table handling Peter Zijlstra
2025-01-23 23:06 ` Dave Hansen
2025-01-24 7:58 ` Joerg Roedel
2025-01-24 19:12 ` Dave Hansen
2025-01-28 8:13 ` Joerg Roedel
2025-01-24 8:52 ` Peter Zijlstra
2025-02-24 18:55 ` Ingo Molnar