Linux-mm Archive on lore.kernel.org
* [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
@ 2026-06-29 21:47 Xiang Mei
  2026-06-29 22:29 ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Xiang Mei @ 2026-06-29 21:47 UTC
  To: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening
  Cc: Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili, Xiang Mei

With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
which an unprivileged user can surround with attacker-controlled data by
spraying vmap allocations adjacent to a target stack (for example via
XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
guarded vmalloc allocation is followed by a single unmapped guard page.

A single guard page is not enough to contain the x86_64 ENTER
instruction used as a one-instruction stack pivot. ENTER imm16, imm8
builds a stack frame and lowers RSP by:

	imm16 + 8 * (L + 1),  L = imm8 & 0x1f

imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is
in [0, 31], so the maximum displacement of a single ENTER is:

	0xffff + 8 * 0x20 = 0x100ff bytes

That is more than enough to step off the current stack, across the
one-page guard, and into the adjacent sprayed pages. When those pages
contain a return sled feeding a ROP chain, reaching any ENTER gadget
(opcode 0xc8, abundant as both intended and unintended gadgets) turns a
control-flow hijack into full ROP execution without any register control
at the hijack site, making it a one-gadget-style primitive that
significantly eases exploitation. The pivot happens after the control
transfer, so it is not constrained by CFI (kCFI/FineIBT).

Widen the guard region from one page to VMAP_GUARD_PAGES (0x11 pages,
0x11000 bytes), which is the smallest whole-page span exceeding the
0x100ff-byte maximum single-ENTER pivot. A pivot off the top of the
stack now lands in the unmapped guard and faults, instead of in mapped,
attacker-controlled memory. RANDOMIZE_KSTACK_OFFSET only perturbs RSP by
a sub-page amount, so it does not change the required width.

Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
change for current architectures) and can be overridden per arch via
asm/vmalloc.h, and set it to 0x11 on x86_64. This is deliberately scoped
to x86_64: the 0x100ff bound is a property of the ENTER opcode, and ENTER
is also a one-byte opcode (0xc8) that appears as abundant unintended
gadgets. Other architectures (e.g. arm64) have no equivalent
single-instruction, immediate-controlled pivot reachable as an unaligned
unintended gadget, so they keep the one-page guard and pay no cost.

The override is gated on CONFIG_X86_64 rather than applying to all of x86:
VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are not in
the vmalloc area and the technique does not apply there. 32-bit x86 also
has a far smaller vmalloc window, where widening every guarded area by 16
pages would needlessly pressure the address space.

The guard pages are never populated, so there is no extra physical
memory and no additional page-table population beyond the larger virtual
span; the cost is virtual address space and vmap_area bookkeeping, which
is negligible against the 64-bit vmalloc window. get_vm_area_size() is
adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
callers is unchanged.

On x86 this widens the guard for all guarded vmap areas, not only thread
stacks. ret2enter targets the stack specifically, so a narrower
alternative is to apply the wider guard only on the thread-stack
allocation path via a dedicated VM_ flag; we kept the change in the
common path as defense in depth for any vmalloc-adjacent pivot target,
but are happy to scope it to stacks if maintainers prefer.

While widening the guard, also mark percpu vmap areas VM_NO_GUARD.
pcpu_get_vm_areas() and pcpu_page_first_chunk() size each area exactly and
reserve no guard, so get_vm_area_size() would subtract a guard that was
never added and underflow if an area were smaller than the guard. This is
a latent correctness fix only: on x86_64 percpu areas are megabyte-scale,
far larger than the guard.

Signed-off-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Jennifer Miller <jmill@asu.edu>
---
v2: VM_NO_GUARD for percpu vmap areas

 arch/x86/include/asm/vmalloc.h | 21 +++++++++++++++++++++
 include/linux/vmalloc.h        | 16 ++++++++++++++--
 mm/percpu.c                    |  2 +-
 mm/vmalloc.c                   |  4 ++--
 4 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 49ce331f3ac6..2c341f398227 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -5,6 +5,27 @@
 #include <asm/page.h>
 #include <asm/pgtable_areas.h>
 
+/*
+ * The x86 ENTER instruction can be used as a one-instruction stack pivot:
+ * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
+ * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
+ * [0, 31], so a single ENTER can lower RSP by at most
+ * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
+ * stack lives in the vmalloc area, where an unprivileged user can spray
+ * adjacent allocations; a single-page guard is too small to contain such a
+ * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
+ * span exceeding 0x100ff, so the pivot faults in the guard instead of
+ * landing in attacker-controlled memory.
+ *
+ * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so 32-bit
+ * kernel stacks are not in the vmalloc area and the technique does not apply.
+ * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
+ * widening would needlessly pressure the address space.
+ */
+#ifdef CONFIG_X86_64
+#define VMAP_GUARD_PAGES	0x11
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 
 #ifdef CONFIG_X86_64
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..b8546e519deb 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -49,6 +49,18 @@ struct iov_iter;		/* in uio.h */
 #define IOREMAP_MAX_ORDER	(7 + PAGE_SHIFT)	/* 128 pages */
 #endif
 
+/*
+ * Number of unmapped guard pages appended to each guarded vmalloc
+ * allocation. The default is a single page; an architecture may override
+ * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
+ * contain a worst-case single-instruction stack pivot into an adjacent,
+ * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
+ */
+#ifndef VMAP_GUARD_PAGES
+#define VMAP_GUARD_PAGES	1
+#endif
+#define VMAP_GUARD_SIZE		(VMAP_GUARD_PAGES * PAGE_SIZE)
+
 struct vm_struct {
 	union {
 		struct vm_struct *next;	  /* Early registration of vm_areas. */
@@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
 static inline size_t get_vm_area_size(const struct vm_struct *area)
 {
 	if (!(area->flags & VM_NO_GUARD))
-		/* return actual size without guard page */
-		return area->size - PAGE_SIZE;
+		/* return actual size without guard region */
+		return area->size - VMAP_GUARD_SIZE;
 	else
 		return area->size;
 
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..9f7262228be1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3243,7 +3243,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 	}
 
 	/* allocate vm area, map the pages and copy static data */
-	vm.flags = VM_ALLOC;
+	vm.flags = VM_ALLOC | VM_NO_GUARD;
 	vm.size = num_possible_cpus() * ai->unit_size;
 	vm_area_register_early(&vm, PAGE_SIZE);
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bb6ae08d18f5..fc435054f640 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3217,7 +3217,7 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
 		return NULL;
 
 	if (!(flags & VM_NO_GUARD))
-		size += PAGE_SIZE;
+		size += VMAP_GUARD_SIZE;
 
 	area->flags = flags;
 	area->caller = caller;
@@ -5027,7 +5027,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 		spin_lock(&vn->busy.lock);
 		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
-		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
+		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC | VM_NO_GUARD,
 				 pcpu_get_vm_areas);
 		spin_unlock(&vn->busy.lock);
 	}
-- 
2.43.0




* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-29 21:47 [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot Xiang Mei
@ 2026-06-29 22:29 ` Dave Hansen
  2026-06-29 23:28   ` Xiang Mei
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2026-06-29 22:29 UTC
  To: Xiang Mei, Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening
  Cc: Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On 6/29/26 14:47, Xiang Mei wrote:
> With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> which an unprivileged user can surround with attacker-controlled data by
> spraying vmap allocations adjacent to a target stack (for example via
> XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> guarded vmalloc allocation is followed by a single unmapped guard page.
> 
> A single guard page is not enough to contain the x86_64 ENTER
> instruction used as a one-instruction stack pivot. ENTER imm16, imm8
> builds a stack frame and lowers RSP by:
> 
> 	imm16 + 8 * (L + 1),  L = imm8 & 0x1f
> 
> imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is
> in [0, 31], so the maximum displacement of a single ENTER is:
> 
> 	0xffff + 8 * 0x20 = 0x100ff bytes

This needs some more discussion of why this _specific_ instruction is so
important and why not a good old 'add'. Peter asked about this on v1 and it
didn't make it into v2.

I think it boils down to ENTER doing a bunch of useful stuff setting up
a new frame in a single instruction. That single instruction is easier
to conjure up from another exploit or bad control flow than actually
setting up a stack frame.

But, really, if ENTER is so evil and nobody uses it, shouldn't we just
have an MSR bit somewhere to tell the CPU to #UD for it rather than
playing these stack games?

> That is more than enough to step off the current stack, across the
> one-page guard, and into the adjacent sprayed pages. When those pages
> contain a return sled feeding a ROP chain, reaching any ENTER gadget
> (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
> control-flow hijack into full ROP execution without any register control
> at the hijack site, making it a one-gadget-style primitive that
> significantly eases exploitation. The pivot happens after the control
> transfer, so it is not constrained by CFI (kCFI/FineIBT).

This all sounds super theoretical.

I don't think we should mess with any of this without there being some
sign that this is an actual, practical juicy exploit target.

> Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
> change for current architectures) and can be overridden per arch via
> asm/vmalloc.h, and set it to 0x11 on x86_64. This is deliberately scoped
> to x86_64: the 0x100ff bound is a property of the ENTER opcode, and ENTER
> is also a one-byte opcode (0xc8) that appears as abundant unintended
> gadgets. Other architectures (e.g. arm64) have no equivalent
> single-instruction, immediate-controlled pivot reachable as an unaligned
> unintended gadget, so they keep the one-page guard and pay no cost.

To even be considered, this series needs to be refactored properly.
Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.

> The override is gated on CONFIG_X86_64 rather than applying to all of x86:
> VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are not in
> the vmalloc area and the technique does not apply there. 32-bit x86 also
> has a far smaller vmalloc window, where widening every guarded area by 16
> pages would needlessly pressure the address space.

Shouldn't you condition it on HAVE_ARCH_VMAP_STACK, not X86_64 directly?

> The guard pages are never populated, so there is no extra physical
> memory and no additional page-table population beyond the larger virtual
> span; the cost is virtual address space and vmap_area bookkeeping, which
> is negligible against the 64-bit vmalloc window. get_vm_area_size() is
> adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
> callers is unchanged.

Let's be thorough here, though, please. You're arguing that there's no
real cost to this. It's going to make the vmalloc() address space more
sparse and put pressure on the intermediate paging structure caches.
Whether that pressure matters is debatable.

But I do think you owe at least some rudimentary performance checks on this.

BTW, this is LLM-wordy. If you send another version of this, please work
on making it more concise.

> On x86 this widens the guard for all guarded vmap areas, not only thread
> stacks. ret2enter targets the stack specifically, so a narrower
> alternative is to apply the wider guard only on the thread-stack
> allocation path via a dedicated VM_ flag; we kept the change in the
> common path as defense in depth for any vmalloc-adjacent pivot target,
> but are happy to scope it to stacks if maintainers prefer.

The simplest code thing for now is to just make it apply to all
vmalloc() allocations. That also theoretically has the largest impact,
but it's probably the best patch to start with.

> While widening the guard, also mark percpu vmap areas VM_NO_GUARD.
> pcpu_get_vm_areas() and pcpu_page_first_chunk() size each area exactly and
> reserve no guard, so get_vm_area_size() would subtract a guard that was
> never added and underflow if an area were smaller than the guard. This is
> a latent correctness fix only: on x86_64 percpu areas are megabyte-scale,
> far larger than the guard.

Honestly, I think this is just a sign that the code needs refactoring
rather than hacks.

If you go forward with this, I think vm_struct just needs a
area->guard_nr_pages. Then the internal users of the structure just set
area->size and the guard size. They don't have to fiddle with VM_NO_GUARD.

> +/*
> + * The x86 ENTER instruction can be used as a one-instruction stack pivot:
> + * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
> + * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
> + * [0, 31], so a single ENTER can lower RSP by at most
> + * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
> + * stack lives in the vmalloc area, where an unprivileged user can spray
> + * adjacent allocations; a single-page guard is too small to contain such a
> + * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
> + * span exceeding 0x100ff, so the pivot faults in the guard instead of
> + * landing in attacker-controlled memory.
> + *
> + * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so 32-bit
> + * kernel stacks are not in the vmalloc area and the technique does not apply.
> + * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
> + * widening would needlessly pressure the address space.
> + */
> +#ifdef CONFIG_X86_64
> +#define VMAP_GUARD_PAGES	0x11
> +#endif

That comment is way too big. What is it protecting? What does the number
come from? We don't need to see the gory details in the comment.

/*
 * Protect against control flow hijacks to gadgets using the ENTER
 * instruction. Those can jump a bit over 64k on the stack so make the
 * guard 64k+4k.
 */
#ifdef CONFIG_VMAP_STACK
#define VMAP_GUARD_PAGES	0x11
#endif

Right? What else do you really need?

>  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>  
>  #ifdef CONFIG_X86_64
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 3b02c0c6b371..b8546e519deb 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -49,6 +49,18 @@ struct iov_iter;		/* in uio.h */
>  #define IOREMAP_MAX_ORDER	(7 + PAGE_SHIFT)	/* 128 pages */
>  #endif
>  
> +/*
> + * Number of unmapped guard pages appended to each guarded vmalloc
> + * allocation. The default is a single page; an architecture may override
> + * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
> + * contain a worst-case single-instruction stack pivot into an adjacent,
> + * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
> + */

Heh, are you getting paid by the word here? These are way too verbose.

> +#ifndef VMAP_GUARD_PAGES
> +#define VMAP_GUARD_PAGES	1
> +#endif
> +#define VMAP_GUARD_SIZE		(VMAP_GUARD_PAGES * PAGE_SIZE)

This could also be quite trivially expressed in Kconfig:

config VMAP_GUARD_PAGES
	int
	default ARCH_VMAP_GUARD_PAGES if ARCH_VMAP_GUARD_PAGES
	default 1

>  struct vm_struct {
>  	union {
>  		struct vm_struct *next;	  /* Early registration of vm_areas. */
> @@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
>  static inline size_t get_vm_area_size(const struct vm_struct *area)
>  {
>  	if (!(area->flags & VM_NO_GUARD))
> -		/* return actual size without guard page */
> -		return area->size - PAGE_SIZE;
> +		/* return actual size without guard region */
> +		return area->size - VMAP_GUARD_SIZE;
>  	else
>  		return area->size;
>  
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b0676b8054ed..9f7262228be1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3243,7 +3243,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
>  	}
>  
>  	/* allocate vm area, map the pages and copy static data */
> -	vm.flags = VM_ALLOC;
> +	vm.flags = VM_ALLOC | VM_NO_GUARD;
>  	vm.size = num_possible_cpus() * ai->unit_size;
>  	vm_area_register_early(&vm, PAGE_SIZE);

Yeah, I'd much rather see:

	vm.size  = num_possible_cpus() * ai->unit_size;
	vm.guard = 0;

(or whatever we name the structure member) in cases like this.

So, yeah, this is a cute PoC hack. But it's gluing about 10 different
things into one patch instead of doing proper refactoring. Plus, I'm not
really even sure it's worth it in the first place.



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-29 22:29 ` Dave Hansen
@ 2026-06-29 23:28   ` Xiang Mei
  2026-06-29 23:37     ` Dave Hansen
  2026-06-30 14:40     ` Pedro Falcato
  0 siblings, 2 replies; 15+ messages in thread
From: Xiang Mei @ 2026-06-29 23:28 UTC
  To: Dave Hansen
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On Mon, Jun 29, 2026 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/29/26 14:47, Xiang Mei wrote:
> > With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> > which an unprivileged user can surround with attacker-controlled data by
> > spraying vmap allocations adjacent to a target stack (for example via
> > XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> > guarded vmalloc allocation is followed by a single unmapped guard page.
> >
> > A single guard page is not enough to contain the x86_64 ENTER
> > instruction used as a one-instruction stack pivot. ENTER imm16, imm8
> > builds a stack frame and lowers RSP by:
> >
> >       imm16 + 8 * (L + 1),  L = imm8 & 0x1f
> >
> > imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is
> > in [0, 31], so the maximum displacement of a single ENTER is:
> >
> >       0xffff + 8 * 0x20 = 0x100ff bytes
>
> This needs some more discussion of why this _specific_ instruction is so
> important and why not a good old 'add'. Peter asked about this on v1 and it
> didn't make it into v2.
>

Thanks, I see it; we can also add more background introduction to the
commit message so readers can understand why ENTER is so special.

> I think it boils down to ENTER doing a bunch of useful stuff setting up
> a new frame in a single instruction. That single instruction is easier
> to conjure up from another exploit or bad control flow than actually
> setting up a stack frame.
>
> But, really, if ENTER is so evil and nobody uses it, shouldn't we just
> have an MSR bit somewhere to tell the CPU to #UD for it rather than
> playing these stack games?
>

I totally agree, and we'd like to see such a control. If I understand
correctly, that's a hardware/architectural change, so for current CPUs
we still need a software mitigation to harden the kernel.

> > That is more than enough to step off the current stack, across the
> > one-page guard, and into the adjacent sprayed pages. When those pages
> > contain a return sled feeding a ROP chain, reaching any ENTER gadget
> > (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
> > control-flow hijack into full ROP execution without any register control
> > at the hijack site, making it a one-gadget-style primitive that
> > significantly eases exploitation. The pivot happens after the control
> > transfer, so it is not constrained by CFI (kCFI/FineIBT).
>
> This all sounds super theoretical.
>
> I don't think we should mess with any of this without there being some
> sign that this is an actual, practical juicy exploit target.
>
Yes, I am sorry for reusing some incorrect comments that I copied from v1.
I'll remove the CFI-related content, since we assume control-flow
hijacking has already been achieved.
Thanks for pointing it out.

> > Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
> > change for current architectures) and can be overridden per arch via
> > asm/vmalloc.h, and set it to 0x11 on x86_64. This is deliberately scoped
> > to x86_64: the 0x100ff bound is a property of the ENTER opcode, and ENTER
> > is also a one-byte opcode (0xc8) that appears as abundant unintended
> > gadgets. Other architectures (e.g. arm64) have no equivalent
> > single-instruction, immediate-controlled pivot reachable as an unaligned
> > unintended gadget, so they keep the one-page guard and pay no cost.
>
> To even be considered, this series needs to be refactored properly.
> Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
>
Good suggestion, I will do it in v3:

    1/3 - introduce VMAP_GUARD_PAGES
    2/3 - mark percpu vmap areas VM_NO_GUARD
    3/3 - set VMAP_GUARD_PAGES to 0x11 on x86_64

> > The override is gated on CONFIG_X86_64 rather than applying to all of x86:
> > VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are not in
> > the vmalloc area and the technique does not apply there. 32-bit x86 also
> > has a far smaller vmalloc window, where widening every guarded area by 16
> > pages would needlessly pressure the address space.
>
> Shouldn't you condition it on HAVE_ARCH_VMAP_STACK, not X86_64 directly?
>

Yup, that's better; we'll do the following in v3:
```
#if defined(CONFIG_X86_64) && defined(CONFIG_VMAP_STACK)
#define VMAP_GUARD_PAGES 0x11
#endif
```

> > The guard pages are never populated, so there is no extra physical
> > memory and no additional page-table population beyond the larger virtual
> > span; the cost is virtual address space and vmap_area bookkeeping, which
> > is negligible against the 64-bit vmalloc window. get_vm_area_size() is
> > adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
> > callers is unchanged.
>
> Let's be thorough here, though, please. You're arguing that there's no
> real cost to this. It's going to make the vmalloc() address space more
> sparse and put pressure on the intermediate paging structure caches.
> Whether that pressure matters is debatable.
>
> But I do think you owe at least some rudimentary performance checks on this.
>

I'll run some cost tests before sending v3 and correct the statement.

> BTW, this is LLM-wordy. If you send another version of this, please work
> on making it more concise.
>
The v3 comments will be concise.

> > On x86 this widens the guard for all guarded vmap areas, not only thread
> > stacks. ret2enter targets the stack specifically, so a narrower
> > alternative is to apply the wider guard only on the thread-stack
> > allocation path via a dedicated VM_ flag; we kept the change in the
> > common path as defense in depth for any vmalloc-adjacent pivot target,
> > but are happy to scope it to stacks if maintainers prefer.
>
> The simplest code thing for now is to just make it apply to all
> vmalloc() allocations. That also theoretically has the largest impact,
> but it's probably the best patch to start with.
>
> > While widening the guard, also mark percpu vmap areas VM_NO_GUARD.
> > pcpu_get_vm_areas() and pcpu_page_first_chunk() size each area exactly and
> > reserve no guard, so get_vm_area_size() would subtract a guard that was
> > never added and underflow if an area were smaller than the guard. This is
> > a latent correctness fix only: on x86_64 percpu areas are megabyte-scale,
> > far larger than the guard.
>
> Honestly, I think this is just a sign that the code needs refactoring
> rather than hacks.
>
> If you go forward with this, I think vm_struct just needs a
> area->guard_nr_pages. Then the internal users of the structure just set
> area->size and the guard size. They don't have to fiddle with VM_NO_GUARD.
>
Agreed, the problem existed before our patch (get_vm_area_size() is
already one page short for these areas).
> > +/*
> > + * The x86 ENTER instruction can be used as a one-instruction stack pivot:
> > + * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
> > + * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
> > + * [0, 31], so a single ENTER can lower RSP by at most
> > + * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
> > + * stack lives in the vmalloc area, where an unprivileged user can spray
> > + * adjacent allocations; a single-page guard is too small to contain such a
> > + * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
> > + * span exceeding 0x100ff, so the pivot faults in the guard instead of
> > + * landing in attacker-controlled memory.
> > + *
> > + * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so 32-bit
> > + * kernel stacks are not in the vmalloc area and the technique does not apply.
> > + * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
> > + * widening would needlessly pressure the address space.
> > + */
> > +#ifdef CONFIG_X86_64
> > +#define VMAP_GUARD_PAGES     0x11
> > +#endif
>
> That comment is way too big. What is it protecting? What does the number
> come from? We don't need to see the gory details in the comment.
>
> /*
>  * Protect against control flow hijacks to gadgets using the ENTER
>  * instruction. Those can jump a bit over 64k on the stack so make the
>  * guard 64k+4k.
>  */
> #ifdef CONFIG_VMAP_STACK
> #define VMAP_GUARD_PAGES        0x11
> #endif
>
> Right? What else do you really need?
>
Got it, simple comments and thanks for the example.

> >  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> >
> >  #ifdef CONFIG_X86_64
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index 3b02c0c6b371..b8546e519deb 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -49,6 +49,18 @@ struct iov_iter;           /* in uio.h */
> >  #define IOREMAP_MAX_ORDER    (7 + PAGE_SHIFT)        /* 128 pages */
> >  #endif
> >
> > +/*
> > + * Number of unmapped guard pages appended to each guarded vmalloc
> > + * allocation. The default is a single page; an architecture may override
> > + * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
> > + * contain a worst-case single-instruction stack pivot into an adjacent,
> > + * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
> > + */
>
> Heh, are you getting paid by the word here? These are way too verbose.
>
I'll clean them up in v3.
> > +#ifndef VMAP_GUARD_PAGES
> > +#define VMAP_GUARD_PAGES     1
> > +#endif
> > +#define VMAP_GUARD_SIZE              (VMAP_GUARD_PAGES * PAGE_SIZE)
>
> This could also be quite trivially expressed in Kconfig:
>
> config VMAP_GUARD_PAGES
>         int
>         default ARCH_VMAP_GUARD_PAGES if ARCH_VMAP_GUARD_PAGES
>         default 1
>
Good suggestion, it'll be in v3.
> >  struct vm_struct {
> >       union {
> >               struct vm_struct *next;   /* Early registration of vm_areas. */
> > @@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
> >  static inline size_t get_vm_area_size(const struct vm_struct *area)
> >  {
> >       if (!(area->flags & VM_NO_GUARD))
> > -             /* return actual size without guard page */
> > -             return area->size - PAGE_SIZE;
> > +             /* return actual size without guard region */
> > +             return area->size - VMAP_GUARD_SIZE;
> >       else
> >               return area->size;
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index b0676b8054ed..9f7262228be1 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -3243,7 +3243,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
> >       }
> >
> >       /* allocate vm area, map the pages and copy static data */
> > -     vm.flags = VM_ALLOC;
> > +     vm.flags = VM_ALLOC | VM_NO_GUARD;
> >       vm.size = num_possible_cpus() * ai->unit_size;
> >       vm_area_register_early(&vm, PAGE_SIZE);
>
> Yeah, I'd much rather see:
>
>         vm.size  = num_possible_cpus() * ai->unit_size;
>         vm.guard = 0;
>
> (or whatever we name the structure member) in cases like this.
>
> So, yeah, this is a cute PoC hack. But it's gluing about 10 different
> things into one patch instead of doing proper refactoring. Plus, I'm not
> really even sure it's worth it in the first place.

Yeah, I'll separate them into a series of patches. I believe the patch is
worthwhile because a generic one-gadget-style pivot is a huge benefit to
an attacker. We have tested this one-gadget-style stack pivot on real CVEs
and got the exploitation working without additional changes.

Thanks for your review. I'll do the benchmarking and deliver v3.

Xiang



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-29 23:28   ` Xiang Mei
@ 2026-06-29 23:37     ` Dave Hansen
  2026-06-30  1:22       ` Xiang Mei
  2026-06-30 14:40     ` Pedro Falcato
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2026-06-29 23:37 UTC
  To: Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On 6/29/26 16:28, Xiang Mei wrote:
>>> That is more than enough to step off the current stack, across the
>>> one-page guard, and into the adjacent sprayed pages. When those pages
>>> contain a return sled feeding a ROP chain, reaching any ENTER gadget
>>> (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
>>> control-flow hijack into full ROP execution without any register control
>>> at the hijack site, making it a one-gadget-style primitive that
>>> significantly eases exploitation. The pivot happens after the control
>>> transfer, so it is not constrained by CFI (kCFI/FineIBT).
>> This all sounds super theoretical.
>>
>> I don't think we should mess with any of this without there being some
>> sign that this is an actual, practical juicy exploit target.
>>
> Yes, I am sorry to reuse some incorrect comments I copied from v1.
> I'll remove the CFI-related content since we assume we already have
> control flow hijacking.

I think you missed the main point: this all sounds *SUPER* theoretical.
In other words, no real attacker would ever need to use ENTER like this.
Only make-believe attackers in imaginary academic papers. Those imagined
attackers' only goal is to help mint PhDs.

Upstream, we're concerned with practical attacks, not theoretical ones.

You've done virtually nothing here to show that this is a practical
attack that someone might use in the real world, outside of the
PhD-minting industry.

Please don't even try to send a v3 without addressing this.



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-29 23:37     ` Dave Hansen
@ 2026-06-30  1:22       ` Xiang Mei
  2026-06-30 14:01         ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Xiang Mei @ 2026-06-30  1:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On Mon, Jun 29, 2026 at 4:37 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/29/26 16:28, Xiang Mei wrote:
> >>> That is more than enough to step off the current stack, across the
> >>> one-page guard, and into the adjacent sprayed pages. When those pages
> >>> contain a return sled feeding a ROP chain, reaching any ENTER gadget
> >>> (opcode 0xc8, abundant as both intended and unintended gadgets) turns a
> >>> control-flow hijack into full ROP execution without any register control
> >>> at the hijack site, making it a one-gadget-style primitive that
> >>> significantly eases exploitation. The pivot happens after the control
> >>> transfer, so it is not constrained by CFI (kCFI/FineIBT).
> >> This all sounds super theoretical.
> >>
> >> I don't think we should mess with any of this without there being some
> >> sign that this is an actual, practical juicy exploit target.
> >>
> > Yes, I am sorry to reuse some incorrect comments I copied from v1.
> > I'll remove the CFI-related content since we assume we already have
> > control flow hijacking.
>
> I think you missed the main point: this all sounds *SUPER* theoretical.
> In other words, no real attacker would ever need to use ENTER like this.
> Only make-believe attackers in imaginary academic papers. Those imagined
> attackers' only goal is to help mint PhDs.
>
> Upstream, we're concerned with practical attacks, not theoretical ones.
>
> You've done virtually nothing here to show that this is a practical
> attack that someone might use in the real world, outside of the
> PhD-minting industry.
>
> Please don't even try to send a v3 without addressing this.
This is a demo exploiting CVE-2026-31419 with this technique:
https://github.com/google/security-research/pull/397

I have no comment on your PhD-minting story. Let's keep this issue
free from personal stuff.
I would like to demonstrate to you that this technique is practical. Please
tell me what you need to see to be convinced that it is practical.


Thanks,
Xiang



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30  1:22       ` Xiang Mei
@ 2026-06-30 14:01         ` Dave Hansen
  2026-06-30 14:58           ` Pedro Falcato
  2026-06-30 22:02           ` Xiang Mei
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Hansen @ 2026-06-30 14:01 UTC (permalink / raw)
  To: Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On 6/29/26 18:22, Xiang Mei wrote:
>> Please don't even try to send a v3 without addressing this.
> This is a demo exploiting CVE-2026-31419 with this technique:
> https://github.com/google/security-research/pull/397

Thanks for sharing that. That's really good info.

But what I want to hear a bit more about is why this new guard region is
a good, generic mitigation. Does it help mitigate a whole class of
vulnerabilities?

I think you're making the claim that this ENTER technique takes what
would normally just be a DoS and makes it fully exploitable. Does this
happen for a lot of DoS bugs? Or is CVE-2026-31419 very unusual and this
stack guard gunk won't ever be useful again?



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-29 23:28   ` Xiang Mei
  2026-06-29 23:37     ` Dave Hansen
@ 2026-06-30 14:40     ` Pedro Falcato
  2026-06-30 15:15       ` Dave Hansen
  2026-06-30 21:41       ` Xiang Mei
  1 sibling, 2 replies; 15+ messages in thread
From: Pedro Falcato @ 2026-06-30 14:40 UTC (permalink / raw)
  To: Xiang Mei
  Cc: Dave Hansen, Kees Cook, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

Just as a quick FYI, it's good LKML etiquette to keep people who engaged with
the previous threads on CC for new versions :)

On Mon, Jun 29, 2026 at 04:28:19PM -0700, Xiang Mei wrote:
> On Mon, Jun 29, 2026 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 6/29/26 14:47, Xiang Mei wrote:
> > > With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> > > which an unprivileged user can surround with attacker-controlled data by
> > > spraying vmap allocations adjacent to a target stack (for example via
> > > XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> > > guarded vmalloc allocation is followed by a single unmapped guard page.
...snip...
> > To even be considered, this series needs to be refactored properly.
> > Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
> >
> Good suggestion, I will do it in v3:
> 
>     1/3 - introduce VMAP_GUARD_PAGES
>     2/3 - mark percpu vmap areas VM_NO_GUARD

I would suggest you create a VMAP_STACK flag and condition these guard regions
based on that. Otherwise it's a bit arbitrary as to which callers get 0x11 guard
pages, and which don't.

(you can find the concrete stack allocation functions in kernel/fork.c)

-- 
Pedro



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 14:01         ` Dave Hansen
@ 2026-06-30 14:58           ` Pedro Falcato
  2026-06-30 22:02           ` Xiang Mei
  1 sibling, 0 replies; 15+ messages in thread
From: Pedro Falcato @ 2026-06-30 14:58 UTC (permalink / raw)
  To: Dave Hansen, Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On Tue, Jun 30, 2026 at 07:01:48AM -0700, Dave Hansen wrote:
> On 6/29/26 18:22, Xiang Mei wrote:
> >> Please don't even try to send a v3 without addressing this.
> > This is a demo exploiting CVE-2026-31419 with this technique:
> > https://github.com/google/security-research/pull/397
> 
> Thanks for sharing that. That's really good info.
> 
> But what I want to hear a bit more about is why this new guard region is
> a good, generic mitigation. Does it help mitigate a whole class of
> vulnerabilities?

I guess, to add to the questions (to Xiang and/or x86 people):
1) Aren't initiatives like kCFI/CET/shadow stack supposed to mitigate these
issues? Is this mitigation supposed to be applied in spite of these features?
2) Aren't you screwed by the time the attacker gets kernel remote code
execution anyway?

> 
> I think you're making the claim that this ENTER technique takes what
> would normally just be a DoS and makes it fully exploitable. Does this
> happen for a lot of DoS bugs? Or is CVE-2026-31419 very unusual and this
> stack guard gunk won't ever be useful again?

I suspect it's just the typical UAF with a function pointer table, that leads
into remote code execution. I know that for our (SUSE) CVE scoring, we tend
to treat these kinds of UAFs a lot more seriously than others. But I didn't
look closely.

-- 
Pedro



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 14:40     ` Pedro Falcato
@ 2026-06-30 15:15       ` Dave Hansen
  2026-06-30 21:54         ` Dave Hansen
  2026-06-30 21:41       ` Xiang Mei
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2026-06-30 15:15 UTC (permalink / raw)
  To: Pedro Falcato, Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On 6/30/26 07:40, Pedro Falcato wrote:
>>> To even be considered, this series needs to be refactored properly.
>>> Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
>>>
>> Good suggestion, I will do it in v3:
>>
>>     1/3 - introduce VMAP_GUARD_PAGES
>>     2/3 - mark percpu vmap areas VM_NO_GUARD
> I would suggest you create a VMAP_STACK flag and condition these guard regions
> based on that. Otherwise it's a bit arbitrary as to which callers get 0x11 guard
> pages, and which don't.
> 
> (you can find the concrete stack allocation functions in kernel/fork.c)

The real question here is whether VMAP_STACK is a good idea or whether
__get_vm_area_node() should grow functionality to manipulate guard gaps
and then just have the stack allocation code use it directly.

I, personally, despise code like this:

static inline size_t get_vm_area_size(const struct vm_struct *area)
{
        if (!(area->flags & VM_NO_GUARD))
                /* return actual size without guard page */
                return area->size - PAGE_SIZE;
        else
                return area->size;

}

I'd *much* rather it be something more like:

static inline size_t get_vm_area_size(const struct vm_struct *area)
{
	return area->size - area->gap;
}

for example.

Looking at every use of VM_NO_GUARD, I think the kernel just gets
simpler if it goes away. It's only referenced in 6 sites:

 1. __get_vm_area_node() - Munge gap argument into 'area'
 2. get_vm_area_size() can be replaced as I showed above
 3. kasan_mem_notifier() - just pass a gap=0 to __get_vm_area_node()
 4. kasan_alloc_module_shadow() - just pass a gap=0
 5. check_sparse_vm_area() - check area->gap instead
 6. Just remove VM_NO_GUARD checks. No flag means no munging the flag.
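
To make the gap-based accessor above concrete, here is a minimal userspace
sketch, not kernel code. The struct, the helper names and the 4 KiB page
size are all assumptions made for illustration; the real struct vm_struct
has many more fields and the member might not end up being called 'gap'.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct vm_struct; only the fields used here. */
struct vm_area_sketch {
	size_t size;	/* total span, including the trailing gap */
	size_t gap;	/* bytes of unmapped guard gap; 0 means none */
};

/* Record the gap at allocation time instead of setting a VM_NO_GUARD flag. */
static inline void vm_area_sketch_init(struct vm_area_sketch *area,
				       size_t usable,
				       unsigned int nr_guard_pages)
{
	area->gap = nr_guard_pages * SKETCH_PAGE_SIZE;
	area->size = usable + area->gap;
}

/* Usable size: no flag test, just subtract whatever gap was recorded. */
static inline size_t vm_area_sketch_usable(const struct vm_area_sketch *area)
{
	assert(area->size >= area->gap);
	return area->size - area->gap;
}
```

A caller that wants no guard passes 0 pages and gets back its full size; a
caller that wants a wider gap passes a larger count, and the accessor does
not change.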




* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 14:40     ` Pedro Falcato
  2026-06-30 15:15       ` Dave Hansen
@ 2026-06-30 21:41       ` Xiang Mei
  1 sibling, 0 replies; 15+ messages in thread
From: Xiang Mei @ 2026-06-30 21:41 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Dave Hansen, Kees Cook, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On Tue, Jun 30, 2026 at 7:40 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> Just as a quick FYI, it's good LKML etiquette to keep people who engaged with
> the previous threads on CC for new versions :)
>
Thanks so much for the tip. I'll remember that.
> On Mon, Jun 29, 2026 at 04:28:19PM -0700, Xiang Mei wrote:
> > On Mon, Jun 29, 2026 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 6/29/26 14:47, Xiang Mei wrote:
> > > > With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
> > > > which an unprivileged user can surround with attacker-controlled data by
> > > > spraying vmap allocations adjacent to a target stack (for example via
> > > > XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
> > > > guarded vmalloc allocation is followed by a single unmapped guard page.
> ...snip...
> > > To even be considered, this series needs to be refactored properly.
> > > Making this VMAP_GUARD_PAGES a separate patch is the bare minimum.
> > >
> > Good suggestion, I will do it in v3:
> >
> >     1/3 - introduce VMAP_GUARD_PAGES
> >     2/3 - mark percpu vmap areas VM_NO_GUARD
>
> I would suggest you create a VMAP_STACK flag and condition these guard regions
> based on that. Otherwise it's a bit arbitrary as to which callers get 0x11 guard
> pages, and which don't.

Good point. That would make the structure cleaner.
>
> (you can find the concrete stack allocation functions in kernel/fork.c)
>
> --
> Pedro



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 15:15       ` Dave Hansen
@ 2026-06-30 21:54         ` Dave Hansen
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Hansen @ 2026-06-30 21:54 UTC (permalink / raw)
  To: Pedro Falcato, Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

[-- Attachment #1: Type: text/plain, Size: 1329 bytes --]

On 6/30/26 08:15, Dave Hansen wrote:
> Looking at every use of VM_NO_GUARD, I think the kernel just gets
> simpler if it goes away. It's only referenced in 6 sites:
> 
>  1. __get_vm_area_node() - Munge gap argument into 'area'
>  2. get_vm_area_size() can be replaced as I showed above
>  3. kasan_mem_notifier() - just pass a gap=0 to __get_vm_area_node()
>  4. kasan_alloc_module_shadow() - just pass a gap=0
>  5. check_sparse_vm_area() - check area->gap instead
>  6. Just remove VM_NO_GUARD checks. No flag means no munging the flag.

Oh, and I guess I didn't say it explicitly, but what this *also* lets
you do is have runtime-variable guard gaps.

You could have variable-sized gaps as another mitigation. Or trim them
down to one page if mitigations=off. Or, make them huge. Or, align
stacks to 32k so that all their PTEs are in a cacheline.

I suspect that building the gap into the vmap_area itself will make the
code simpler *and* more flexible. That also makes hardening/mitigation
based on changing the gap much easier to swallow because it can be
turned on and off much more easily.

It also prevents leaking details about stack allocations into the core
vmalloc() code.
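
As a rough illustration of a runtime-chosen gap, the sketch below picks a
guard page count for stacks from the ENTER displacement formula in the
original patch description (imm16 + 8 * (L + 1), L in [0, 31]). It is
userspace C, not part of the attached patch; the function names, the boolean
standing in for mitigations=off, and the 4 KiB page size are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Largest RSP drop from one ENTER: imm16 = 0xffff plus 8 * (31 + 1). */
#define ENTER_MAX_DISPLACEMENT (0xffffUL + 8UL * 32UL)	/* 0x100ff */

/* Smallest whole-page count whose span strictly exceeds 'bytes'. */
static inline unsigned long pages_exceeding(unsigned long bytes)
{
	return bytes / SKETCH_PAGE_SIZE + 1;
}

/*
 * Hypothetical policy hook: one guard page when mitigations are off,
 * otherwise enough pages that a single ENTER cannot step over the gap.
 */
static inline unsigned long stack_guard_pages(bool mitigations_off)
{
	unsigned long n = mitigations_off ?
			  1 : pages_exceeding(ENTER_MAX_DISPLACEMENT);

	assert(n >= 1);
	return n;
}
```

With 4 KiB pages the hardened case works out to 0x11 pages, matching the
VMAP_GUARD_PAGES value proposed in the patch, and the result could be passed
straight through as the per-area guard count.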

Here's a completely vibe-coded hack that shows how this might look. It's
basically neutral on lines-of-code. It's almost a cleanup on its own.

[-- Attachment #2: zap-VM_NO_GUARD.patch --]
[-- Type: text/x-patch, Size: 18578 bytes --]

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dd85e093ffdb..cf583209df50 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1220,7 +1220,7 @@ void mark_rodata_ro(void)
 
 static void __init declare_vma(struct vm_struct *vma,
 			       void *va_start, void *va_end,
-			       unsigned long vm_flags)
+			       unsigned int nr_guard_pages)
 {
 	phys_addr_t pa_start = __pa_symbol(va_start);
 	unsigned long size = va_end - va_start;
@@ -1228,14 +1228,14 @@ static void __init declare_vma(struct vm_struct *vma,
 	BUG_ON(!PAGE_ALIGNED(pa_start));
 	BUG_ON(!PAGE_ALIGNED(size));
 
-	if (!(vm_flags & VM_NO_GUARD))
-		size += PAGE_SIZE;
+	size += nr_guard_pages * PAGE_SIZE;
 
-	vma->addr	= va_start;
-	vma->phys_addr	= pa_start;
-	vma->size	= size;
-	vma->flags	= VM_MAP | vm_flags;
-	vma->caller	= __builtin_return_address(0);
+	vma->addr		= va_start;
+	vma->phys_addr		= pa_start;
+	vma->size		= size;
+	vma->flags		= VM_MAP;
+	vma->nr_guard_pages	= nr_guard_pages;
+	vma->caller		= __builtin_return_address(0);
 
 	vm_area_add_early(vma);
 }
@@ -1376,11 +1376,11 @@ static void __init declare_kernel_vmas(void)
 {
 	static struct vm_struct vmlinux_seg[KERNEL_SEGMENT_COUNT];
 
-	declare_vma(&vmlinux_seg[0], _text, _etext, VM_NO_GUARD);
-	declare_vma(&vmlinux_seg[1], __start_rodata, __inittext_begin, VM_NO_GUARD);
-	declare_vma(&vmlinux_seg[2], __inittext_begin, __inittext_end, VM_NO_GUARD);
-	declare_vma(&vmlinux_seg[3], __initdata_begin, __initdata_end, VM_NO_GUARD);
-	declare_vma(&vmlinux_seg[4], _data, _end, 0);
+	declare_vma(&vmlinux_seg[0], _text, _etext, 0);
+	declare_vma(&vmlinux_seg[1], __start_rodata, __inittext_begin, 0);
+	declare_vma(&vmlinux_seg[2], __inittext_begin, __inittext_end, 0);
+	declare_vma(&vmlinux_seg[3], __initdata_begin, __initdata_end, 0);
+	declare_vma(&vmlinux_seg[4], _data, _end, VM_DEFAULT_GUARD_PAGES);
 }
 
 void __pi_map_range(phys_addr_t *pte, u64 start, u64 end, phys_addr_t pa,
diff --git a/arch/loongarch/include/asm/kfence.h b/arch/loongarch/include/asm/kfence.h
index da9e93024626..fa9562750312 100644
--- a/arch/loongarch/include/asm/kfence.h
+++ b/arch/loongarch/include/asm/kfence.h
@@ -23,6 +23,7 @@ static inline bool arch_kfence_init_pool(void)
 
 	area = __get_vm_area_caller(KFENCE_POOL_SIZE, VM_IOREMAP,
 				    KFENCE_AREA_START, KFENCE_AREA_END,
+				    VM_DEFAULT_GUARD_PAGES,
 				    __builtin_return_address(0));
 	if (!area)
 		return false;
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index e27342ef128b..a29fc0350fb3 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -133,6 +133,7 @@ void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size)
 	 * reserved 64K legacy region.
 	 */
 	area = __get_vm_area_caller(size, VM_IOREMAP, PHB_IO_BASE, PHB_IO_END,
+				    VM_DEFAULT_GUARD_PAGES,
 				    __builtin_return_address(0));
 	if (!area)
 		return NULL;
diff --git a/arch/sh/kernel/cpu/sh4/sq.c b/arch/sh/kernel/cpu/sh4/sq.c
index 908a8e09113b..383f42f22dd5 100644
--- a/arch/sh/kernel/cpu/sh4/sq.c
+++ b/arch/sh/kernel/cpu/sh4/sq.c
@@ -104,7 +104,8 @@ static int __sq_remap(struct sq_mapping *map, pgprot_t prot)
 	struct vm_struct *vma;
 
 	vma = __get_vm_area_caller(map->size, VM_IOREMAP, map->sq_addr,
-			SQ_ADDRMAX, __builtin_return_address(0));
+			SQ_ADDRMAX, VM_DEFAULT_GUARD_PAGES,
+			__builtin_return_address(0));
 	if (!vma)
 		return -ENOMEM;
 
diff --git a/arch/sh/mm/pmb.c b/arch/sh/mm/pmb.c
index 482eec50f404..f593fa166d0b 100644
--- a/arch/sh/mm/pmb.c
+++ b/arch/sh/mm/pmb.c
@@ -444,7 +444,7 @@ void __iomem *pmb_remap_caller(phys_addr_t phys, unsigned long size,
 	 * 0xb000...0xc000 range.
 	 */
 	area = __get_vm_area_caller(aligned, VM_IOREMAP, 0xb0000000,
-				    P3SEG, caller);
+				    P3SEG, VM_DEFAULT_GUARD_PAGES, caller);
 	if (!area)
 		return NULL;
 
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 323adc93f2dc..1ab9e5cdba5b 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -524,8 +524,8 @@ void __init hyperv_init(void)
 
 	hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, MODULES_VADDR,
 			MODULES_END, GFP_KERNEL, PAGE_KERNEL_ROX,
-			VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-			__builtin_return_address(0));
+			VM_FLUSH_RESET_PERMS, VM_DEFAULT_GUARD_PAGES,
+			NUMA_NO_NODE, __builtin_return_address(0));
 	if (hv_hypercall_pg == NULL)
 		goto clean_guest_os_id;
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..f5a51fb4518a 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -25,7 +25,6 @@ struct iov_iter;		/* in uio.h */
 #define VM_USERMAP		0x00000008	/* suitable for remap_vmalloc_range */
 #define VM_DMA_COHERENT		0x00000010	/* dma_alloc_coherent */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
-#define VM_NO_GUARD		0x00000040      /* ***DANGEROUS*** don't add guard page */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
 #define VM_FLUSH_RESET_PERMS	0x00000100	/* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES	0x00000200	/* put pages and free array in vfree */
@@ -41,6 +40,13 @@ struct iov_iter;		/* in uio.h */
 
 /* bits [20..32] reserved for arch specific ioremap internals */
 
+/*
+ * Default number of unmapped guard pages appended to a vm area to catch
+ * out-of-bounds accesses. Callers that need a different count (typically
+ * zero) pass it explicitly to __get_vm_area_node() and friends.
+ */
+#define VM_DEFAULT_GUARD_PAGES	1
+
 /*
  * Maximum alignment for ioremap() regions.
  * Can be overridden by arch-specific value.
@@ -63,6 +69,7 @@ struct vm_struct {
 	unsigned int		page_order;
 #endif
 	unsigned int		nr_pages;
+	unsigned int		nr_guard_pages;
 	phys_addr_t		phys_addr;
 	const void		*caller;
 	unsigned long		requested_size;
@@ -86,6 +93,7 @@ struct vmap_area {
 		struct vm_struct *vm;           /* in "busy" tree */
 	};
 	unsigned long flags; /* mark type of vm_map_ram area */
+	unsigned int nr_guard_pages; /* unmapped pages trailing this area */
 };
 
 /* archs that select HAVE_ARCH_HUGE_VMAP should override one or more of these */
@@ -173,7 +181,8 @@ extern void *__vmalloc_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1
 
 extern void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
-			pgprot_t prot, unsigned long vm_flags, int node,
+			pgprot_t prot, unsigned long vm_flags,
+			unsigned long nr_guard_pages, int node,
 			const void *caller) __alloc_size(1);
 #define __vmalloc_node_range(...)	alloc_hooks(__vmalloc_node_range_noprof(__VA_ARGS__))
 
@@ -235,12 +244,7 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
 
 static inline size_t get_vm_area_size(const struct vm_struct *area)
 {
-	if (!(area->flags & VM_NO_GUARD))
-		/* return actual size without guard page */
-		return area->size - PAGE_SIZE;
-	else
-		return area->size;
-
+	return area->size - area->nr_guard_pages * PAGE_SIZE;
 }
 
 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
@@ -249,6 +253,7 @@ extern struct vm_struct *get_vm_area_caller(unsigned long size,
 extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 					unsigned long flags,
 					unsigned long start, unsigned long end,
+					unsigned long nr_guard_pages,
 					const void *caller);
 void free_vm_area(struct vm_struct *area);
 extern struct vm_struct *remove_vm_area(const void *addr);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a3c0214ca934..b844d24fba58 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -403,7 +403,8 @@ static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
 
 	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
 			gfp | GFP_KERNEL | __GFP_RETRY_MAYFAIL, PAGE_KERNEL,
-			flags, numa_node, __builtin_return_address(0));
+			flags, VM_DEFAULT_GUARD_PAGES, numa_node,
+			__builtin_return_address(0));
 }
 
 void *bpf_map_area_alloc(u64 size, int numa_node)
diff --git a/kernel/scs.c b/kernel/scs.c
index 772488afd5b9..2647c79488e6 100644
--- a/kernel/scs.c
+++ b/kernel/scs.c
@@ -44,7 +44,8 @@ static void *__scs_alloc(int node)
 	}
 
 	s = __vmalloc_node_range(SCS_SIZE, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_SCS, PAGE_KERNEL, 0, node,
+				    GFP_SCS, PAGE_KERNEL, 0,
+				    VM_DEFAULT_GUARD_PAGES, node,
 				    __builtin_return_address(0));
 
 out:
diff --git a/mm/execmem.c b/mm/execmem.c
index 084a207e4278..62d6800f3e92 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -39,13 +39,14 @@ static void *execmem_vmalloc(struct execmem_range *range, size_t size,
 		vm_flags |= VM_DEFER_KMEMLEAK;
 
 	p = __vmalloc_node_range(size, align, start, end, gfp_flags,
-				 pgprot, vm_flags, NUMA_NO_NODE,
-				 __builtin_return_address(0));
+				 pgprot, vm_flags, VM_DEFAULT_GUARD_PAGES,
+				 NUMA_NO_NODE, __builtin_return_address(0));
 	if (!p && range->fallback_start) {
 		start = range->fallback_start;
 		end = range->fallback_end;
 		p = __vmalloc_node_range(size, align, start, end, gfp_flags,
-					 pgprot, vm_flags, NUMA_NO_NODE,
+					 pgprot, vm_flags,
+					 VM_DEFAULT_GUARD_PAGES, NUMA_NO_NODE,
 					 __builtin_return_address(0));
 	}
 
@@ -68,12 +69,14 @@ struct vm_struct *execmem_vmap(size_t size)
 	struct vm_struct *area;
 
 	area = __get_vm_area_node(size, range->alignment, PAGE_SHIFT, VM_ALLOC,
-				  range->start, range->end, NUMA_NO_NODE,
+				  range->start, range->end,
+				  VM_DEFAULT_GUARD_PAGES, NUMA_NO_NODE,
 				  GFP_KERNEL, __builtin_return_address(0));
 	if (!area && range->fallback_start)
 		area = __get_vm_area_node(size, range->alignment, PAGE_SHIFT, VM_ALLOC,
 					  range->fallback_start, range->fallback_end,
-					  NUMA_NO_NODE, GFP_KERNEL, __builtin_return_address(0));
+					  VM_DEFAULT_GUARD_PAGES, NUMA_NO_NODE,
+					  GFP_KERNEL, __builtin_return_address(0));
 
 	return area;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..d60e45d3b6d9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1588,8 +1588,9 @@ int migrate_device_coherent_folio(struct folio *folio);
 struct vm_struct *__get_vm_area_node(unsigned long size,
 				     unsigned long align, unsigned long shift,
 				     unsigned long vm_flags, unsigned long start,
-				     unsigned long end, int node, gfp_t gfp_mask,
-				     const void *caller);
+				     unsigned long end,
+				     unsigned long nr_guard_pages, int node,
+				     gfp_t gfp_mask, const void *caller);
 
 /*
  * mm/gup.c
diff --git a/mm/ioremap.c b/mm/ioremap.c
index c36dd9f62fd5..61734b8ec128 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -34,7 +34,8 @@ void __iomem *generic_ioremap_prot(phys_addr_t phys_addr, size_t size,
 	size = PAGE_ALIGN(size + offset);
 
 	area = __get_vm_area_caller(size, VM_IOREMAP, IOREMAP_START,
-				    IOREMAP_END, __builtin_return_address(0));
+				    IOREMAP_END, VM_DEFAULT_GUARD_PAGES,
+				    __builtin_return_address(0));
 	if (!area)
 		return NULL;
 	vaddr = (unsigned long)area->addr;
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d286e0a04543..5532a735534d 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -241,7 +241,7 @@ static int __meminit kasan_mem_notifier(struct notifier_block *nb,
 
 		ret = __vmalloc_node_range(shadow_size, PAGE_SIZE, shadow_start,
 					shadow_end, GFP_KERNEL,
-					PAGE_KERNEL, VM_NO_GUARD,
+					PAGE_KERNEL, 0, 0,
 					pfn_to_nid(mem_data->start_pfn),
 					__builtin_return_address(0));
 		if (!ret)
@@ -676,7 +676,7 @@ int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask)
 	ret = __vmalloc_node_range(shadow_size, 1, shadow_start,
 			shadow_start + shadow_size,
 			GFP_KERNEL,
-			PAGE_KERNEL, VM_NO_GUARD, NUMA_NO_NODE,
+			PAGE_KERNEL, 0, 0, NUMA_NO_NODE,
 			__builtin_return_address(0));
 
 	if (ret) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c31a8615a832..256b1de55080 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,7 +729,7 @@ static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
 	might_sleep();
 	if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
 		return -EINVAL;
-	if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+	if (WARN_ON_ONCE(!area->nr_guard_pages))
 		return -EINVAL;
 	if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
 		return -EINVAL;
@@ -2103,11 +2103,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->va_end = addr + size;
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
+	va->nr_guard_pages = 0;
 
 	if (vm) {
 		vm->addr = (void *)va->va_start;
 		vm->size = va_size(va);
 		va->vm = vm;
+		va->nr_guard_pages = vm->nr_guard_pages;
 	}
 
 	vn = addr_to_node(va->va_start);
@@ -3196,7 +3198,8 @@ void clear_vm_uninitialized_flag(struct vm_struct *vm)
 
 struct vm_struct *__get_vm_area_node(unsigned long size,
 		unsigned long align, unsigned long shift, unsigned long flags,
-		unsigned long start, unsigned long end, int node,
+		unsigned long start, unsigned long end,
+		unsigned long nr_guard_pages, int node,
 		gfp_t gfp_mask, const void *caller)
 {
 	struct vmap_area *va;
@@ -3216,12 +3219,12 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (unlikely(!area))
 		return NULL;
 
-	if (!(flags & VM_NO_GUARD))
-		size += PAGE_SIZE;
+	size += nr_guard_pages * PAGE_SIZE;
 
 	area->flags = flags;
 	area->caller = caller;
 	area->requested_size = requested_size;
+	area->nr_guard_pages = nr_guard_pages;
 
 	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area);
 	if (IS_ERR(va)) {
@@ -3246,10 +3249,12 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
 
 struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
 				       unsigned long start, unsigned long end,
+				       unsigned long nr_guard_pages,
 				       const void *caller)
 {
 	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags, start, end,
-				  NUMA_NO_NODE, GFP_KERNEL, caller);
+				  nr_guard_pages, NUMA_NO_NODE, GFP_KERNEL,
+				  caller);
 }
 
 /**
@@ -3267,8 +3272,8 @@ struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
 {
 	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags,
 				  VMALLOC_START, VMALLOC_END,
-				  NUMA_NO_NODE, GFP_KERNEL,
-				  __builtin_return_address(0));
+				  VM_DEFAULT_GUARD_PAGES, NUMA_NO_NODE,
+				  GFP_KERNEL, __builtin_return_address(0));
 }
 
 struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
@@ -3276,7 +3281,8 @@ struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
 {
 	return __get_vm_area_node(size, 1, PAGE_SHIFT, flags,
 				  VMALLOC_START, VMALLOC_END,
-				  NUMA_NO_NODE, GFP_KERNEL, caller);
+				  VM_DEFAULT_GUARD_PAGES, NUMA_NO_NODE,
+				  GFP_KERNEL, caller);
 }
 
 /**
@@ -3532,13 +3538,6 @@ void *vmap(struct page **pages, unsigned int count,
 	if (WARN_ON_ONCE(flags & VM_FLUSH_RESET_PERMS))
 		return NULL;
 
-	/*
-	 * Your top guard is someone else's bottom guard. Not having a top
-	 * guard compromises someone else's mappings too.
-	 */
-	if (WARN_ON_ONCE(flags & VM_NO_GUARD))
-		flags &= ~VM_NO_GUARD;
-
 	if (count > totalram_pages())
 		return NULL;
 
@@ -3959,7 +3958,8 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
  * @end:		  vm area range end
  * @gfp_mask:		  flags for the page level allocator
  * @prot:		  protection mask for the allocated pages
- * @vm_flags:		  additional vm area flags (e.g. %VM_NO_GUARD)
+ * @vm_flags:		  additional vm area flags
+ * @nr_guard_pages:	  number of unmapped guard pages to append (0 if none)
  * @node:		  node to use for allocation or NUMA_NO_NODE
  * @caller:		  caller's return address
  *
@@ -3985,7 +3985,8 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
  */
 void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
-			pgprot_t prot, unsigned long vm_flags, int node,
+			pgprot_t prot, unsigned long vm_flags,
+			unsigned long nr_guard_pages, int node,
 			const void *caller)
 {
 	struct vm_struct *area;
@@ -4022,8 +4023,8 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 
 again:
 	area = __get_vm_area_node(size, align, shift, VM_ALLOC |
-				  VM_UNINITIALIZED | vm_flags, start, end, node,
-				  gfp_mask, caller);
+				  VM_UNINITIALIZED | vm_flags, start, end,
+				  nr_guard_pages, node, gfp_mask, caller);
 	if (!area) {
 		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
@@ -4122,7 +4123,8 @@ void *__vmalloc_node_noprof(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, int node, const void *caller)
 {
 	return __vmalloc_node_range_noprof(size, align, VMALLOC_START, VMALLOC_END,
-				gfp_mask, PAGE_KERNEL, 0, node, caller);
+				gfp_mask, PAGE_KERNEL, 0,
+				VM_DEFAULT_GUARD_PAGES, node, caller);
 }
 /*
  * This is only for performance analysis of vmalloc and stress purpose.
@@ -4180,7 +4182,8 @@ void *vmalloc_huge_node_noprof(unsigned long size, gfp_t gfp_mask, int node)
 		gfp_mask = vmalloc_fix_flags(gfp_mask);
 	return __vmalloc_node_range_noprof(size, 1, VMALLOC_START, VMALLOC_END,
 					   gfp_mask, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
-					   node, __builtin_return_address(0));
+					   VM_DEFAULT_GUARD_PAGES, node,
+					   __builtin_return_address(0));
 }
 EXPORT_SYMBOL_GPL(vmalloc_huge_node_noprof);
 
@@ -4217,8 +4220,8 @@ void *vmalloc_user_noprof(unsigned long size)
 {
 	return __vmalloc_node_range_noprof(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
-				    VM_USERMAP, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+				    VM_USERMAP, VM_DEFAULT_GUARD_PAGES,
+				    NUMA_NO_NODE, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_user_noprof);
 
@@ -4410,8 +4413,8 @@ void *vmalloc_32_user_noprof(unsigned long size)
 {
 	return __vmalloc_node_range_noprof(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
 				    GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL,
-				    VM_USERMAP, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+				    VM_USERMAP, VM_DEFAULT_GUARD_PAGES,
+				    NUMA_NO_NODE, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(vmalloc_32_user_noprof);
 
@@ -5463,6 +5466,7 @@ void __init vmalloc_init(void)
 		va->va_start = (unsigned long)tmp->addr;
 		va->va_end = va->va_start + tmp->size;
 		va->vm = tmp;
+		va->nr_guard_pages = tmp->nr_guard_pages;
 
 		vn = addr_to_node(va->va_start);
 		insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
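
The accounting implied by the new nr_guard_pages argument can be modelled
in userspace. This is only a sketch under assumptions: struct model_area,
model_reserve() and model_mapped_size() are hypothetical names that do not
appear in the patch, 4 KiB pages are assumed, and the real bookkeeping in
mm/vmalloc.c may differ.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 0x1000UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the fields the patch touches. */
struct model_area {
	unsigned long size;		/* reserved span in bytes, guard included */
	unsigned long nr_guard_pages;	/* trailing unmapped pages, 0 if none */
};

/* Reserve a page-aligned payload followed by nr_guard_pages unmapped pages. */
static struct model_area model_reserve(unsigned long payload,
				       unsigned long nr_guard_pages)
{
	struct model_area area;

	assert(payload % MODEL_PAGE_SIZE == 0);
	area.size = payload + nr_guard_pages * MODEL_PAGE_SIZE;
	area.nr_guard_pages = nr_guard_pages;
	return area;
}

/* Bytes that are actually backed by pages (the span minus its guard). */
static unsigned long model_mapped_size(const struct model_area *area)
{
	return area->size - area->nr_guard_pages * MODEL_PAGE_SIZE;
}
```

For example, model_reserve(0x4000, 0x11) reserves 0x15000 bytes of address
space, of which only 0x4000 are mapped.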


* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 14:01         ` Dave Hansen
  2026-06-30 14:58           ` Pedro Falcato
@ 2026-06-30 22:02           ` Xiang Mei
  2026-06-30 22:05             ` Dave Hansen
  1 sibling, 1 reply; 15+ messages in thread
From: Xiang Mei @ 2026-06-30 22:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On Tue, Jun 30, 2026 at 7:02 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/29/26 18:22, Xiang Mei wrote:
> >> Please don't even try to send a v3 without addressing this.
> > This is a demo exploiting CVE-2026-31419 with this technique:
> > https://github.com/google/security-research/pull/397
>
> Thanks for sharing that. That's really good info.
>
> But what I want to hear a bit more about is why this new guard region is
> a good, generic mitigation. Does it help mitigate a whole class of
> vulnerabilities?
>
Thanks for the question. I'll call this an issue rather than a bug,
since it stems from the combination of the instruction set and the
kernel stack design.
I used LLMs to evaluate other Intel instructions that modify the SP
register (and to check whether such gadgets actually exist, e.g.,
`add rsp, 0x8000`), and couldn't find a second gadget usable for
pivoting the stack into adjacent pages.
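
For the ENTER case, such an existence check could look roughly like the
sketch below. It matches the byte pattern C8 iw ib C3 (`enter imm16, imm8;
ret`) in a buffer and counts the hits whose RSP displacement, per the
formula in the patch description, exceeds a threshold. The function name
count_enter_ret() and the idea of scanning a flat buffer are illustrative
assumptions, not tooling used in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count "enter imm16, imm8; ret" patterns (C8 iw ib C3) in buf whose
 * RSP displacement, imm16 + 8 * (L + 1) with L = imm8 & 0x1f, is
 * larger than min_disp bytes.
 */
static size_t count_enter_ret(const uint8_t *buf, size_t len,
			      unsigned long min_disp)
{
	size_t hits = 0;

	for (size_t i = 0; i + 5 <= len; i++) {
		if (buf[i] != 0xc8 || buf[i + 4] != 0xc3)
			continue;

		/* imm16 is stored little-endian right after the opcode. */
		unsigned long imm16 = buf[i + 1] |
				      ((unsigned long)buf[i + 2] << 8);
		unsigned long level = buf[i + 3] & 0x1f;

		if (imm16 + 8 * (level + 1) > min_disp)
			hits++;
	}
	return hits;
}
```

Passing min_disp = 0x1000 keeps only the candidates that move RSP by more
than a single 4 KiB guard page.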

> I think you're making the claim that this ENTER technique takes what
> would normally just be a DoS and makes it fully exploitable. Does this
> happen for a lot of DoS bugs? Or is CVE-2026-31419 very unusual and this
> stack guard gunk won't ever be useful again?

I may have written some misleading content; let me provide more
information to correct it.
ENTER escalates CFH (Control-Flow Hijacking) to ACE (Arbitrary Code
Execution). It can't escalate a DoS into an exploitation primitive:
1) ENTER is an instruction, and it can be used to perform stack pivoting
2) The ENTER-pivoting technique requires a CFH primitive, for example
  a) CVE-2026-31419 is a race condition, and it gives a UAF
  b) attackers exploit the UAF to control a function pointer
  c) attackers point it at an ENTER-pivoting gadget
(e.g., `enter 0x8000, 0; ret`)
  d) attackers escalate CFH to ACE
3) ENTER-pivoting is strong since the gadget is common and **one
gadget** is enough to escalate the CFH to ACE
4) Before this gadget, there was only one public one-gadget style
CFH->ACE technique: jumping into BPF JIT (mitigated by JIT hardening)
5) This technique can be used for all CFH attacks, and it enables
some otherwise hard exploits (no register control, BPF JIT hardened).
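
To make the reach of such a gadget concrete, the displacement formula from
the patch description can be evaluated directly. The sketch below assumes
4 KiB pages; the helper names are illustrative and do not come from the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 0x1000UL	/* assumption: 4 KiB pages */

/*
 * Bytes by which "enter imm16, imm8" lowers RSP in 64-bit mode:
 * imm16 + 8 * (L + 1), with L = imm8 & 0x1f.
 */
static unsigned long enter_displacement(uint16_t imm16, uint8_t imm8)
{
	unsigned long level = imm8 & 0x1f;

	return (unsigned long)imm16 + 8UL * (level + 1);
}

/* Smallest whole-page guard span strictly larger than disp bytes. */
static unsigned long guard_pages_needed(unsigned long disp)
{
	return disp / PAGE_SIZE_BYTES + 1;
}

/* Worked values for the example gadget and for the worst case. */
static void enter_pivot_examples(void)
{
	/* enter 0x8000, 0 moves RSP by 0x8008 bytes, past one guard page. */
	assert(enter_displacement(0x8000, 0) == 0x8008);
	assert(enter_displacement(0x8000, 0) > PAGE_SIZE_BYTES);

	/* Worst case: imm16 = 0xffff, L = 31 gives 0x100ff bytes. */
	assert(enter_displacement(0xffff, 0x1f) == 0x100ff);
	assert(guard_pages_needed(0x100ff) == 0x11);
}
```

The worst case reproduces the 0x11-page figure quoted in the patch
description.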

Please feel free to ask any questions; I am glad to help.
Thanks,
Xiang



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 22:02           ` Xiang Mei
@ 2026-06-30 22:05             ` Dave Hansen
  2026-06-30 22:13               ` H. Peter Anvin
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2026-06-30 22:05 UTC (permalink / raw)
  To: Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, H . Peter Anvin,
	linux-mm, linux-kernel, Jennifer Miller, Tiffany Bao, Ruoyu Wang,
	Adam Doupe, Kyle Zeng, Yan Shoshitaishvili

On 6/30/26 15:02, Xiang Mei wrote:
> Please feel free to ask any questions; I am glad to help.

How do the CET features: kernel IBT and the (theoretical for Linux)
kernel shadow stacks impact the situation?



* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 22:05             ` Dave Hansen
@ 2026-06-30 22:13               ` H. Peter Anvin
  2026-06-30 22:47                 ` Xiang Mei
  0 siblings, 1 reply; 15+ messages in thread
From: H. Peter Anvin @ 2026-06-30 22:13 UTC (permalink / raw)
  To: Dave Hansen, Xiang Mei
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, linux-mm, linux-kernel,
	Jennifer Miller, Tiffany Bao, Ruoyu Wang, Adam Doupe, Kyle Zeng,
	Yan Shoshitaishvili

On 2026-06-30 15:05, Dave Hansen wrote:
> On 6/30/26 15:02, Xiang Mei wrote:
>> Please feel free to ask any questions; I am glad to help.
> 
> How do the CET features: kernel IBT and the (theoretical for Linux)
> kernel shadow stacks impact the situation?

CET should prevent this from being the target of a JOP attack.

Kernel shadow stacks should prevent most stack-pivot attacks in general.

	-hpa




* Re: [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot
  2026-06-30 22:13               ` H. Peter Anvin
@ 2026-06-30 22:47                 ` Xiang Mei
  0 siblings, 0 replies; 15+ messages in thread
From: Xiang Mei @ 2026-06-30 22:47 UTC (permalink / raw)
  To: H. Peter Anvin, Dave Hansen
  Cc: Kees Cook, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, linux-hardening,
	Uladzislau Rezki, Gustavo A . R . Silva, linux-mm, linux-kernel,
	Jennifer Miller, Tiffany Bao, Ruoyu Wang, Adam Doupe, Kyle Zeng,
	Yan Shoshitaishvili

On Tue, Jun 30, 2026 at 3:14 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-30 15:05, Dave Hansen wrote:
> > On 6/30/26 15:02, Xiang Mei wrote:
> >> Please feel free to ask any questions; I am glad to help.
> >
> > How do the CET features: kernel IBT and the (theoretical for Linux)
> > kernel shadow stacks impact the situation?
>
> CET should prevent this from being the target of a JOP attack.
>
You are right; CET breaks this technique's assumption of a usable CFH
primitive.

> Kernel shadow stacks should prevent most stack-pivot attacks in general.

For the shadow stack, I haven't examined the implementation to check
whether the working stack can be surrounded by attacker-controlled
payload (vmalloc pages). If it can, the shadow stack can't stop this
technique, assuming we got a CFH from a function pointer in a heap
object.

>
>         -hpa
>



End of thread (newest message: 2026-06-30 22:47 UTC)

Thread overview: 15+ messages
2026-06-29 21:47 [PATCH v2] mm/vmalloc: widen guard region to defeat ENTER-based stack pivot Xiang Mei
2026-06-29 22:29 ` Dave Hansen
2026-06-29 23:28   ` Xiang Mei
2026-06-29 23:37     ` Dave Hansen
2026-06-30  1:22       ` Xiang Mei
2026-06-30 14:01         ` Dave Hansen
2026-06-30 14:58           ` Pedro Falcato
2026-06-30 22:02           ` Xiang Mei
2026-06-30 22:05             ` Dave Hansen
2026-06-30 22:13               ` H. Peter Anvin
2026-06-30 22:47                 ` Xiang Mei
2026-06-30 14:40     ` Pedro Falcato
2026-06-30 15:15       ` Dave Hansen
2026-06-30 21:54         ` Dave Hansen
2026-06-30 21:41       ` Xiang Mei
