linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky
@ 2025-10-29 16:50 Lorenzo Stoakes
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-29 16:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't apparent which VMAs may
have guard regions present, especially when performing operations which
walk memory regions, such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag to
provide this visibility, and updates the smaps logic to display an entry
for such VMAs.

The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.
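
As an illustration of the intended visibility (not part of the kernel
changes themselves), userspace can detect this by looking for the new "gu"
mnemonic on the VmFlags line of /proc/$pid/smaps - a minimal sketch, with
error handling kept simple:

#include <stdio.h>
#include <string.h>

/* Return 1 if any mapping of the current process reports the "gu" flag. */
static int any_maybe_guard(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[1024];
	int found = 0;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Mnemonics on the VmFlags line are space-delimited. */
		if (!strncmp(line, "VmFlags:", 8) && strstr(line, " gu "))
			found = 1;
	}

	fclose(f);
	return found;
}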

It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is, flags which:

a. If set in one VMA and not in another, still permit those VMAs to be
   merged (if otherwise compatible).

b. When merged, result in the merged VMA having the flag set.

The VMA logic is updated to propagate these flags correctly.
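
To make this concrete, the flag comparison and propagation on merge reduce
to roughly the following sketch, expressed in terms of the
VM_STICKY/VM_IGNORE_MERGE masks introduced in patch 2/3 (simplified - the
real logic lives in is_mergeable_vma() and the merge/expand paths, and
considers far more than flags):

/* Differences confined to ignored (including sticky) flags don't block merge... */
static bool flags_allow_merge(vm_flags_t a, vm_flags_t b)
{
	return !((a ^ b) & ~VM_IGNORE_MERGE);
}

/* ...but any sticky flag present on either side survives into the result. */
static vm_flags_t flags_after_merge(vm_flags_t a, vm_flags_t b)
{
	return a | (b & VM_STICKY);
}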

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously, installing these
established an anon_vma object on file-backed mappings solely so that
vma_needs_copy() would correctly propagate guard region mappings to child
processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves this
issue.

Finally, we introduce extensive VMA userland tests to assert that the sticky
VMA logic behaves correctly, as well as guard region self-tests to assert
that smaps visibility is correctly implemented.

Lorenzo Stoakes (3):
  mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  mm: implement sticky, copy on fork VMA flags
  selftests/mm/guard-regions: add smaps visibility test

 Documentation/filesystems/proc.rst         |   1 +
 fs/proc/task_mmu.c                         |   1 +
 include/linux/mm.h                         |  33 ++++++
 include/trace/events/mmflags.h             |   1 +
 mm/madvise.c                               |  22 ++--
 mm/memory.c                                |   3 +
 mm/vma.c                                   |  22 ++--
 tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 tools/testing/vma/vma.c                    |  89 +++++++++++++--
 tools/testing/vma/vma_internal.h           |  33 ++++++
 12 files changed, 303 insertions(+), 28 deletions(-)

--
2.51.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-29 16:50 [PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
@ 2025-10-29 16:50 ` Lorenzo Stoakes
  2025-10-29 19:50   ` Randy Dunlap
                     ` (2 more replies)
  2025-10-29 16:50 ` [PATCH 2/3] mm: implement sticky, copy on fork VMA flags Lorenzo Stoakes
  2025-10-29 16:50 ` [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test Lorenzo Stoakes
  2 siblings, 3 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-29 16:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787e0 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.
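
For example, scanning a range for guard regions via PAGEMAP_SCAN looks
roughly like the sketch below (illustrative only - the uapi names are those
added by the commits above and should be checked against current
<linux/fs.h>; error handling elided):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Report up to vec_len regions within [start, end) containing guard markers. */
static int scan_guards(unsigned long start, unsigned long end,
		       struct page_region *vec, unsigned long vec_len)
{
	struct pm_scan_arg arg = {
		.size = sizeof(arg),
		.start = start,
		.end = end,
		.vec = (unsigned long)vec,
		.vec_len = vec_len,
		.category_mask = PAGE_IS_GUARD,
		.return_mask = PAGE_IS_GUARD,
	};
	int fd = open("/proc/self/pagemap", O_RDONLY);
	int nr_regions;

	nr_regions = ioctl(fd, PAGEMAP_SCAN, &arg);
	close(fd);

	return nr_regions;
}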

This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
that guard regions exist in ranges.

This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).
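
To illustrate the 'maybe' semantics, consider the following sketch
(assuming headers that provide MADV_GUARD_INSTALL/MADV_GUARD_REMOVE; error
handling elided):

#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Install, then immediately remove, a single guard marker page. */
	madvise(ptr, page_size, MADV_GUARD_INSTALL);
	madvise(ptr, page_size, MADV_GUARD_REMOVE);

	/*
	 * The VMA still reports "gu" in /proc/self/smaps at this point - the
	 * removal cannot prove the whole VMA is guard-free, so the flag stays.
	 */
	return 0;
}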

We utilise 0x800 for this flag which makes it available to 32-bit
architectures also, a flag that was previously used by VM_DENYWRITE, which
was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
bee reused yet.

The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
lock (and also VMA write lock) whereas previously it did not, but this
seems a reasonable overhead.

We also update the smaps logic and documentation to identify these VMAs.

Another major use of this functionality is that we can use it to identify
that we ought to copy page tables on fork.

For anonymous mappings this is inherent, however since commit f807123d578d
 ("mm: allow guard regions in file-backed and read-only mappings") which
 allowed file-backed guard regions, we have unfortunately had to enforce
this behaviour by settings vma->anon_vma to force page table copying.

The existence of this flag removes the need for this, so we simply update
vma_needs_copy() to check for this flag instead.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 Documentation/filesystems/proc.rst |  1 +
 fs/proc/task_mmu.c                 |  1 +
 include/linux/mm.h                 |  1 +
 include/trace/events/mmflags.h     |  1 +
 mm/madvise.c                       | 22 ++++++++++++++--------
 mm/memory.c                        |  4 ++++
 tools/testing/vma/vma_internal.h   |  1 +
 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 0b86a8022fa1..b8a423ca590a 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -591,6 +591,7 @@ encoded manner. The codes are the following:
     sl    sealed
     lf    lock on fault pages
     dp    always lazily freeable mapping
+    gu    maybe contains guard regions (if not set, definitely doesn't)
     ==    =======================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..db16ed91c269 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aada935c4950..f963afa1b9de 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_MAYBE_GUARD	0x00000800	/* The VMA maybe contains guard regions. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index aa441f593e9a..a6e5a44c9b42 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	{VM_UFFD_MISSING,		"uffd_missing"	},		\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_PFNMAP,			"pfnmap"	},		\
+	{VM_MAYBE_GUARD,		"maybe_guard"	},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..216ae6ed344e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;
 
 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
+	 * It would be confusing for anonymous mappings to have page table
+	 * entries but no anon_vma established, so ensure that it is.
+	 */
+	if (vma_is_anonymous(vma))
+		anon_vma_prepare(vma);
+
+	/*
+	 * Indicate that the VMA may contain guard regions, making it visible to
+	 * the user that a VMA may contain these, narrowing down the range which
+	 * must be scanned in order to detect them.
 	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * This additionally causes page tables to be copied on fork regardless
+	 * of whether the VMA is anonymous or not, correctly preserving the
+	 * guard region page table entries.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	vm_flags_set(vma, VM_MAYBE_GUARD);
 
 	/*
 	 * Optimistically try to install the guard marker pages first. If any
@@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
-	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
 		return MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..a2c79ee43d68 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;
 
+	/* Guard regions have momdified page tables that require copying. */
+	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+		return true;
+
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.  Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index d873667704e8..e40c93edc5a7 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_MAYEXEC	0x00000040
 #define VM_GROWSDOWN	0x00000100
 #define VM_PFNMAP	0x00000400
+#define VM_MAYBE_GUARD	0x00000800
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/3] mm: implement sticky, copy on fork VMA flags
  2025-10-29 16:50 [PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
@ 2025-10-29 16:50 ` Lorenzo Stoakes
  2025-10-30  4:35   ` Suren Baghdasaryan
  2025-10-30 16:25   ` Pedro Falcato
  2025-10-29 16:50 ` [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test Lorenzo Stoakes
  2 siblings, 2 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-29 16:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

It's useful to be able to force a VMA to be copied on fork outside of the
parameters specified by vma_needs_copy(), which otherwise only copies page
tables if:

* The destination VMA has VM_UFFD_WP set
* The mapping is a PFN or mixed map
* The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)

Setting this flag implies that the page tables mapping the VMA are such
that simply re-faulting the VMA will not re-establish them in identical
form.

We introduce VM_COPY_ON_FORK to clearly identify which flags require this
behaviour, which currently is only VM_MAYBE_GUARD.

Any VMA flags which require this behaviour are inherently 'sticky', that
is, should we merge two VMAs together, this implies that the newly merged
VMA maps a range that requires page table copying on fork.

In order to implement this we must both introduce the concept of a 'sticky'
VMA flag and adjust the VMA merge logic accordingly, and also have VMA
merge still successfully succeed should one VMA have the flag set and
another not.

Note that we update the VMA expand logic to handle new VMA merging, as this
function is the one ultimately called by all instances of merging of new
VMAs.

This patch implements this, establishing VM_STICKY to contain all such
flags and VM_IGNORE_MERGE for those flags which should be ignored when
comparing adjacent VMA's flags for the purposes of merging.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it
already had this behaviour, alongside VM_STICKY as sticky flags by
implication must not disallow merge.

We update the VMA userland tests to account for the changes and,
furthermore, in order to assert that the functionality is workingly
correctly, update the new VMA and existing VMA merging logic to consider
every permutation of the flag being set/not set in all VMAs being
considered for merge.

As a result of this change, VMAs with guard ranges will now not have their
merge behaviour impacted by doing so and can be freely merged with other
VMAs without VM_MAYBE_GUARD set.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h               | 32 ++++++++++++
 mm/memory.c                      |  3 +-
 mm/vma.c                         | 22 ++++----
 tools/testing/vma/vma.c          | 89 ++++++++++++++++++++++++++++----
 tools/testing/vma/vma_internal.h | 32 ++++++++++++
 5 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f963afa1b9de..a8811ba57150 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #define VM_FLAGS_CLEAR	(ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
 
+/* Flags which should result in page tables being copied on fork. */
+#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
+
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
+ *
+ *    VM_STICKY - If one VMA has flags which most be 'sticky', that is ones
+ *                which should propagate to all VMAs, but the other does not,
+ *                the merge should still proceed with the merge logic applying
+ *                sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/mm/memory.c b/mm/memory.c
index a2c79ee43d68..9528133e5147 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;
 
-	/* Guard regions have momdified page tables that require copying. */
-	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+	if (src_vma->vm_flags & VM_COPY_ON_FORK)
 		return true;
 
 	/*
diff --git a/mm/vma.c b/mm/vma.c
index 919d1fc63a52..50a6909c4be3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
 
 	if (!mpol_equal(vmg->policy, vma_policy(vma)))
 		return false;
-	/*
-	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
-	 * match the flags but dirty bit -- the caller should mark
-	 * merged VMA as dirty. If dirty bit won't be excluded from
-	 * comparison, we increase pressure on the memory system forcing
-	 * the kernel to generate new VMAs when old one could be
-	 * extended instead.
-	 */
-	if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
+	if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
 		return false;
 	if (vma->vm_file != vmg->file)
 		return false;
@@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma)
 static __must_check struct vm_area_struct *vma_merge_existing_range(
 		struct vma_merge_struct *vmg)
 {
+	vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
 	struct vm_area_struct *middle = vmg->middle;
 	struct vm_area_struct *prev = vmg->prev;
 	struct vm_area_struct *next;
@@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (merge_right) {
 		vma_start_write(next);
 		vmg->target = next;
+		sticky_flags |= (next->vm_flags & VM_STICKY);
 	}
 
 	if (merge_left) {
 		vma_start_write(prev);
 		vmg->target = prev;
+		sticky_flags |= (prev->vm_flags & VM_STICKY);
 	}
 
 	if (merge_both) {
@@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (err || commit_merge(vmg))
 		goto abort;
 
+	vm_flags_set(vmg->target, sticky_flags);
 	khugepaged_enter_vma(vmg->target, vmg->vm_flags);
 	vmg->state = VMA_MERGE_SUCCESS;
 	return vmg->target;
@@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg)
 	bool remove_next = false;
 	struct vm_area_struct *target = vmg->target;
 	struct vm_area_struct *next = vmg->next;
+	vm_flags_t sticky_flags;
+
+	sticky_flags = vmg->vm_flags & VM_STICKY;
+	sticky_flags |= target->vm_flags & VM_STICKY;
 
 	VM_WARN_ON_VMG(!target, vmg);
 
@@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg)
 	if (next && (target != next) && (vmg->end == next->vm_end)) {
 		int ret;
 
+		sticky_flags |= next->vm_flags & VM_STICKY;
 		remove_next = true;
 		/* This should already have been checked by this point. */
 		VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
@@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg)
 	if (commit_merge(vmg))
 		goto nomem;
 
+	vm_flags_set(target, sticky_flags);
 	return 0;
 
 nomem:
@@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
 	return a->vm_end == b->vm_start &&
 		mpol_equal(vma_policy(a), vma_policy(b)) &&
 		a->vm_file == b->vm_file &&
-		!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
 		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
 }
 
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 656e1c75b711..ee9d3547c421 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma;
 #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2))
 #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
 
+#define IS_SET(_val, _flags) ((_val & _flags) == _flags)
+
 static struct task_struct __current;
 
 struct task_struct *get_current(void)
@@ -441,7 +443,7 @@ static bool test_simple_shrink(void)
 	return true;
 }
 
-static bool test_merge_new(void)
+static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
 	struct mm_struct mm = {};
@@ -469,23 +471,32 @@ static bool test_merge_new(void)
 	struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d;
 	bool merged;
 
+	if (is_sticky)
+		vm_flags |= VM_STICKY;
+
 	/*
 	 * 0123456789abc
 	 * AA B       CC
 	 */
 	vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
 	ASSERT_NE(vma_a, NULL);
+	if (a_is_sticky)
+		vm_flags_set(vma_a, VM_STICKY);
 	/* We give each VMA a single avc so we can test anon_vma duplication. */
 	INIT_LIST_HEAD(&vma_a->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
 
 	vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
 	ASSERT_NE(vma_b, NULL);
+	if (b_is_sticky)
+		vm_flags_set(vma_b, VM_STICKY);
 	INIT_LIST_HEAD(&vma_b->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
 
 	vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
 	ASSERT_NE(vma_c, NULL);
+	if (c_is_sticky)
+		vm_flags_set(vma_c, VM_STICKY);
 	INIT_LIST_HEAD(&vma_c->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
 
@@ -520,6 +531,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky || a_is_sticky || b_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to PREVIOUS VMA.
@@ -537,6 +550,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky || a_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to NEXT VMA.
@@ -556,6 +571,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky) /* D uses is_sticky. */
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge BOTH sides.
@@ -574,6 +591,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (is_sticky || a_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to NEXT VMA.
@@ -592,6 +611,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (is_sticky || c_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge BOTH sides.
@@ -609,6 +630,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 1);
+	if (is_sticky || a_is_sticky || c_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Final state.
@@ -637,6 +660,20 @@ static bool test_merge_new(void)
 	return true;
 }
 
+static bool test_merge_new(void)
+{
+	int i, j, k, l;
+
+	/* Generate every possible permutation of sticky flags. */
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < 2; j++)
+			for (k = 0; k < 2; k++)
+				for (l = 0; l < 2; l++)
+					ASSERT_TRUE(__test_merge_new(i, j, k, l));
+
+	return true;
+}
+
 static bool test_vma_merge_special_flags(void)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
@@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void)
 	return true;
 }
 
-static bool test_merge_existing(void)
+static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
+	vm_flags_t prev_flags = vm_flags;
+	vm_flags_t next_flags = vm_flags;
 	struct mm_struct mm = {};
 	VMA_ITERATOR(vmi, &mm, 0);
 	struct vm_area_struct *vma, *vma_prev, *vma_next;
@@ -988,6 +1027,13 @@ static bool test_merge_existing(void)
 	};
 	struct anon_vma_chain avc = {};
 
+	if (prev_is_sticky)
+		prev_flags |= VM_STICKY;
+	if (middle_is_sticky)
+		vm_flags |= VM_STICKY;
+	if (next_is_sticky)
+		next_flags |= VM_STICKY;
+
 	/*
 	 * Merge right case - partial span.
 	 *
@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void)
 	 */
 	vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
 	vma->vm_ops = &vm_ops; /* This should have no impact. */
-	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
 	vma_next->vm_ops = &vm_ops; /* This should have no impact. */
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
 	vmg.middle = vma;
@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void)
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_TRUE(vma_write_started(vma_next));
 	ASSERT_EQ(mm.map_count, 2);
+	if (middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void)
 	 *   NNNNNNN
 	 */
 	vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
 	vma_next->vm_ops = &vm_ops; /* This should have no impact. */
 	vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
 	vmg.middle = vma;
@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_next));
 	ASSERT_EQ(mm.map_count, 1);
+	if (middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPV
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
 	vma->vm_ops = &vm_ops; /* This should have no impact. */
@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void)
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (prev_is_sticky || middle_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPP
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_EQ(mm.map_count, 1);
+	if (prev_is_sticky || middle_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPPPPP
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
 	vmg.prev = vma_prev;
 	vmg.middle = vma;
@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_EQ(mm.map_count, 1);
+	if (prev_is_sticky || middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted prev and next. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void)
 	 * PPPVVVVVNNN
 	 */
 
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
 
 	vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
 	vmg.prev = vma;
@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void)
 	return true;
 }
 
+static bool test_merge_existing(void)
+{
+	int i, j, k;
+
+	/* Generate every possible permutation of sticky flags. */
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < 2; j++)
+			for (k = 0; k < 2; k++)
+				ASSERT_TRUE(__test_merge_existing(i, j, k));
+
+	return true;
+}
+
 static bool test_anon_vma_non_mergeable(void)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index e40c93edc5a7..3d9cb3a9411a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_SEALED	VM_NONE
 #endif
 
+/* Flags which should result in page tables being copied on fork. */
+#define VM_COPY_ON_FORK VM_MAYBE_GUARD
+
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
+ * possesses it but the other does not, the merged VMA should nonetheless have
+ * applied to it:
+ *
+ * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
+ *                   metadata which should be unconditionally propagated upon
+ *                   fork. When merging two VMAs, we encapsulate this range in
+ *                   the merged VMA, so the flag should be 'sticky' as a result.
+ */
+#define VM_STICKY VM_COPY_ON_FORK
+
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
+ *
+ *    VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
+ *                which should propagate to all VMAs, but the other does not,
+ *                the merge should still proceed with the merge logic applying
+ *                sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
+
 #define FIRST_USER_ADDRESS	0UL
 #define USER_PGTABLES_CEILING	0UL
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test
  2025-10-29 16:50 [PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
  2025-10-29 16:50 ` [PATCH 2/3] mm: implement sticky, copy on fork VMA flags Lorenzo Stoakes
@ 2025-10-29 16:50 ` Lorenzo Stoakes
  2025-10-30  4:40   ` Suren Baghdasaryan
  2 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-29 16:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

Assert that we observe guard regions appearing in /proc/$pid/smaps as
expected, including when splits and merges are performed (with the expected
sticky behaviour).

Also add handling for file systems which don't sanely handle mmap() VMA
merging, so we don't incorrectly report a test failure in this situation.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 3 files changed, 126 insertions(+)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 8dd81c0a4a5a..a9be11e03a6a 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self,
 	case ANON_BACKED:
 		flags |= MAP_PRIVATE | MAP_ANON;
 		fd = -1;
+		offset = 0;
 		break;
 	case SHMEM_BACKED:
 	case LOCAL_FILE_BACKED:
@@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr)
 	return true;
 }
 
+/*
+ * Some file systems have issues with merging due to changing merge-sensitive
+ * parameters in the .mmap callback, and prior to .mmap_prepare being
+ * implemented everywhere this will now result in an unexpected failure to
+ * merge (e.g. - overlayfs).
+ *
+ * Perform a simple test to see if the local file system suffers from this, if
+ * it does then we can skip test logic that assumes local file system merging is
+ * sane.
+ */
+static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self,
+				   const FIXTURE_VARIANT(guard_regions) * variant)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr2;
+	struct procmap_fd procmap;
+
+	if (variant->backing != LOCAL_FILE_BACKED)
+		return true;
+
+	/* Map 10 pages. */
+	ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+	if (ptr == MAP_FAILED)
+		return false;
+	/* Unmap the middle. */
+	munmap(&ptr[5 * page_size], page_size);
+
+	/* Map again. */
+	ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
+		     MAP_FIXED, 5 * page_size);
+
+	if (ptr2 == MAP_FAILED)
+		return false;
+
+	/* Now make sure they all merged. */
+	if (open_self_procmap(&procmap) != 0)
+		return false;
+	if (!find_vma_procmap(&procmap, ptr))
+		return false;
+	if (procmap.query.vma_start != (unsigned long)ptr)
+		return false;
+	if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size)
+		return false;
+	close_procmap(&procmap);
+
+	return true;
+}
+
 FIXTURE_SETUP(guard_regions)
 {
 	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
@@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan)
 	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
 }
 
+TEST_F(guard_regions, smaps)
+{
+	const unsigned long page_size = self->page_size;
+	struct procmap_fd procmap;
+	char *ptr, *ptr2;
+	int i;
+
+	/* Map a region. */
+	ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* We shouldn't yet see a guard flag. */
+	ASSERT_FALSE(check_vmflag_guard(ptr));
+
+	/* Install a single guard region. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Now we should see a guard flag. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+
+	/*
+	 * Removing the guard region should not change things because we simply
+	 * cannot accurately track whether a given VMA has had all of its guard
+	 * regions removed.
+	 */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0);
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+
+	/* Install guard regions throughout. */
+	for (i = 0; i < 10; i++) {
+		ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0);
+		/* We should always see the guard region flag. */
+		ASSERT_TRUE(check_vmflag_guard(ptr));
+	}
+
+	/* Split into two VMAs. */
+	ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0);
+
+	/* Both VMAs should have the guard flag set. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+	ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size]));
+
+	/*
+	 * If the local file system is unable to merge VMAs due to having
+	 * unusual characteristics, there is no point in asserting merge
+	 * behaviour.
+	 */
+	if (!local_fs_has_sane_mmap(self, variant)) {
+		TH_LOG("local filesystem does not support sane merging, skipping merge test");
+		return;
+	}
+
+	/* Map a fresh VMA between the two split VMAs. */
+	ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size,
+		     PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Check the procmap to ensure that this VMA merged with the adjacent
+	 * two. The guard region flag is 'sticky' so should not preclude
+	 * merging.
+	 */
+	ASSERT_EQ(open_self_procmap(&procmap), 0);
+	ASSERT_TRUE(find_vma_procmap(&procmap, ptr));
+	ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size);
+	ASSERT_EQ(close_procmap(&procmap), 0);
+	/* And, of course, this VMA should have the guard flag set. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+}
+
 TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index e33cda301dad..605cb58ea5c3 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr)
 	return check_vmflag(addr, "pf");
 }
 
+bool check_vmflag_guard(void *addr)
+{
+	return check_vmflag(addr, "gu");
+}
+
 bool softdirty_supported(void)
 {
 	char *addr;
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 26c30fdc0241..a8abdf414d46 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
 unsigned long get_free_hugepages(void);
 bool check_vmflag_io(void *addr);
 bool check_vmflag_pfnmap(void *addr);
+bool check_vmflag_guard(void *addr);
 int open_procmap(pid_t pid, struct procmap_fd *procmap_out);
 int query_procmap(struct procmap_fd *procmap);
 bool find_vma_procmap(struct procmap_fd *procmap, void *address);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
@ 2025-10-29 19:50   ` Randy Dunlap
  2025-10-30  8:13     ` Lorenzo Stoakes
  2025-10-30  1:05   ` Suren Baghdasaryan
  2025-10-30 16:16   ` Pedro Falcato
  2 siblings, 1 reply; 28+ messages in thread
From: Randy Dunlap @ 2025-10-29 19:50 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin


On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:

> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..a2c79ee43d68 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>  	if (src_vma->anon_vma)
>  		return true;
>  
> +	/* Guard regions have momdified page tables that require copying. */

	                      modified

> +	if (src_vma->vm_flags & VM_MAYBE_GUARD)
> +		return true;
> +
>  	/*
>  	 * Don't copy ptes where a page fault will fill them correctly.  Fork
>  	 * becomes much lighter when there are big shared or private readonly
-- 
~Randy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
  2025-10-29 19:50   ` Randy Dunlap
@ 2025-10-30  1:05   ` Suren Baghdasaryan
  2025-10-30  8:22     ` Lorenzo Stoakes
  2025-10-30 16:16   ` Pedro Falcato
  2 siblings, 1 reply; 28+ messages in thread
From: Suren Baghdasaryan @ 2025-10-30  1:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Currently, if a user needs to determine if guard regions are present in a
> range, they have to scan all VMAs (or have knowledge of which ones might
> have guard regions).
>
> Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> operation at a virtual address level.
>
> This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> that guard regions exist in ranges.
>
> This patch remedies the situation by establishing a new VMA flag,
> VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> uncertain because we cannot reasonably determine whether a
> MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> additionally VMAs may change across merge/split).

nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would
better represent the meaning.

>
> We utilise 0x800 for this flag which makes it available to 32-bit
> architectures also, a flag that was previously used by VM_DENYWRITE, which
> was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> bee reused yet.

s/bee/been
but I'm not even sure the above paragraph has to be included in the
changelog. It's a technical detail IMHO.

>
> The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> lock (and also VMA write lock) whereas previously it did not, but this
> seems a reasonable overhead.

I guess this is because it is modifying vm_flags now?

>
> We also update the smaps logic and documentation to identify these VMAs.
>
> Another major use of this functionality is that we can use it to identify
> that we ought to copy page tables on fork.
>
> For anonymous mappings this is inherent, however since commit f807123d578d
>  ("mm: allow guard regions in file-backed and read-only mappings") which
>  allowed file-backed guard regions, we have unfortunately had to enforce
> this behaviour by settings vma->anon_vma to force page table copying.
>
> The existence of this flag removes the need for this, so we simply update
> vma_needs_copy() to check for this flag instead.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Overall, makes sense to me and I think we could use it.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

It would be nice to have a way for userspace to reset this flag if it
confirms that the VMA does not really have any guards (using say
PAGEMAP_SCAN) but I think such an API can be abused.

> ---
>  Documentation/filesystems/proc.rst |  1 +
>  fs/proc/task_mmu.c                 |  1 +
>  include/linux/mm.h                 |  1 +
>  include/trace/events/mmflags.h     |  1 +
>  mm/madvise.c                       | 22 ++++++++++++++--------
>  mm/memory.c                        |  4 ++++
>  tools/testing/vma/vma_internal.h   |  1 +
>  7 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 0b86a8022fa1..b8a423ca590a 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -591,6 +591,7 @@ encoded manner. The codes are the following:
>      sl    sealed
>      lf    lock on fault pages
>      dp    always lazily freeable mapping
> +    gu    maybe contains guard regions (if not set, definitely doesn't)
>      ==    =======================================
>
>  Note that there is no guarantee that every flag and associated mnemonic will
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index fc35a0543f01..db16ed91c269 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>                 [ilog2(VM_MAYSHARE)]    = "ms",
>                 [ilog2(VM_GROWSDOWN)]   = "gd",
>                 [ilog2(VM_PFNMAP)]      = "pf",
> +               [ilog2(VM_MAYBE_GUARD)] = "gu",
>                 [ilog2(VM_LOCKED)]      = "lo",
>                 [ilog2(VM_IO)]          = "io",
>                 [ilog2(VM_SEQ_READ)]    = "sr",
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index aada935c4950..f963afa1b9de 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_UFFD_MISSING        0
>  #endif /* CONFIG_MMU */
>  #define VM_PFNMAP      0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
> +#define VM_MAYBE_GUARD 0x00000800      /* The VMA maybe contains guard regions. */
>  #define VM_UFFD_WP     0x00001000      /* wrprotect pages tracking */
>
>  #define VM_LOCKED      0x00002000
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index aa441f593e9a..a6e5a44c9b42 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
>         {VM_UFFD_MISSING,               "uffd_missing"  },              \
>  IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,      "uffd_minor"    )               \
>         {VM_PFNMAP,                     "pfnmap"        },              \
> +       {VM_MAYBE_GUARD,                "maybe_guard"   },              \
>         {VM_UFFD_WP,                    "uffd_wp"       },              \
>         {VM_LOCKED,                     "locked"        },              \
>         {VM_IO,                         "io"            },              \
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fb1c86e630b6..216ae6ed344e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
>                 return -EINVAL;
>
>         /*
> -        * If we install guard markers, then the range is no longer
> -        * empty from a page table perspective and therefore it's
> -        * appropriate to have an anon_vma.
> +        * It would be confusing for anonymous mappings to have page table
> +        * entries but no anon_vma established, so ensure that it is.
> +        */
> +       if (vma_is_anonymous(vma))
> +               anon_vma_prepare(vma);
> +
> +       /*
> +        * Indicate that the VMA may contain guard regions, making it visible to
> +        * the user that a VMA may contain these, narrowing down the range which
> +        * must be scanned in order to detect them.
>          *
> -        * This ensures that on fork, we copy page tables correctly.
> +        * This additionally causes page tables to be copied on fork regardless
> +        * of whether the VMA is anonymous or not, correctly preserving the
> +        * guard region page table entries.
>          */
> -       err = anon_vma_prepare(vma);
> -       if (err)
> -               return err;
> +       vm_flags_set(vma, VM_MAYBE_GUARD);
>
>         /*
>          * Optimistically try to install the guard marker pages first. If any
> @@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
>         case MADV_POPULATE_READ:
>         case MADV_POPULATE_WRITE:
>         case MADV_COLLAPSE:
> -       case MADV_GUARD_INSTALL:
>         case MADV_GUARD_REMOVE:
>                 return MADVISE_MMAP_READ_LOCK;
>         case MADV_DONTNEED:
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..a2c79ee43d68 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>         if (src_vma->anon_vma)
>                 return true;
>
> +       /* Guard regions have momdified page tables that require copying. */
> +       if (src_vma->vm_flags & VM_MAYBE_GUARD)
> +               return true;
> +
>         /*
>          * Don't copy ptes where a page fault will fill them correctly.  Fork
>          * becomes much lighter when there are big shared or private readonly
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index d873667704e8..e40c93edc5a7 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
>  #define VM_MAYEXEC     0x00000040
>  #define VM_GROWSDOWN   0x00000100
>  #define VM_PFNMAP      0x00000400
> +#define VM_MAYBE_GUARD 0x00000800
>  #define VM_LOCKED      0x00002000
>  #define VM_IO           0x00004000
>  #define VM_SEQ_READ    0x00008000      /* App will access data sequentially */
> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags
  2025-10-29 16:50 ` [PATCH 2/3] mm: implement sticky, copy on fork VMA flags Lorenzo Stoakes
@ 2025-10-30  4:35   ` Suren Baghdasaryan
  2025-10-30  8:25     ` Lorenzo Stoakes
  2025-10-30 16:25   ` Pedro Falcato
  1 sibling, 1 reply; 28+ messages in thread
From: Suren Baghdasaryan @ 2025-10-30  4:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> It's useful to be able to force a VMA to be copied on fork outside of the
> parameters specified by vma_needs_copy(), which otherwise only copies page
> tables if:
>
> * The destination VMA has VM_UFFD_WP set
> * The mapping is a PFN or mixed map
> * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
>
> Setting this flag implies that the page tables mapping the VMA are such
> that simply re-faulting the VMA will not re-establish them in identical
> form.
>
> We introduce VM_COPY_ON_FORK to clearly identify which flags require this
> behaviour, which currently is only VM_MAYBE_GUARD.
>
> Any VMA flags which require this behaviour are inherently 'sticky', that
> is, should we merge two VMAs together, this implies that the newly merged
> VMA maps a range that requires page table copying on fork.
>
> In order to implement this we must both introduce the concept of a 'sticky'
> VMA flag and adjust the VMA merge logic accordingly, and also have VMA
> merge still successfully succeed should one VMA have the flag set and
> another not.

"successfully succeed" sounds weird. Just "succeed"?

>
> Note that we update the VMA expand logic to handle new VMA merging, as this
> function is the one ultimately called by all instances of merging of new
> VMAs.
>
> This patch implements this, establishing VM_STICKY to contain all such
> flags and VM_IGNORE_MERGE for those flags which should be ignored when
> comparing adjacent VMA's flags for the purposes of merging.
>
> As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it
> already had this behaviour, alongside VM_STICKY as sticky flags by
> implication must not disallow merge.
>
> We update the VMA userland tests to account for the changes and,
> furthermore, in order to assert that the functionality is workingly

s/workingly/working

> correctly, update the new VMA and existing VMA merging logic to consider
> every permutation of the flag being set/not set in all VMAs being
> considered for merge.
>
> As a result of this change, VMAs with guard ranges will now not have their
> merge behaviour impacted by doing so and can be freely merged with other
> VMAs without VM_MAYBE_GUARD set.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/mm.h               | 32 ++++++++++++
>  mm/memory.c                      |  3 +-
>  mm/vma.c                         | 22 ++++----
>  tools/testing/vma/vma.c          | 89 ++++++++++++++++++++++++++++----
>  tools/testing/vma/vma_internal.h | 32 ++++++++++++
>  5 files changed, 156 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f963afa1b9de..a8811ba57150 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp);
>  #endif
>  #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
>
> +/* Flags which should result in page tables being copied on fork. */
> +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
> +
> +/*
> + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
> + * possesses it but the other does not, the merged VMA should nonetheless have
> + * applied to it:
> + *
> + * VM_COPY_ON_FORK - These flags indicates that a VMA maps a range that contains
> + *                   metadata which should be unconditionally propagated upon
> + *                   fork. When merging two VMAs, we encapsulate this range in
> + *                   the merged VMA, so the flag should be 'sticky' as a result.

It's probably worth noting that after a split, we do not remove
"sticky" flags even if the VMA acquired them as a result of a previous
merge.

> + */
> +#define VM_STICKY VM_COPY_ON_FORK
> +
> +/*
> + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
> + * of these flags and the other not does not preclude a merge.
> + *
> + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
> + *                dirty bit -- the caller should mark merged VMA as dirty. If
> + *                dirty bit won't be excluded from comparison, we increase
> + *                pressure on the memory system forcing the kernel to generate
> + *                new VMAs when old one could be extended instead.
> + *
> + *    VM_STICKY - If one VMA has flags which most be 'sticky', that is ones

s/most/must ?

> + *                which should propagate to all VMAs, but the other does not,
> + *                the merge should still proceed with the merge logic applying
> + *                sticky flags to the final VMA.
> + */
> +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
> +
>  /*
>   * mapping from the currently active vm_flags protection bits (the
>   * low four bits) to a page protection mask..
> diff --git a/mm/memory.c b/mm/memory.c
> index a2c79ee43d68..9528133e5147 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>         if (src_vma->anon_vma)
>                 return true;
>
> -       /* Guard regions have momdified page tables that require copying. */
> -       if (src_vma->vm_flags & VM_MAYBE_GUARD)
> +       if (src_vma->vm_flags & VM_COPY_ON_FORK)
>                 return true;
>
>         /*
> diff --git a/mm/vma.c b/mm/vma.c
> index 919d1fc63a52..50a6909c4be3 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
>
>         if (!mpol_equal(vmg->policy, vma_policy(vma)))
>                 return false;
> -       /*
> -        * VM_SOFTDIRTY should not prevent from VMA merging, if we
> -        * match the flags but dirty bit -- the caller should mark
> -        * merged VMA as dirty. If dirty bit won't be excluded from
> -        * comparison, we increase pressure on the memory system forcing
> -        * the kernel to generate new VMAs when old one could be
> -        * extended instead.
> -        */
> -       if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
> +       if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
>                 return false;
>         if (vma->vm_file != vmg->file)
>                 return false;
> @@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma)
>  static __must_check struct vm_area_struct *vma_merge_existing_range(
>                 struct vma_merge_struct *vmg)
>  {
> +       vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
>         struct vm_area_struct *middle = vmg->middle;
>         struct vm_area_struct *prev = vmg->prev;
>         struct vm_area_struct *next;
> @@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
>         if (merge_right) {
>                 vma_start_write(next);
>                 vmg->target = next;
> +               sticky_flags |= (next->vm_flags & VM_STICKY);
>         }
>
>         if (merge_left) {
>                 vma_start_write(prev);
>                 vmg->target = prev;
> +               sticky_flags |= (prev->vm_flags & VM_STICKY);
>         }
>
>         if (merge_both) {
> @@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
>         if (err || commit_merge(vmg))
>                 goto abort;
>
> +       vm_flags_set(vmg->target, sticky_flags);
>         khugepaged_enter_vma(vmg->target, vmg->vm_flags);
>         vmg->state = VMA_MERGE_SUCCESS;
>         return vmg->target;
> @@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg)
>         bool remove_next = false;
>         struct vm_area_struct *target = vmg->target;
>         struct vm_area_struct *next = vmg->next;
> +       vm_flags_t sticky_flags;
> +
> +       sticky_flags = vmg->vm_flags & VM_STICKY;
> +       sticky_flags |= target->vm_flags & VM_STICKY;
>
>         VM_WARN_ON_VMG(!target, vmg);
>
> @@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg)
>         if (next && (target != next) && (vmg->end == next->vm_end)) {
>                 int ret;
>
> +               sticky_flags |= next->vm_flags & VM_STICKY;
>                 remove_next = true;
>                 /* This should already have been checked by this point. */
>                 VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
> @@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg)
>         if (commit_merge(vmg))
>                 goto nomem;
>
> +       vm_flags_set(target, sticky_flags);
>         return 0;
>
>  nomem:
> @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
>         return a->vm_end == b->vm_start &&
>                 mpol_equal(vma_policy(a), vma_policy(b)) &&
>                 a->vm_file == b->vm_file &&
> -               !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
> +               !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
>                 b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
>  }
>
> diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> index 656e1c75b711..ee9d3547c421 100644
> --- a/tools/testing/vma/vma.c
> +++ b/tools/testing/vma/vma.c

I prefer tests in a separate patch, but that might just be me. Feel
free to ignore.

> @@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma;
>  #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2))
>  #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
>
> +#define IS_SET(_val, _flags) ((_val & _flags) == _flags)
> +
>  static struct task_struct __current;
>
>  struct task_struct *get_current(void)
> @@ -441,7 +443,7 @@ static bool test_simple_shrink(void)
>         return true;
>  }
>
> -static bool test_merge_new(void)
> +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky)
>  {
>         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
>         struct mm_struct mm = {};
> @@ -469,23 +471,32 @@ static bool test_merge_new(void)
>         struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d;
>         bool merged;
>
> +       if (is_sticky)
> +               vm_flags |= VM_STICKY;
> +
>         /*
>          * 0123456789abc
>          * AA B       CC
>          */
>         vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
>         ASSERT_NE(vma_a, NULL);
> +       if (a_is_sticky)
> +               vm_flags_set(vma_a, VM_STICKY);
>         /* We give each VMA a single avc so we can test anon_vma duplication. */
>         INIT_LIST_HEAD(&vma_a->anon_vma_chain);
>         list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
>
>         vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
>         ASSERT_NE(vma_b, NULL);
> +       if (b_is_sticky)
> +               vm_flags_set(vma_b, VM_STICKY);
>         INIT_LIST_HEAD(&vma_b->anon_vma_chain);
>         list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
>
>         vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
>         ASSERT_NE(vma_c, NULL);
> +       if (c_is_sticky)
> +               vm_flags_set(vma_c, VM_STICKY);
>         INIT_LIST_HEAD(&vma_c->anon_vma_chain);
>         list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
>
> @@ -520,6 +531,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 3);
> +       if (is_sticky || a_is_sticky || b_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Merge to PREVIOUS VMA.
> @@ -537,6 +550,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 3);
> +       if (is_sticky || a_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Merge to NEXT VMA.
> @@ -556,6 +571,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 3);
> +       if (is_sticky) /* D uses is_sticky. */
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Merge BOTH sides.
> @@ -574,6 +591,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 2);
> +       if (is_sticky || a_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Merge to NEXT VMA.
> @@ -592,6 +611,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 2);
> +       if (is_sticky || c_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Merge BOTH sides.
> @@ -609,6 +630,8 @@ static bool test_merge_new(void)
>         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 1);
> +       if (is_sticky || a_is_sticky || c_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
>
>         /*
>          * Final state.
> @@ -637,6 +660,20 @@ static bool test_merge_new(void)
>         return true;
>  }
>
> +static bool test_merge_new(void)
> +{
> +       int i, j, k, l;
> +
> +       /* Generate every possible permutation of sticky flags. */
> +       for (i = 0; i < 2; i++)
> +               for (j = 0; j < 2; j++)
> +                       for (k = 0; k < 2; k++)
> +                               for (l = 0; l < 2; l++)
> +                                       ASSERT_TRUE(__test_merge_new(i, j, k, l));
> +
> +       return true;
> +}
> +
>  static bool test_vma_merge_special_flags(void)
>  {
>         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void)
>         return true;
>  }
>
> -static bool test_merge_existing(void)
> +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky)
>  {
>         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> +       vm_flags_t prev_flags = vm_flags;
> +       vm_flags_t next_flags = vm_flags;
>         struct mm_struct mm = {};
>         VMA_ITERATOR(vmi, &mm, 0);
>         struct vm_area_struct *vma, *vma_prev, *vma_next;
> @@ -988,6 +1027,13 @@ static bool test_merge_existing(void)
>         };
>         struct anon_vma_chain avc = {};
>
> +       if (prev_is_sticky)
> +               prev_flags |= VM_STICKY;
> +       if (middle_is_sticky)
> +               vm_flags |= VM_STICKY;
> +       if (next_is_sticky)
> +               next_flags |= VM_STICKY;
> +
>         /*
>          * Merge right case - partial span.
>          *
> @@ -1000,7 +1046,7 @@ static bool test_merge_existing(void)
>          */
>         vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
>         vma->vm_ops = &vm_ops; /* This should have no impact. */
> -       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
> +       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
>         vma_next->vm_ops = &vm_ops; /* This should have no impact. */
>         vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
>         vmg.middle = vma;
> @@ -1018,6 +1064,8 @@ static bool test_merge_existing(void)
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_TRUE(vma_write_started(vma_next));
>         ASSERT_EQ(mm.map_count, 2);
> +       if (middle_is_sticky || next_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
>
>         /* Clear down and reset. */
>         ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1033,7 +1081,7 @@ static bool test_merge_existing(void)
>          *   NNNNNNN
>          */
>         vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
> -       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
> +       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
>         vma_next->vm_ops = &vm_ops; /* This should have no impact. */
>         vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
>         vmg.middle = vma;
> @@ -1046,6 +1094,8 @@ static bool test_merge_existing(void)
>         ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma_next));
>         ASSERT_EQ(mm.map_count, 1);
> +       if (middle_is_sticky || next_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
>
>         /* Clear down and reset. We should have deleted vma. */
>         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1060,7 +1110,7 @@ static bool test_merge_existing(void)
>          * 0123456789
>          * PPPPPPV
>          */
> -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
>         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
>         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
>         vma->vm_ops = &vm_ops; /* This should have no impact. */
> @@ -1080,6 +1130,8 @@ static bool test_merge_existing(void)
>         ASSERT_TRUE(vma_write_started(vma_prev));
>         ASSERT_TRUE(vma_write_started(vma));
>         ASSERT_EQ(mm.map_count, 2);
> +       if (prev_is_sticky || middle_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
>
>         /* Clear down and reset. */
>         ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1094,7 +1146,7 @@ static bool test_merge_existing(void)
>          * 0123456789
>          * PPPPPPP
>          */
> -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
>         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
>         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
>         vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
> @@ -1109,6 +1161,8 @@ static bool test_merge_existing(void)
>         ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma_prev));
>         ASSERT_EQ(mm.map_count, 1);
> +       if (prev_is_sticky || middle_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
>
>         /* Clear down and reset. We should have deleted vma. */
>         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1123,10 +1177,10 @@ static bool test_merge_existing(void)
>          * 0123456789
>          * PPPPPPPPPP
>          */
> -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
>         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
>         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
> -       vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
> +       vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
>         vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
>         vmg.prev = vma_prev;
>         vmg.middle = vma;
> @@ -1139,6 +1193,8 @@ static bool test_merge_existing(void)
>         ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
>         ASSERT_TRUE(vma_write_started(vma_prev));
>         ASSERT_EQ(mm.map_count, 1);
> +       if (prev_is_sticky || middle_is_sticky || next_is_sticky)
> +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
>
>         /* Clear down and reset. We should have deleted prev and next. */
>         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1158,9 +1214,9 @@ static bool test_merge_existing(void)
>          * PPPVVVVVNNN
>          */
>
> -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
>         vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
> -       vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
> +       vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
>
>         vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
>         vmg.prev = vma;
> @@ -1203,6 +1259,19 @@ static bool test_merge_existing(void)
>         return true;
>  }
>
> +static bool test_merge_existing(void)
> +{
> +       int i, j, k;
> +
> +       /* Generate every possible permutation of sticky flags. */
> +       for (i = 0; i < 2; i++)
> +               for (j = 0; j < 2; j++)
> +                       for (k = 0; k < 2; k++)
> +                               ASSERT_TRUE(__test_merge_existing(i, j, k));
> +
> +       return true;
> +}
> +
>  static bool test_anon_vma_non_mergeable(void)
>  {
>         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index e40c93edc5a7..3d9cb3a9411a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr;
>  #define VM_SEALED      VM_NONE
>  #endif
>
> +/* Flags which should result in page tables being copied on fork. */
> +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
> +
> +/*
> + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
> + * possesses them but the other does not, the merged VMA should nonetheless have
> + * them applied to it:
> + *
> + * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
> + *                   metadata which should be unconditionally propagated upon
> + *                   fork. When merging two VMAs, we encapsulate this range in
> + *                   the merged VMA, so the flag should be 'sticky' as a result.
> + */
> +#define VM_STICKY VM_COPY_ON_FORK
> +
> +/*
> + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
> + * of these flags and the other not does not preclude a merge.
> + *
> + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
> + *                dirty bit -- the caller should mark merged VMA as dirty. If
> + *                dirty bit won't be excluded from comparison, we increase
> + *                pressure on the memory system forcing the kernel to generate
> + *                new VMAs when old one could be extended instead.
> + *
> + *    VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
> + *                which should propagate to all VMAs, but the other does not,
> + *                the merge should still proceed with the merge logic applying
> + *                sticky flags to the final VMA.
> + */
> +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
> +
>  #define FIRST_USER_ADDRESS     0UL
>  #define USER_PGTABLES_CEILING  0UL
>
> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test
  2025-10-29 16:50 ` [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test Lorenzo Stoakes
@ 2025-10-30  4:40   ` Suren Baghdasaryan
  2025-10-30  8:25     ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Suren Baghdasaryan @ 2025-10-30  4:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Assert that we observe guard regions appearing in /proc/$pid/smaps as
> expected, and when split/merge is performed too (with expected sticky
> behaviour).
>
> Also add handling for file systems which don't sanely handle mmap() VMA
> merging so we don't incorrectly encounter a test failure in this situation.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++
>  tools/testing/selftests/mm/vm_util.c       |   5 +
>  tools/testing/selftests/mm/vm_util.h       |   1 +
>  3 files changed, 126 insertions(+)
>
> diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
> index 8dd81c0a4a5a..a9be11e03a6a 100644
> --- a/tools/testing/selftests/mm/guard-regions.c
> +++ b/tools/testing/selftests/mm/guard-regions.c
> @@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self,
>         case ANON_BACKED:
>                 flags |= MAP_PRIVATE | MAP_ANON;
>                 fd = -1;
> +               offset = 0;
>                 break;
>         case SHMEM_BACKED:
>         case LOCAL_FILE_BACKED:
> @@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr)
>         return true;
>  }
>
> +/*
> + * Some file systems have issues with merging due to changing merge-sensitive
> + * parameters in the .mmap callback, and prior to .mmap_prepare being
> + * implemented everywhere this will now result in an unexpected failure to
> + * merge (e.g. - overlayfs).
> + *
> + * Perform a simple test to see if the local file system suffers from this, if
> + * it does then we can skip test logic that assumes local file system merging is
> + * sane.
> + */
> +static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self,
> +                                  const FIXTURE_VARIANT(guard_regions) * variant)
> +{
> +       const unsigned long page_size = self->page_size;
> +       char *ptr, *ptr2;
> +       struct procmap_fd procmap;
> +
> +       if (variant->backing != LOCAL_FILE_BACKED)
> +               return true;
> +
> +       /* Map 10 pages. */
> +       ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
> +       if (ptr == MAP_FAILED)
> +               return false;
> +       /* Unmap the middle. */
> +       munmap(&ptr[5 * page_size], page_size);
> +
> +       /* Map again. */
> +       ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
> +                    MAP_FIXED, 5 * page_size);
> +
> +       if (ptr2 == MAP_FAILED)
> +               return false;
> +
> +       /* Now make sure they all merged. */
> +       if (open_self_procmap(&procmap) != 0)
> +               return false;
> +       if (!find_vma_procmap(&procmap, ptr))
> +               return false;
> +       if (procmap.query.vma_start != (unsigned long)ptr)
> +               return false;
> +       if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size)
> +               return false;
> +       close_procmap(&procmap);
> +
> +       return true;
> +}
> +
>  FIXTURE_SETUP(guard_regions)
>  {
>         self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
> @@ -2138,4 +2187,75 @@ TEST_F(guard_regions, pagemap_scan)
>         ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
>  }
>
> +TEST_F(guard_regions, smaps)
> +{
> +       const unsigned long page_size = self->page_size;
> +       struct procmap_fd procmap;
> +       char *ptr, *ptr2;
> +       int i;
> +
> +       /* Map a region. */
> +       ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
> +       ASSERT_NE(ptr, MAP_FAILED);
> +
> +       /* We shouldn't yet see a guard flag. */
> +       ASSERT_FALSE(check_vmflag_guard(ptr));
> +
> +       /* Install a single guard region. */
> +       ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
> +
> +       /* Now we should see a guard flag. */
> +       ASSERT_TRUE(check_vmflag_guard(ptr));
> +
> +       /*
> +        * Removing the guard region should not change things because we simply
> +        * cannot accurately track whether a given VMA has had all of its guard
> +        * regions removed.
> +        */
> +       ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0);
> +       ASSERT_TRUE(check_vmflag_guard(ptr));
> +
> +       /* Install guard regions throughout. */
> +       for (i = 0; i < 10; i++) {
> +               ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0);
> +               /* We should always see the guard region flag. */
> +               ASSERT_TRUE(check_vmflag_guard(ptr));
> +       }
> +
> +       /* Split into two VMAs. */
> +       ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0);
> +
> +       /* Both VMAs should have the guard flag set. */
> +       ASSERT_TRUE(check_vmflag_guard(ptr));
> +       ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size]));
> +
> +       /*
> +        * If the local file system is unable to merge VMAs due to having
> +        * unusual characteristics, there is no point in asserting merge
> +        * behaviour.
> +        */
> +       if (!local_fs_has_sane_mmap(self, variant)) {
> +               TH_LOG("local filesystem does not support sane merging, skipping merge test");
> +               return;
> +       }
> +
> +       /* Map a fresh VMA between the two split VMAs. */
> +       ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size,
> +                    PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size);
> +       ASSERT_NE(ptr2, MAP_FAILED);
> +
> +       /*
> +        * Check the procmap to ensure that this VMA merged with the adjacent
> +        * two. The guard region flag is 'sticky' so should not preclude
> +        * merging.
> +        */
> +       ASSERT_EQ(open_self_procmap(&procmap), 0);
> +       ASSERT_TRUE(find_vma_procmap(&procmap, ptr));
> +       ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);
> +       ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size);
> +       ASSERT_EQ(close_procmap(&procmap), 0);
> +       /* And, of course, this VMA should have the guard flag set. */
> +       ASSERT_TRUE(check_vmflag_guard(ptr));
> +}
> +
>  TEST_HARNESS_MAIN
> diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
> index e33cda301dad..605cb58ea5c3 100644
> --- a/tools/testing/selftests/mm/vm_util.c
> +++ b/tools/testing/selftests/mm/vm_util.c
> @@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr)
>         return check_vmflag(addr, "pf");
>  }
>
> +bool check_vmflag_guard(void *addr)
> +{
> +       return check_vmflag(addr, "gu");
> +}
> +
>  bool softdirty_supported(void)
>  {
>         char *addr;
> diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
> index 26c30fdc0241..a8abdf414d46 100644
> --- a/tools/testing/selftests/mm/vm_util.h
> +++ b/tools/testing/selftests/mm/vm_util.h
> @@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
>  unsigned long get_free_hugepages(void);
>  bool check_vmflag_io(void *addr);
>  bool check_vmflag_pfnmap(void *addr);
> +bool check_vmflag_guard(void *addr);
>  int open_procmap(pid_t pid, struct procmap_fd *procmap_out);
>  int query_procmap(struct procmap_fd *procmap);
>  bool find_vma_procmap(struct procmap_fd *procmap, void *address);
> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-29 19:50   ` Randy Dunlap
@ 2025-10-30  8:13     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30  8:13 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, Pedro Falcato,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 12:50:18PM -0700, Randy Dunlap wrote:
>
> On 10/29/25 9:50 AM, Lorenzo Stoakes wrote:
>
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4c3a7e09a159..a2c79ee43d68 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >  	if (src_vma->anon_vma)
> >  		return true;
> >
> > +	/* Guard regions have momdified page tables that require copying. */
>
> 	                      modified

I appear to have entered some kind of typo nirvana in this series, it seems :)

Thanks, will fix!

>
> > +	if (src_vma->vm_flags & VM_MAYBE_GUARD)
> > +		return true;
> > +
> >  	/*
> >  	 * Don't copy ptes where a page fault will fill them correctly.  Fork
> >  	 * becomes much lighter when there are big shared or private readonly
> --
> ~Randy
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30  1:05   ` Suren Baghdasaryan
@ 2025-10-30  8:22     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30  8:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 06:05:54PM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Currently, if a user needs to determine if guard regions are present in a
> > range, they have to scan all VMAs (or have knowledge of which ones might
> > have guard regions).
> >
> > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > operation at a virtual address level.
> >
> > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > that guard regions exist in ranges.
> >
> > This patch remedies the situation by establishing a new VMA flag,
> > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > uncertain because we cannot reasonably determine whether a
> > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > additionally VMAs may change across merge/split).
>
> nit: I know I suck at naming but I think VM_MAY_HAVE_GUARDS would
> better represent the meaning.

We all suck at naming :) it's the hardest bit! :P

Hm, I don't love that; it's a bit overwrought. I do think 'maybe guard' is a
better shorthand for this flag name.

I am open to other suggestions but I think the original wins on
succinctness here!

>
> >
> > We utilise 0x800 for this flag which makes it available to 32-bit
> > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > bee reused yet.
>
> s/bee/been

Yeah this series appears to be a bonanza of typos... not sure why :) will
fix.

> but I'm not even sure the above paragraph has to be included in the
> changelog. It's a technical detail IMHO.

Well, I think it's actually important to highlight that we have a VMA flag
free and why. I know it's bordering on extraneous, but I don't think
there's any harm in mentioning it.

Otherwise people might wonder 'oh is this flag used elsewhere somehow' etc.

>
> >
> > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > lock (and also VMA write lock) whereas previously it did not, but this
> > seems a reasonable overhead.
>
> I guess this is because it is modifying vm_flags now?

Yes
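
Roughly speaking, setting a VMA flag requires the VMA to be write-locked,
which in turn requires the mmap lock held for write. A simplified sketch of
the install path after this change (not the literal code):

	mmap_write_lock(mm);			/* previously a read lock */
	/* ... look up the VMA, validate the range ... */
	vm_flags_set(vma, VM_MAYBE_GUARD);	/* write-locks the VMA internally */
	/* ... install the guard markers ... */
	mmap_write_unlock(mm);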

>
> >
> > We also update the smaps logic and documentation to identify these VMAs.
> >
> > Another major use of this functionality is that we can use it to identify
> > that we ought to copy page tables on fork.
> >
> > For anonymous mappings this is inherent, however since commit f807123d578d
> >  ("mm: allow guard regions in file-backed and read-only mappings") which
> >  allowed file-backed guard regions, we have unfortunately had to enforce
> > this behaviour by settings vma->anon_vma to force page table copying.
> >
> > The existence of this flag removes the need for this, so we simply update
> > vma_needs_copy() to check for this flag instead.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Overall, makes sense to me and I think we could use it.
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

>
> It would be nice to have a way for userspace to reset this flag if it
> confirms that the VMA does not really have any guards (using say
> PAGEMAP_SCAN) but I think such an API can be abused.

Yeah, I'd rather not for that reason.
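
For reference, the 'confirm via PAGEMAP_SCAN' half is already doable from
userspace - a rough sketch below (uapi field names written from memory, so
treat it as illustrative rather than exact). It's only the 'then clear the
VMA flag' half I'd rather not expose:

	#include <linux/fs.h>
	#include <sys/ioctl.h>

	/* pagemap_fd = open("/proc/self/pagemap", O_RDONLY) */
	static int range_has_guards(int pagemap_fd, __u64 start, __u64 end)
	{
		struct page_region region;
		struct pm_scan_arg arg = {
			.size = sizeof(arg),
			.start = start,
			.end = end,
			.vec = (__u64)(unsigned long)&region,
			.vec_len = 1,
			.category_mask = PAGE_IS_GUARD,
			.return_mask = PAGE_IS_GUARD,
		};

		/* > 0 means at least one guard marker found in [start, end). */
		return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
	}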

>
> > ---
> >  Documentation/filesystems/proc.rst |  1 +
> >  fs/proc/task_mmu.c                 |  1 +
> >  include/linux/mm.h                 |  1 +
> >  include/trace/events/mmflags.h     |  1 +
> >  mm/madvise.c                       | 22 ++++++++++++++--------
> >  mm/memory.c                        |  4 ++++
> >  tools/testing/vma/vma_internal.h   |  1 +
> >  7 files changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 0b86a8022fa1..b8a423ca590a 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -591,6 +591,7 @@ encoded manner. The codes are the following:
> >      sl    sealed
> >      lf    lock on fault pages
> >      dp    always lazily freeable mapping
> > +    gu    maybe contains guard regions (if not set, definitely doesn't)
> >      ==    =======================================
> >
> >  Note that there is no guarantee that every flag and associated mnemonic will
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index fc35a0543f01..db16ed91c269 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1146,6 +1146,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> >                 [ilog2(VM_MAYSHARE)]    = "ms",
> >                 [ilog2(VM_GROWSDOWN)]   = "gd",
> >                 [ilog2(VM_PFNMAP)]      = "pf",
> > +               [ilog2(VM_MAYBE_GUARD)] = "gu",
> >                 [ilog2(VM_LOCKED)]      = "lo",
> >                 [ilog2(VM_IO)]          = "io",
> >                 [ilog2(VM_SEQ_READ)]    = "sr",
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index aada935c4950..f963afa1b9de 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -296,6 +296,7 @@ extern unsigned int kobjsize(const void *objp);
> >  #define VM_UFFD_MISSING        0
> >  #endif /* CONFIG_MMU */
> >  #define VM_PFNMAP      0x00000400      /* Page-ranges managed without "struct page", just pure PFN */
> > +#define VM_MAYBE_GUARD 0x00000800      /* The VMA maybe contains guard regions. */
> >  #define VM_UFFD_WP     0x00001000      /* wrprotect pages tracking */
> >
> >  #define VM_LOCKED      0x00002000
> > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > index aa441f593e9a..a6e5a44c9b42 100644
> > --- a/include/trace/events/mmflags.h
> > +++ b/include/trace/events/mmflags.h
> > @@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
> >         {VM_UFFD_MISSING,               "uffd_missing"  },              \
> >  IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,      "uffd_minor"    )               \
> >         {VM_PFNMAP,                     "pfnmap"        },              \
> > +       {VM_MAYBE_GUARD,                "maybe_guard"   },              \
> >         {VM_UFFD_WP,                    "uffd_wp"       },              \
> >         {VM_LOCKED,                     "locked"        },              \
> >         {VM_IO,                         "io"            },              \
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index fb1c86e630b6..216ae6ed344e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1141,15 +1141,22 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
> >                 return -EINVAL;
> >
> >         /*
> > -        * If we install guard markers, then the range is no longer
> > -        * empty from a page table perspective and therefore it's
> > -        * appropriate to have an anon_vma.
> > +        * It would be confusing for anonymous mappings to have page table
> > +        * entries but no anon_vma established, so ensure that it is.
> > +        */
> > +       if (vma_is_anonymous(vma))
> > +               anon_vma_prepare(vma);
> > +
> > +       /*
> > +        * Indicate that the VMA may contain guard regions, making it visible to
> > +        * the user that a VMA may contain these, narrowing down the range which
> > +        * must be scanned in order to detect them.
> >          *
> > -        * This ensures that on fork, we copy page tables correctly.
> > +        * This additionally causes page tables to be copied on fork regardless
> > +        * of whether the VMA is anonymous or not, correctly preserving the
> > +        * guard region page table entries.
> >          */
> > -       err = anon_vma_prepare(vma);
> > -       if (err)
> > -               return err;
> > +       vm_flags_set(vma, VM_MAYBE_GUARD);
> >
> >         /*
> >          * Optimistically try to install the guard marker pages first. If any
> > @@ -1709,7 +1716,6 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
> >         case MADV_POPULATE_READ:
> >         case MADV_POPULATE_WRITE:
> >         case MADV_COLLAPSE:
> > -       case MADV_GUARD_INSTALL:
> >         case MADV_GUARD_REMOVE:
> >                 return MADVISE_MMAP_READ_LOCK;
> >         case MADV_DONTNEED:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4c3a7e09a159..a2c79ee43d68 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1478,6 +1478,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >         if (src_vma->anon_vma)
> >                 return true;
> >
> > +       /* Guard regions have momdified page tables that require copying. */
> > +       if (src_vma->vm_flags & VM_MAYBE_GUARD)
> > +               return true;
> > +
> >         /*
> >          * Don't copy ptes where a page fault will fill them correctly.  Fork
> >          * becomes much lighter when there are big shared or private readonly
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index d873667704e8..e40c93edc5a7 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
> >  #define VM_MAYEXEC     0x00000040
> >  #define VM_GROWSDOWN   0x00000100
> >  #define VM_PFNMAP      0x00000400
> > +#define VM_MAYBE_GUARD 0x00000800
> >  #define VM_LOCKED      0x00002000
> >  #define VM_IO           0x00004000
> >  #define VM_SEQ_READ    0x00008000      /* App will access data sequentially */
> > --
> > 2.51.0
> >

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags
  2025-10-30  4:35   ` Suren Baghdasaryan
@ 2025-10-30  8:25     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30  8:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 09:35:25PM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > It's useful to be able to force a VMA to be copied on fork outside of the
> > parameters specified by vma_needs_copy(), which otherwise only copies page
> > tables if:
> >
> > * The destination VMA has VM_UFFD_WP set
> > * The mapping is a PFN or mixed map
> > * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
> >
> > Setting this flag implies that the page tables mapping the VMA are such
> > that simply re-faulting the VMA will not re-establish them in identical
> > form.
> >
> > We introduce VM_COPY_ON_FORK to clearly identify which flags require this
> > behaviour, which currently is only VM_MAYBE_GUARD.
> >
> > Any VMA flags which require this behaviour are inherently 'sticky', that
> > is, should we merge two VMAs together, this implies that the newly merged
> > VMA maps a range that requires page table copying on fork.
> >
> > In order to implement this we must both introduce the concept of a 'sticky'
> > VMA flag and adjust the VMA merge logic accordingly, and also have VMA
> > merge still successfully succeed should one VMA have the flag set and
> > another not.
>
> "successfully succeed" sounds weird. Just "succeed"?

Yeah... typo bonanza this series :) will fix.

>
> >
> > Note that we update the VMA expand logic to handle new VMA merging, as this
> > function is the one ultimately called by all instances of merging of new
> > VMAs.
> >
> > This patch implements this, establishing VM_STICKY to contain all such
> > flags and VM_IGNORE_MERGE for those flags which should be ignored when
> > comparing adjacent VMA's flags for the purposes of merging.
> >
> > As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it
> > already had this behaviour, alongside VM_STICKY as sticky flags by
> > implication must not disallow merge.
> >
> > We update the VMA userland tests to account for the changes and,
> > furthermore, in order to assert that the functionality is workingly
>
> s/workingly/working

Haha good lord. Will fix also!

>
> > correctly, update the new VMA and existing VMA merging logic to consider
> > every permutation of the flag being set/not set in all VMAs being
> > considered for merge.
> >
> > As a result of this change, VMAs with guard ranges will now not have their
> > merge behaviour impacted by doing so and can be freely merged with other
> > VMAs without VM_MAYBE_GUARD set.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/mm.h               | 32 ++++++++++++
> >  mm/memory.c                      |  3 +-
> >  mm/vma.c                         | 22 ++++----
> >  tools/testing/vma/vma.c          | 89 ++++++++++++++++++++++++++++----
> >  tools/testing/vma/vma_internal.h | 32 ++++++++++++
> >  5 files changed, 156 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f963afa1b9de..a8811ba57150 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -522,6 +522,38 @@ extern unsigned int kobjsize(const void *objp);
> >  #endif
> >  #define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
> >
> > +/* Flags which should result in page tables being copied on fork. */
> > +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
> > +
> > +/*
> > + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
> > + * possesses them but the other does not, the merged VMA should nonetheless have
> > + * them applied to it:
> > + *
> > + * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
> > + *                   metadata which should be unconditionally propagated upon
> > + *                   fork. When merging two VMAs, we encapsulate this range in
> > + *                   the merged VMA, so the flag should be 'sticky' as a result.
>
> It's probably worth noting that after a split, we do not remove
> "sticky" flags even if the VMA acquired them as a result of a previous
> merge.

Hm, I thought this was implied. Will update to make it clear, however!
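
Something along these lines, perhaps (rough wording):

	 * Note that sticky flags are retained across VMA split - a VMA which
	 * acquired a sticky flag as a result of a merge keeps it in both
	 * halves should it later be split.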

>
> > + */
> > +#define VM_STICKY VM_COPY_ON_FORK
> > +
> > +/*
> > + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
> > + * of these flags and the other not does not preclude a merge.
> > + *
> > + * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
> > + *                dirty bit -- the caller should mark merged VMA as dirty. If
> > + *                dirty bit won't be excluded from comparison, we increase
> > + *                pressure on the memory system forcing the kernel to generate
> > + *                new VMAs when old one could be extended instead.
> > + *
> > + *    VM_STICKY - If one VMA has flags which most be 'sticky', that is ones
>
> s/most/must ?

I most learn to not typo so much :)

Yes you're right, will fix! :P

>
> > + *                which should propagate to all VMAs, but the other does not,
> > + *                the merge should still proceed with the merge logic applying
> > + *                sticky flags to the final VMA.
> > + */
> > +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
> > +
> >  /*
> >   * mapping from the currently active vm_flags protection bits (the
> >   * low four bits) to a page protection mask..
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a2c79ee43d68..9528133e5147 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1478,8 +1478,7 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >         if (src_vma->anon_vma)
> >                 return true;
> >
> > -       /* Guard regions have momdified page tables that require copying. */
> > -       if (src_vma->vm_flags & VM_MAYBE_GUARD)
> > +       if (src_vma->vm_flags & VM_COPY_ON_FORK)
> >                 return true;
> >
> >         /*
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 919d1fc63a52..50a6909c4be3 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
> >
> >         if (!mpol_equal(vmg->policy, vma_policy(vma)))
> >                 return false;
> > -       /*
> > -        * VM_SOFTDIRTY should not prevent from VMA merging, if we
> > -        * match the flags but dirty bit -- the caller should mark
> > -        * merged VMA as dirty. If dirty bit won't be excluded from
> > -        * comparison, we increase pressure on the memory system forcing
> > -        * the kernel to generate new VMAs when old one could be
> > -        * extended instead.
> > -        */
> > -       if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
> > +       if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
> >                 return false;
> >         if (vma->vm_file != vmg->file)
> >                 return false;
> > @@ -809,6 +801,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma)
> >  static __must_check struct vm_area_struct *vma_merge_existing_range(
> >                 struct vma_merge_struct *vmg)
> >  {
> > +       vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
> >         struct vm_area_struct *middle = vmg->middle;
> >         struct vm_area_struct *prev = vmg->prev;
> >         struct vm_area_struct *next;
> > @@ -901,11 +894,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> >         if (merge_right) {
> >                 vma_start_write(next);
> >                 vmg->target = next;
> > +               sticky_flags |= (next->vm_flags & VM_STICKY);
> >         }
> >
> >         if (merge_left) {
> >                 vma_start_write(prev);
> >                 vmg->target = prev;
> > +               sticky_flags |= (prev->vm_flags & VM_STICKY);
> >         }
> >
> >         if (merge_both) {
> > @@ -975,6 +970,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
> >         if (err || commit_merge(vmg))
> >                 goto abort;
> >
> > +       vm_flags_set(vmg->target, sticky_flags);
> >         khugepaged_enter_vma(vmg->target, vmg->vm_flags);
> >         vmg->state = VMA_MERGE_SUCCESS;
> >         return vmg->target;
> > @@ -1125,6 +1121,10 @@ int vma_expand(struct vma_merge_struct *vmg)
> >         bool remove_next = false;
> >         struct vm_area_struct *target = vmg->target;
> >         struct vm_area_struct *next = vmg->next;
> > +       vm_flags_t sticky_flags;
> > +
> > +       sticky_flags = vmg->vm_flags & VM_STICKY;
> > +       sticky_flags |= target->vm_flags & VM_STICKY;
> >
> >         VM_WARN_ON_VMG(!target, vmg);
> >
> > @@ -1134,6 +1134,7 @@ int vma_expand(struct vma_merge_struct *vmg)
> >         if (next && (target != next) && (vmg->end == next->vm_end)) {
> >                 int ret;
> >
> > +               sticky_flags |= next->vm_flags & VM_STICKY;
> >                 remove_next = true;
> >                 /* This should already have been checked by this point. */
> >                 VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
> > @@ -1160,6 +1161,7 @@ int vma_expand(struct vma_merge_struct *vmg)
> >         if (commit_merge(vmg))
> >                 goto nomem;
> >
> > +       vm_flags_set(target, sticky_flags);
> >         return 0;
> >
> >  nomem:
> > @@ -1903,7 +1905,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
> >         return a->vm_end == b->vm_start &&
> >                 mpol_equal(vma_policy(a), vma_policy(b)) &&
> >                 a->vm_file == b->vm_file &&
> > -               !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
> > +               !((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
> >                 b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
> >  }
> >
> > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> > index 656e1c75b711..ee9d3547c421 100644
> > --- a/tools/testing/vma/vma.c
> > +++ b/tools/testing/vma/vma.c
>
> I prefer tests in a separate patch, but that might just be me. Feel
> free to ignore.

Yeah, I can split it out! I do tend to do that actually; not sure why I
deviated from it here.

>
> > @@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma;
> >  #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2))
> >  #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
> >
> > +#define IS_SET(_val, _flags) ((_val & _flags) == _flags)
> > +
> >  static struct task_struct __current;
> >
> >  struct task_struct *get_current(void)
> > @@ -441,7 +443,7 @@ static bool test_simple_shrink(void)
> >         return true;
> >  }
> >
> > -static bool test_merge_new(void)
> > +static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky)
> >  {
> >         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> >         struct mm_struct mm = {};
> > @@ -469,23 +471,32 @@ static bool test_merge_new(void)
> >         struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d;
> >         bool merged;
> >
> > +       if (is_sticky)
> > +               vm_flags |= VM_STICKY;
> > +
> >         /*
> >          * 0123456789abc
> >          * AA B       CC
> >          */
> >         vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
> >         ASSERT_NE(vma_a, NULL);
> > +       if (a_is_sticky)
> > +               vm_flags_set(vma_a, VM_STICKY);
> >         /* We give each VMA a single avc so we can test anon_vma duplication. */
> >         INIT_LIST_HEAD(&vma_a->anon_vma_chain);
> >         list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
> >
> >         vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
> >         ASSERT_NE(vma_b, NULL);
> > +       if (b_is_sticky)
> > +               vm_flags_set(vma_b, VM_STICKY);
> >         INIT_LIST_HEAD(&vma_b->anon_vma_chain);
> >         list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
> >
> >         vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
> >         ASSERT_NE(vma_c, NULL);
> > +       if (c_is_sticky)
> > +               vm_flags_set(vma_c, VM_STICKY);
> >         INIT_LIST_HEAD(&vma_c->anon_vma_chain);
> >         list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
> >
> > @@ -520,6 +531,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 3);
> > +       if (is_sticky || a_is_sticky || b_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Merge to PREVIOUS VMA.
> > @@ -537,6 +550,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 3);
> > +       if (is_sticky || a_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Merge to NEXT VMA.
> > @@ -556,6 +571,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 3);
> > +       if (is_sticky) /* D uses is_sticky. */
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Merge BOTH sides.
> > @@ -574,6 +591,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 2);
> > +       if (is_sticky || a_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Merge to NEXT VMA.
> > @@ -592,6 +611,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 2);
> > +       if (is_sticky || c_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Merge BOTH sides.
> > @@ -609,6 +630,8 @@ static bool test_merge_new(void)
> >         ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 1);
> > +       if (is_sticky || a_is_sticky || c_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
> >
> >         /*
> >          * Final state.
> > @@ -637,6 +660,20 @@ static bool test_merge_new(void)
> >         return true;
> >  }
> >
> > +static bool test_merge_new(void)
> > +{
> > +       int i, j, k, l;
> > +
> > +       /* Generate every possible permutation of sticky flags. */
> > +       for (i = 0; i < 2; i++)
> > +               for (j = 0; j < 2; j++)
> > +                       for (k = 0; k < 2; k++)
> > +                               for (l = 0; l < 2; l++)
> > +                                       ASSERT_TRUE(__test_merge_new(i, j, k, l));
> > +
> > +       return true;
> > +}
> > +
> >  static bool test_vma_merge_special_flags(void)
> >  {
> >         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> > @@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void)
> >         return true;
> >  }
> >
> > -static bool test_merge_existing(void)
> > +static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky)
> >  {
> >         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> > +       vm_flags_t prev_flags = vm_flags;
> > +       vm_flags_t next_flags = vm_flags;
> >         struct mm_struct mm = {};
> >         VMA_ITERATOR(vmi, &mm, 0);
> >         struct vm_area_struct *vma, *vma_prev, *vma_next;
> > @@ -988,6 +1027,13 @@ static bool test_merge_existing(void)
> >         };
> >         struct anon_vma_chain avc = {};
> >
> > +       if (prev_is_sticky)
> > +               prev_flags |= VM_STICKY;
> > +       if (middle_is_sticky)
> > +               vm_flags |= VM_STICKY;
> > +       if (next_is_sticky)
> > +               next_flags |= VM_STICKY;
> > +
> >         /*
> >          * Merge right case - partial span.
> >          *
> > @@ -1000,7 +1046,7 @@ static bool test_merge_existing(void)
> >          */
> >         vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
> >         vma->vm_ops = &vm_ops; /* This should have no impact. */
> > -       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
> > +       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
> >         vma_next->vm_ops = &vm_ops; /* This should have no impact. */
> >         vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
> >         vmg.middle = vma;
> > @@ -1018,6 +1064,8 @@ static bool test_merge_existing(void)
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_TRUE(vma_write_started(vma_next));
> >         ASSERT_EQ(mm.map_count, 2);
> > +       if (middle_is_sticky || next_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
> >
> >         /* Clear down and reset. */
> >         ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> > @@ -1033,7 +1081,7 @@ static bool test_merge_existing(void)
> >          *   NNNNNNN
> >          */
> >         vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
> > -       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
> > +       vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
> >         vma_next->vm_ops = &vm_ops; /* This should have no impact. */
> >         vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
> >         vmg.middle = vma;
> > @@ -1046,6 +1094,8 @@ static bool test_merge_existing(void)
> >         ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma_next));
> >         ASSERT_EQ(mm.map_count, 1);
> > +       if (middle_is_sticky || next_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
> >
> >         /* Clear down and reset. We should have deleted vma. */
> >         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> > @@ -1060,7 +1110,7 @@ static bool test_merge_existing(void)
> >          * 0123456789
> >          * PPPPPPV
> >          */
> > -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> > +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
> >         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
> >         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
> >         vma->vm_ops = &vm_ops; /* This should have no impact. */
> > @@ -1080,6 +1130,8 @@ static bool test_merge_existing(void)
> >         ASSERT_TRUE(vma_write_started(vma_prev));
> >         ASSERT_TRUE(vma_write_started(vma));
> >         ASSERT_EQ(mm.map_count, 2);
> > +       if (prev_is_sticky || middle_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
> >
> >         /* Clear down and reset. */
> >         ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> > @@ -1094,7 +1146,7 @@ static bool test_merge_existing(void)
> >          * 0123456789
> >          * PPPPPPP
> >          */
> > -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> > +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
> >         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
> >         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
> >         vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
> > @@ -1109,6 +1161,8 @@ static bool test_merge_existing(void)
> >         ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma_prev));
> >         ASSERT_EQ(mm.map_count, 1);
> > +       if (prev_is_sticky || middle_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
> >
> >         /* Clear down and reset. We should have deleted vma. */
> >         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> > @@ -1123,10 +1177,10 @@ static bool test_merge_existing(void)
> >          * 0123456789
> >          * PPPPPPPPPP
> >          */
> > -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> > +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
> >         vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
> >         vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
> > -       vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
> > +       vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
> >         vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
> >         vmg.prev = vma_prev;
> >         vmg.middle = vma;
> > @@ -1139,6 +1193,8 @@ static bool test_merge_existing(void)
> >         ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
> >         ASSERT_TRUE(vma_write_started(vma_prev));
> >         ASSERT_EQ(mm.map_count, 1);
> > +       if (prev_is_sticky || middle_is_sticky || next_is_sticky)
> > +               ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
> >
> >         /* Clear down and reset. We should have deleted prev and next. */
> >         ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> > @@ -1158,9 +1214,9 @@ static bool test_merge_existing(void)
> >          * PPPVVVVVNNN
> >          */
> >
> > -       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
> > +       vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
> >         vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
> > -       vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
> > +       vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
> >
> >         vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
> >         vmg.prev = vma;
> > @@ -1203,6 +1259,19 @@ static bool test_merge_existing(void)
> >         return true;
> >  }
> >
> > +static bool test_merge_existing(void)
> > +{
> > +       int i, j, k;
> > +
> > +       /* Generate every possible permutation of sticky flags. */
> > +       for (i = 0; i < 2; i++)
> > +               for (j = 0; j < 2; j++)
> > +                       for (k = 0; k < 2; k++)
> > +                               ASSERT_TRUE(__test_merge_existing(i, j, k));
> > +
> > +       return true;
> > +}
> > +
> >  static bool test_anon_vma_non_mergeable(void)
> >  {
> >         vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index e40c93edc5a7..3d9cb3a9411a 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -117,6 +117,38 @@ extern unsigned long dac_mmap_min_addr;
> >  #define VM_SEALED      VM_NONE
> >  #endif
> >
> > +/* Flags which should result in page tables being copied on fork. */
> > +#define VM_COPY_ON_FORK VM_MAYBE_GUARD
> > +
> > +/*
> > + * Flags which should be 'sticky' on merge - that is, flags which, when one VMA
> > + * possesses them but the other does not, should nonetheless be applied to the
> > + * merged VMA:
> > + *
> > + * VM_COPY_ON_FORK - These flags indicate that a VMA maps a range that contains
> > + *                   metadata which should be unconditionally propagated upon
> > + *                   fork. When merging two VMAs, we encapsulate this range in
> > + *                   the merged VMA, so the flag should be 'sticky' as a result.
> > + */
> > +#define VM_STICKY VM_COPY_ON_FORK
> > +
> > +/*
> > + * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
> > + * of these flags and the other not does not preclude a merge.
> > + *
> > + * VM_SOFTDIRTY - Should not prevent VMA merging if the flags match except for
> > + *                the dirty bit -- the caller should mark the merged VMA as dirty.
> > + *                If the dirty bit were not excluded from the comparison, we would
> > + *                increase pressure on the memory system, forcing the kernel to
> > + *                generate new VMAs where old ones could be extended instead.
> > + *
> > + *    VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
> > + *                which should propagate to the merged VMA, but the other does
> > + *                not, the merge should still proceed, with the merge logic
> > + *                applying the sticky flags to the final VMA.
> > + */
> > +#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
> > +
> >  #define FIRST_USER_ADDRESS     0UL
> >  #define USER_PGTABLES_CEILING  0UL
> >
> > --
> > 2.51.0
> >

Thanks for review!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test
  2025-10-30  4:40   ` Suren Baghdasaryan
@ 2025-10-30  8:25     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30  8:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 09:40:19PM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 29, 2025 at 9:51 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Assert that we observe guard regions appearing in /proc/$pid/smaps as
> > expected, and when split/merge is performed too (with expected sticky
> > behaviour).
> >
> > Also add handling for file systems which don't sanely handle mmap() VMA
> > merging so we don't incorrectly encounter a test failure in this situation.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
  2025-10-29 19:50   ` Randy Dunlap
  2025-10-30  1:05   ` Suren Baghdasaryan
@ 2025-10-30 16:16   ` Pedro Falcato
  2025-10-30 16:23     ` Lorenzo Stoakes
  2 siblings, 1 reply; 28+ messages in thread
From: Pedro Falcato @ 2025-10-30 16:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> Currently, if a user needs to determine if guard regions are present in a
> range, they have to scan all VMAs (or have knowledge of which ones might
> have guard regions).
> 
> Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> operation at a virtual address level.
> 
> This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> that guard regions exist in ranges.
> 
> This patch remedies the situation by establishing a new VMA flag,
> VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> uncertain because we cannot reasonably determine whether a
> MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> additionally VMAs may change across merge/split).
> 
> We utilise 0x800 for this flag which makes it available to 32-bit
> architectures also, a flag that was previously used by VM_DENYWRITE, which
> was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> been reused yet.
> 
> The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> lock (and also VMA write lock) whereas previously it did not, but this
> seems a reasonable overhead.

Do you though? Could it be possible to simply atomically set the flag with
the read lock held? This would make it so we can't split the VMA (and tightly
define what "may have a guard page"), but it sounds much better than introducing
lock contention. I don't think it is reasonable to add a write lock to a feature
that may be used by such things as thread stack allocation, malloc, etc.

-- 
Pedro

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 16:16   ` Pedro Falcato
@ 2025-10-30 16:23     ` Lorenzo Stoakes
  2025-10-30 16:31       ` Pedro Falcato
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 16:23 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > Currently, if a user needs to determine if guard regions are present in a
> > range, they have to scan all VMAs (or have knowledge of which ones might
> > have guard regions).
> >
> > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > operation at a virtual address level.
> >
> > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > that guard regions exist in ranges.
> >
> > This patch remedies the situation by establishing a new VMA flag,
> > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > uncertain because we cannot reasonably determine whether a
> > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > additionally VMAs may change across merge/split).
> >
> > We utilise 0x800 for this flag which makes it available to 32-bit
> > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > been reused yet.
> >
> > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > lock (and also VMA write lock) whereas previously it did not, but this
> > seems a reasonable overhead.
>
> Do you though? Could it be possible to simply atomically set the flag with
> the read lock held? This would make it so we can't split the VMA (and tightly

VMA flags are not accessed atomically so no I don't think we can do that in any
workable way.

I also don't think it's at all necessary, see below.

> define what "may have a guard page"), but it sounds much better than introducing
> lock contention. I don't think it is reasonable to add a write lock to a feature
> that may be used by such things as thread stack allocation, malloc, etc.

What lock contention? It's per-VMA so the contention is limited to the VMA in
question, and only over the span of time you are setting the guard region.

When allocating thread stacks you'll be mapping things into memory which... take
the write lock. malloc() if it goes to the kernel will also take the write lock.

So I think you're overly worried about an operation that a. isn't going to be
something that happens all that often, b. when it does, it's at a time when
you'd be taking write locks anyway and c. won't contend important stuff like
page faults for any VMA other than the one having the guard region
installed.

>
> --
> Pedro

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags
  2025-10-29 16:50 ` [PATCH 2/3] mm: implement sticky, copy on fork VMA flags Lorenzo Stoakes
  2025-10-30  4:35   ` Suren Baghdasaryan
@ 2025-10-30 16:25   ` Pedro Falcato
  2025-10-30 16:34     ` Lorenzo Stoakes
  1 sibling, 1 reply; 28+ messages in thread
From: Pedro Falcato @ 2025-10-30 16:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Wed, Oct 29, 2025 at 04:50:32PM +0000, Lorenzo Stoakes wrote:
> It's useful to be able to force a VMA to be copied on fork outside of the
> parameters specified by vma_needs_copy(), which otherwise only copies page
> tables if:
> 
> * The destination VMA has VM_UFFD_WP set
> * The mapping is a PFN or mixed map
> * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
> 
> Setting this flag implies that the page tables mapping the VMA are such
> that simply re-faulting the VMA will not re-establish them in identical
> form.
> 
> We introduce VM_COPY_ON_FORK to clearly identify which flags require this
> behaviour, which currently is only VM_MAYBE_GUARD.

Do we want this to be sticky though? If you're looking for more granularity
with this flag, the best option might be to stop merges from happening there.
If not, I can imagine a VMA that merges with other VMAs far past the original
guarded range, and thus you get no granularity (possibly, not even useful).

If you're _not_ looking for granularity, then maybe using a per-mm flag for
guard ranges or some other solution would be superior?

The rest of the patch (superficially) looks good to me, though.

-- 
Pedro

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 16:23     ` Lorenzo Stoakes
@ 2025-10-30 16:31       ` Pedro Falcato
  2025-10-30 16:43         ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Pedro Falcato @ 2025-10-30 16:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > > Currently, if a user needs to determine if guard regions are present in a
> > > range, they have to scan all VMAs (or have knowledge of which ones might
> > > have guard regions).
> > >
> > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > > operation at a virtual address level.
> > >
> > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > > that guard regions exist in ranges.
> > >
> > > This patch remedies the situation by establishing a new VMA flag,
> > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > > uncertain because we cannot reasonably determine whether a
> > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > > additionally VMAs may change across merge/split).
> > >
> > > We utilise 0x800 for this flag which makes it available to 32-bit
> > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > > been reused yet.
> > >
> > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > > lock (and also VMA write lock) whereas previously it did not, but this
> > > seems a reasonable overhead.
> >
> > Do you though? Could it be possible to simply atomically set the flag with
> > the read lock held? This would make it so we can't split the VMA (and tightly
> 
> VMA flags are not accessed atomically so no I don't think we can do that in any
> workable way.
>

FWIW I think you could work it as an atomic flag and treat those races as benign
(this one, at least).

> I also don't think it's at all necessary, see below.
> 
> > define what "may have a guard page"), but it sounds much better than introducing
> > lock contention. I don't think it is reasonable to add a write lock to a feature
> > that may be used by such things as thread stack allocation, malloc, etc.
> 
> What lock contention? It's per-VMA so the contention is limited to the VMA in
> question, and only over the span of time you are setting the guard region.

Don't we always need to take the mmap write lock when grabbing a VMA write
lock as well?

> When allocating thread stacks you'll be mapping things into memory which... take
> the write lock. malloc() if it goes to the kernel will also take the write lock.
>

But yes, good point, you're already serializing anyway. I don't think this is
a big deal.

> So I think you're overly worried about an operation that a. isn't going to be
> something that happens all that often, b. when it does, it's at a time when
> you'd be taking write locks anyway and c. won't contend important stuff like
> page faults for any VMA other than the one having the guard region
> installed.

Yep, thanks.

-- 
Pedro

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/3] mm: implement sticky, copy on fork VMA flags
  2025-10-30 16:25   ` Pedro Falcato
@ 2025-10-30 16:34     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 16:34 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Thu, Oct 30, 2025 at 04:25:54PM +0000, Pedro Falcato wrote:
> On Wed, Oct 29, 2025 at 04:50:32PM +0000, Lorenzo Stoakes wrote:
> > It's useful to be able to force a VMA to be copied on fork outside of the
> > parameters specified by vma_needs_copy(), which otherwise only copies page
> > tables if:
> >
> > * The destination VMA has VM_UFFD_WP set
> > * The mapping is a PFN or mixed map
> > * The mapping is anonymous and forked in (i.e. vma->anon_vma is non-NULL)
> >
> > Setting this flag implies that the page tables mapping the VMA are such
> > that simply re-faulting the VMA will not re-establish them in identical
> > form.
> >
> > We introduce VM_COPY_ON_FORK to clearly identify which flags require this
> > behaviour, which currently is only VM_MAYBE_GUARD.
>
> Do we want this to be sticky though? If you're looking for more granularity

Yes?

> with this flag, the best option might be to stop merges from happening there.

No?

That'd entirely break VMA merging for any VMA you are installing guard regions
into.

That'd be a regression, as the property of guard regions belongs to the page
tables which can propagate across split/merge.

Also, a key purpose of this flag is to be able to correctly propagate page
tables on fork for file-backed VMAs.

Without this we have to install an anon_vma in file-backed VMAs (what we do
now), which has all the same drawbacks.

> If not, I can imagine a VMA that merges with other VMAs far past the original
> guarded range, and thus you get no granularity (possibly, not even useful).

Err? What? It gets you VMA granularity. You are always going to do better than
'anywhere in the entire mm'. Of course you can imagine scenarios where one VMA
somehow dominates everything, or guard regions are removed etc. but in most
cases you're not going to encounter that.

Also again, the _more important_ purpose here is correct behaviour on fork.

>
> If you're _not_ looking for granularity, then maybe using a per-mm flag for
> guard ranges or some other solution would be superior?

I'm not sure what solution you think would be superior that wouldn't involve
significant overhead in having to look up guard regions on split/merge.

This is a property that belongs to the page tables; we're relating it to VMAs,
which may or may not contain page tables that have this property.

mm granularity would be utterly pointless and leave us with the same anon_vma
hack.

>
> The rest of the patch (superficially) looks good to me, though.

Well there's that at least :)

>
> --
> Pedro

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 16:31       ` Pedro Falcato
@ 2025-10-30 16:43         ` Lorenzo Stoakes
  2025-10-30 18:31           ` Vlastimil Babka
  0 siblings, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 16:43 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, linux-kernel,
	linux-fsdevel, linux-doc, linux-mm, linux-trace-kernel,
	linux-kselftest, Andrei Vagin

On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > > > Currently, if a user needs to determine if guard regions are present in a
> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> > > > have guard regions).
> > > >
> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > > > operation at a virtual address level.
> > > >
> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > > > that guard regions exist in ranges.
> > > >
> > > > This patch remedies the situation by establishing a new VMA flag,
> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > > > uncertain because we cannot reasonably determine whether a
> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > > > additionally VMAs may change across merge/split).
> > > >
> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > > > been reused yet.
> > > >
> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > > > lock (and also VMA write lock) whereas previously it did not, but this
> > > > seems a reasonable overhead.
> > >
> > > Do you though? Could it be possible to simply atomically set the flag with
> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >
> > VMA flags are not accessed atomically so no I don't think we can do that in any
> > workable way.
> >
>
> FWIW I think you could work it as an atomic flag and treat those races as benign
> (this one, at least).

It's not benign as we need to ensure that page tables are correctly propagated
on fork.

>
> > I also don't think it's at all necessary, see below.
> >
> > > define what "may have a guard page"), but it sounds much better than introducing
> > > lock contention. I don't think it is reasonable to add a write lock to a feature
> > > that may be used by such things as thread stack allocation, malloc, etc.
> >
> > What lock contention? It's per-VMA so the contention is limited to the VMA in
> > question, and only over the span of time you are setting the guard region.
>
> Don't we always need to take the mmap write lock when grabbing a VMA write
> lock as well?

Yup. But at the same time as you're doing the kind of operations that'd use
this, you'd already be taking the lock anyway.

You don't hold it for long and you won't be doing this any more often than
you'd be doing other write operations, and you're also not going to be holding
up faults on other VMAs either (other VMAs can still be accessed despite the
mmap write lock being held), so I don't think there's any issue here.

>
> > When allocating thread stacks you'll be mapping things into memory which... take
> > the write lock. malloc() if it goes to the kernel will also take the write lock.
> >
>
> But yes, good point, you're already serializing anyway. I don't think this is
> a big deal.

Indeed

>
> > So I think you're overly worried about an operation that a. isn't going to be
> > something that happens all that often, b. when it does, it's at a time when
> > you'd be taking write locks anyway and c. won't contend important stuff like
> > page faults for any VMA other than the one having the guard region
> > installed.
>
> Yep, thanks.

No problemo, you can get yourself into sticky situations with lock contention
but I think this is not one! :)

>
> --
> Pedro

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 16:43         ` Lorenzo Stoakes
@ 2025-10-30 18:31           ` Vlastimil Babka
  2025-10-30 18:47             ` Vlastimil Babka
  2025-10-30 19:16             ` Lorenzo Stoakes
  0 siblings, 2 replies; 28+ messages in thread
From: Vlastimil Babka @ 2025-10-30 18:31 UTC (permalink / raw)
  To: Lorenzo Stoakes, Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On 10/30/25 17:43, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
>> > > > Currently, if a user needs to determine if guard regions are present in a
>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
>> > > > have guard regions).
>> > > >
>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
>> > > > operation at a virtual address level.
>> > > >
>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
>> > > > that guard regions exist in ranges.
>> > > >
>> > > > This patch remedies the situation by establishing a new VMA flag,
>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
>> > > > uncertain because we cannot reasonably determine whether a
>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
>> > > > additionally VMAs may change across merge/split).
>> > > >
>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
>> > > > been reused yet.
>> > > >
>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
>> > > > lock (and also VMA write lock) whereas previously it did not, but this
>> > > > seems a reasonable overhead.
>> > >
>> > > Do you though? Could it be possible to simply atomically set the flag with
>> > > the read lock held? This would make it so we can't split the VMA (and tightly
>> >
>> > VMA flags are not accessed atomically so no I don't think we can do that in any
>> > workable way.
>> >
>>
>> FWIW I think you could work it as an atomic flag and treat those races as benign
>> (this one, at least).
> 
> It's not benign as we need to ensure that page tables are correctly propagated
> on fork.

Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
setting? I think the places that could race with us to cause RMW use vma
write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
could think of something to deal with this special case...

>>
>> > I also don't think it's at all necessary, see below.
>> >
>> > > define what "may have a guard page"), but it sounds much better than introducing
>> > > lock contention. I don't think it is reasonable to add a write lock to a feature
>> > > that may be used by such things as thread stack allocation, malloc, etc.
>> >
>> > What lock contention? It's per-VMA so the contention is limited to the VMA in
>> > question, and only over the span of time you are setting the guard region.
>>
>> Don't we always need to take the mmap write lock when grabbing a VMA write
>> lock as well?
> 
> Yup. But at the same time as you're doing the kind of operations that'd use
> this, you'd already be taking the lock anyway.
>
> You don't hold it for long and you won't be doing this any more often than
> you'd be doing other write operations, and you're also not going to be holding
> up faults on other VMAs either (other VMAs can still be accessed despite the
> mmap write lock being held), so I don't think there's any issue here.
> 
>>
>> > When allocating thread stacks you'll be mapping things into memory which... take
>> > the write lock. malloc() if it goes to the kernel will also take the write lock.
>> >
>>
>> But yes, good point, you're already serializing anyway. I don't think this is
>> a big deal.
> 
> Indeed

Besides thread stacks, I'm thinking of the userspace allocators usecase
(jemalloc etc) where guard regions were supposed to allow a cheap
use-after-free protection by replacing their existing
MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.
Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard regions
move from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy
MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy a price for that
usecase :(
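
To make that concrete, the free-path pattern I have in mind is roughly the
following - purely illustrative, not any particular allocator's actual code,
and it assumes headers new enough to define MADV_GUARD_INSTALL:

	#include <sys/mman.h>

	/*
	 * On chunk free: turn the chunk into a guard region instead of
	 * MADV_DONTNEED, so any use-after-free faults immediately. The
	 * guard region is removed (MADV_GUARD_REMOVE) when the chunk is
	 * handed out again.
	 */
	static void poison_freed_chunk(void *chunk, size_t size)
	{
		madvise(chunk, size, MADV_GUARD_INSTALL);
	}

That can sit on a fairly hot free path, hence the worry about which lock the
madvise() ends up needing.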

>>
>> > So I think you're overly worried about an operation that a. isn't going to be
>> > something that happens all that often, b. when it does, it's at a time when
>> > you'd be taking write locks anyway and c. won't contend important stuff like
>> > page faults for any VMA other than the one having the guard region
>> > installed.
>>
>> Yep, thanks.
> 
> No problemo, you can get yourself into sticky situations with lock contention
> but I think this is not one! :)
> 
>>
>> --
>> Pedro
> 
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 18:31           ` Vlastimil Babka
@ 2025-10-30 18:47             ` Vlastimil Babka
  2025-10-30 19:47               ` Lorenzo Stoakes
  2025-11-05 19:48               ` Lorenzo Stoakes
  2025-10-30 19:16             ` Lorenzo Stoakes
  1 sibling, 2 replies; 28+ messages in thread
From: Vlastimil Babka @ 2025-10-30 18:47 UTC (permalink / raw)
  To: Lorenzo Stoakes, Pedro Falcato
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On 10/30/25 19:31, Vlastimil Babka wrote:
> On 10/30/25 17:43, Lorenzo Stoakes wrote:
>> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
>>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
>>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
>>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
>>> > > > Currently, if a user needs to determine if guard regions are present in a
>>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
>>> > > > have guard regions).
>>> > > >
>>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
>>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
>>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
>>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
>>> > > > operation at a virtual address level.
>>> > > >
>>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
>>> > > > that guard regions exist in ranges.
>>> > > >
>>> > > > This patch remedies the situation by establishing a new VMA flag,
>>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
>>> > > > uncertain because we cannot reasonably determine whether a
>>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
>>> > > > additionally VMAs may change across merge/split).
>>> > > >
>>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
>>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
>>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
>>> > > > been reused yet.
>>> > > >
>>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
>>> > > > lock (and also VMA write lock) whereas previously it did not, but this
>>> > > > seems a reasonable overhead.
>>> > >
>>> > > Do you though? Could it be possible to simply atomically set the flag with
>>> > > the read lock held? This would make it so we can't split the VMA (and tightly
>>> >
>>> > VMA flags are not accessed atomically so no I don't think we can do that in any
>>> > workable way.
>>> >
>>>
>>> FWIW I think you could work it as an atomic flag and treat those races as benign
>>> (this one, at least).
>> 
>> It's not benign as we need to ensure that page tables are correctly propagated
>> on fork.
> 
> Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> setting? I think the places that could race with us to cause RMW use vma
> write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
> could think of something to deal with this special case...

During discussion with Pedro off-list I realized fork takes mmap lock for
write on the old mm, so if we kept taking mmap sem for read, then vma lock
for read in addition (which should be cheap enough, also we'd only need it
in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
that would cover all non-benign races?
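
Roughly what I'm picturing, as a sketch only - vma_flag_set_atomic() is a
made-up helper, not an existing API:

	/* MADV_GUARD_INSTALL path, mmap lock and VMA lock held for read. */
	static void guard_install_mark_vma(struct vm_area_struct *vma)
	{
		/* The flag is sticky once set - nothing more to do. */
		if (vma->vm_flags & VM_MAYBE_GUARD)
			return;

		/*
		 * Fork takes the mmap lock for write on the old mm, so the
		 * mmap read lock we already hold excludes it; other vm_flags
		 * updaters take the VMA write lock, so the VMA read lock
		 * plus the atomic set should cover those races.
		 */
		vma_flag_set_atomic(vma, VM_MAYBE_GUARD);
	}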



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 18:31           ` Vlastimil Babka
  2025-10-30 18:47             ` Vlastimil Babka
@ 2025-10-30 19:16             ` Lorenzo Stoakes
  2025-10-30 19:37               ` Lorenzo Stoakes
  1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 19:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:
> On 10/30/25 17:43, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >> > > > Currently, if a user needs to determine if guard regions are present in a
> >> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >> > > > have guard regions).
> >> > > >
> >> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >> > > > operation at a virtual address level.
> >> > > >
> >> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >> > > > that guard regions exist in ranges.
> >> > > >
> >> > > > This patch remedies the situation by establishing a new VMA flag,
> >> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >> > > > uncertain because we cannot reasonably determine whether a
> >> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >> > > > additionally VMAs may change across merge/split).
> >> > > >
> >> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >> > > > been reused yet.
> >> > > >
> >> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >> > > > seems a reasonable overhead.
> >> > >
> >> > > Do you though? Could it be possible to simply atomically set the flag with
> >> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >> >
> >> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >> > workable way.
> >> >
> >>
> >> FWIW I think you could work it as an atomic flag and treat those races as benign
> >> (this one, at least).
> >
> > It's not benign as we need to ensure that page tables are correctly propagated
> > on fork.
>
> Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> setting? I think the places that could race with us to cause RMW use vma

I mean, I just spoke about why I didn't think introducing an entirely new
(afaik) one-sided atomic VMA flag write was workable, so maybe deal with that
first before proposing something new...

We probably can use VMA read lock (actually introduced _after_ I introduced
guard regions I believe) if we do not need the write lock.

But I need to hear a compelling, non-hand-wavy reason for introducing an
entirely new mechanism like that, which is going to violate an assumption we
make about VMA flags everywhere...

> write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> the oldmm) and it probably wouldn't make sense to start doing it. Maybe we

I mean, the entire point is fork behaviour, so that makes this a non-starter...

> could think of something to deal with this special case...

If we implemented that I guess we'd have to atomically read the VMA flag,
but a race between the two would be very problematic.

Note that we'd also have to have a new mechanism for writing a VMA flag
(this is entirely novel now) because vm_flags_set() itself _literally_ has
vma_start_write() in it.

We'd also have to audit any and all cases where VMA flags are accessed to
ensure that this new novel method does not cause any issues.

We're also inviting others to do this kind of thing (e.g. drivers).

Seems dangerous in itself.

In any case, since you're now asking for something entirely novel, you
really have to provide a compelling argument as to why we can't write lock.

We can't allow races there or we break forking.

>
> >>
> >> > I also don't think it's at all necessary, see below.
> >> >
> >> > > define what "may have a guard page"), but it sounds much better than introducing
> >> > > lock contention. I don't think it is reasonable to add a write lock to a feature
> >> > > that may be used by such things as thread stack allocation, malloc, etc.
> >> >
> >> > What lock contention? It's per-VMA so the contention is limited to the VMA in
> >> > question, and only over the span of time you are setting the guard region.
> >>
> >> Don't we always need to take the mmap write lock when grabbing a VMA write
> >> lock as well?
> >
> > Yup. But at the same time as you're doing the kind of operations that'd use
> > this, you'd already be taking the lock anyway.
> >
> > You don't hold it for long and you won't be doing this any more often than
> > you'd be doing other write operations, and you're also not going to be holding
> > up faults on other VMAs either (other VMAs can still be accessed despite the
> > mmap write lock being held), so I don't think there's any issue here.
> >
> >>
> >> > When allocating thread stacks you'll be mapping things into memory which... take
> >> > the write lock. malloc() if it goes to the kernel will also take the write lock.
> >> >
> >>
> >> But yes, good point, you're already serializing anyway. I don't think this is
> >> a big deal.
> >
> > Indeed
>
> Besides thread stacks, I'm thinking of the userspace allocators usecase

Thread stacks require the mmap/VMA write lock to establish, no?

> (jemalloc etc) where guard regions were supposed to allow a cheap
> use-after-free protection by replacing their existing
> MADV_DONTNEED/MADV_FREE use with MADV_GUARD_INSTALL.

First I'm hearing of this as far as I can recall... maybe worth mentioning
at the time?

> Now MADV_DONTNEED/MADV_FREE use MADVISE_VMA_READ_LOCK, and guard regions
> move from (worse but still reasonable) MADVISE_MMAP_READ_LOCK to the heavy

I think you're hand waving a lot here - do you have data to back any of this
up?

It'd be good to get some timings with the different locks and some indication
that the contention is meaningful or that this installation is in a hot path.

Are these users not in any way manipulating VMAs (seems odd for a malloc but
anyway)? Because then you're already having to take the write lock to do so.

process_madvise() can be used to hold the lock over the whole operation
more efficiently for multiple guard region installations.
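
i.e. something along these lines (userspace sketch only - it assumes headers
new enough for MADV_GUARD_INSTALL and invokes the syscall directly rather
than relying on a libc wrapper):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * Install guard regions over several ranges with a single syscall,
	 * paying for the locking once rather than per madvise() call. The
	 * pidfd is expected to refer to the calling process here.
	 */
	static long install_guards(int pidfd, const struct iovec *ranges,
				   size_t nr_ranges)
	{
		return syscall(SYS_process_madvise, pidfd, ranges, nr_ranges,
			       MADV_GUARD_INSTALL, 0U);
	}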

> MADVISE_MMAP_WRITE_LOCK. I'm afraid this might be too heavy a price for that
> usecase :(

Based on what? Please provide some data or an argument to back this up. Since
you're essentially asking me to either throw away my series or implement a
risky, novel means of manipulating VMA flags, I think you're going to have to
do better than hand waving.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 19:16             ` Lorenzo Stoakes
@ 2025-10-30 19:37               ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 19:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Thu, Oct 30, 2025 at 07:16:56PM +0000, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:31:26PM +0100, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> > > On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> > >> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> > >> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> > >> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> > >> > > > Currently, if a user needs to determine if guard regions are present in a
> > >> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> > >> > > > have guard regions).
> > >> > > >
> > >> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> > >> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> > >> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> > >> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> > >> > > > operation at a virtual address level.
> > >> > > >
> > >> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> > >> > > > that guard regions exist in ranges.
> > >> > > >
> > >> > > > This patch remedies the situation by establishing a new VMA flag,
> > >> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> > >> > > > uncertain because we cannot reasonably determine whether a
> > >> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> > >> > > > additionally VMAs may change across merge/split).
> > >> > > >
> > >> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> > >> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> > >> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> > >> > > > been reused yet.
> > >> > > >
> > >> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> > >> > > > lock (and also VMA write lock) whereas previously it did not, but this
> > >> > > > seems a reasonable overhead.
> > >> > >
> > >> > > Do you though? Could it be possible to simply atomically set the flag with
> > >> > > the read lock held? This would make it so we can't split the VMA (and tightly
> > >> >
> > >> > VMA flags are not accessed atomically so no I don't think we can do that in any
> > >> > workable way.
> > >> >
> > >>
> > >> FWIW I think you could work it as an atomic flag and treat those races as benign
> > >> (this one, at least).
> > >
> > > It's not benign as we need to ensure that page tables are correctly propagated
> > > on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
>
> I mean, I just spoke about why I didn't think introducing an entirely new
> (afaik) one-sided atomic VMA flag write was workable, so maybe deal with that
> first before proposing something new...

On the other hand, it's going to be difficult to get compelling data either
way, as it will always be workload dependent etc.

So since you and Pedro both bring this up, and it'd be a pity to establish more
stringent locking requirements here, let me look into an atomic write situation.

We'll need to tread carefully here but if we can achieve that it would obviously
be entirely preferable to requiring write lock.

I'll dig into it some :)

BTW I do think a VMA read lock is entirely possible here as-is, so we should try
to shift to that if we can make the atomic VMA flag write work here.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 18:47             ` Vlastimil Babka
@ 2025-10-30 19:47               ` Lorenzo Stoakes
  2025-10-30 21:48                 ` Vlastimil Babka
  2025-11-05 19:48               ` Lorenzo Stoakes
  1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-10-30 19:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> On 10/30/25 19:31, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> >> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >>> > > > Currently, if a user needs to determine if guard regions are present in a
> >>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >>> > > > have guard regions).
> >>> > > >
> >>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >>> > > > operation at a virtual address level.
> >>> > > >
> >>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >>> > > > that guard regions exist in ranges.
> >>> > > >
> >>> > > > This patch remedies the situation by establishing a new VMA flag,
> >>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >>> > > > uncertain because we cannot reasonably determine whether a
> >>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >>> > > > additionally VMAs may change across merge/split).
> >>> > > >
> >>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >>> > > > been reused yet.
> >>> > > >
> >>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >>> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >>> > > > seems a reasonable overhead.
> >>> > >
> >>> > > Do you though? Could it be possible to simply atomically set the flag with
> >>> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >>> >
> >>> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >>> > workable way.
> >>> >
> >>>
> >>> FWIW I think you could work it as an atomic flag and treat those races as benign
> >>> (this one, at least).
> >>
> >> It's not benign as we need to ensure that page tables are correctly propagated
> >> on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> > the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
> > could think of something to deal with this special case...
>
> During discussion with Pedro off-list I realized fork takes mmap lock for
> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> for read in addition (which should be cheap enough, also we'd only need it
> in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
> that would cover all non-benign races?
>
>

We take VMA write lock in dup_mmap() on each mpnt (old VMA).

We take the VMA write lock (vma_start_write()) for each mpnt.

We then vm_area_dup() the mpnt to the new VMA before calling:

copy_page_range()
-> vma_needs_copy()

Which is where the check is done.

So we are holding the VMA write lock, so a VMA read lock should suffice no?

For belts + braces we could atomically read the flag in vma_needs_copy(),
though note it's intended VM_COPY_ON_FORK could have more than one flag.

We could drop that for now and be explicit.
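
To spell the check out (a sketch of the intent only, not the literal diff -
the existing conditions are paraphrased from memory):

	static bool vma_needs_copy(struct vm_area_struct *dst_vma,
				   struct vm_area_struct *src_vma)
	{
		/*
		 * Any 'copy on fork' flag (currently only VM_MAYBE_GUARD)
		 * means re-faulting cannot rebuild the page tables, so copy
		 * them eagerly.
		 */
		if (src_vma->vm_flags & VM_COPY_ON_FORK)
			return true;

		/* Existing conditions. */
		if (userfaultfd_wp(dst_vma))
			return true;

		if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return true;

		if (src_vma->anon_vma)
			return true;

		return false;
	}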

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 19:47               ` Lorenzo Stoakes
@ 2025-10-30 21:48                 ` Vlastimil Babka
  2025-10-31 23:12                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 28+ messages in thread
From: Vlastimil Babka @ 2025-10-30 21:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On 10/30/25 20:47, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
>> >
>> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
>> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
>> > setting? I think the places that could race with us to cause RMW use vma
>> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
>> > the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
>> > could think of something to deal with this special case...
>>
>> During discussion with Pedro off-list I realized fork takes mmap lock for
>> write on the old mm, so if we kept taking mmap sem for read, then vma lock
>> for read in addition (which should be cheap enough, also we'd only need it
>> in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
>> that would cover all non-benign races?
>>
>>
> 
> We take VMA write lock in dup_mmap() on each mpnt (old VMA).

Ah yes I thought it was the new one.

> We take the VMA write lock (vma_start_write()) for each mpnt.
> 
> We then vm_area_dup() the mpnt to the new VMA before calling:
> 
> copy_page_range()
> -> vma_needs_copy()
> 
> Which is where the check is done.
> 
> So we are holding the VMA write lock, so a VMA read lock should suffice no?

Yeah, even better!

> For belts + braces we could atomically read the flag in vma_needs_copy(),
> though note it's intended VM_COPY_ON_FORK could have more than one flag.
> 
> We could drop that for now and be explicit.

Great!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 21:48                 ` Vlastimil Babka
@ 2025-10-31 23:12                   ` Suren Baghdasaryan
  2025-11-03  9:34                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 28+ messages in thread
From: Suren Baghdasaryan @ 2025-10-31 23:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, Pedro Falcato, Andrew Morton, Jonathan Corbet,
	David Hildenbrand, Liam R . Howlett, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Thu, Oct 30, 2025 at 2:48 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/30/25 20:47, Lorenzo Stoakes wrote:
> > On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> >> >
> >> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> >> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> >> > setting? I think the places that could race with us to cause RMW use vma
> >> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> >> > the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
> >> > could think of something to deal with this special case...
> >>
> >> During discussion with Pedro off-list I realized fork takes mmap lock for
> >> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> >> for read in addition (which should be cheap enough, also we'd only need it
> >> in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
> >> that would cover all non-benign races?
> >>
> >>
> >
> > We take VMA write lock in dup_mmap() on each mpnt (old VMA).
>
> Ah yes I thought it was the new one.
>
> > We take the VMA write lock (vma_start_write()) for each mpnt.
> >
> > We then vm_area_dup() the mpnt to the new VMA before calling:
> >
> > copy_page_range()
> > -> vma_needs_copy()
> >
> > Which is where the check is done.
> >
> > So we are holding the VMA write lock, so a VMA read lock should suffice no?
>
> Yeah, even better!
>
> > For belts + braces we could atomically read the flag in vma_needs_copy(),
> > though note it's intended VM_COPY_ON_FORK could have more than one flag.
> >
> > We could drop that for now and be explicit.
>
> Great!

Overall, I think it should be possible to set this flag atomically
under VMA read-lock. However, if you introduce new vm_flags
manipulation functions, please make sure they can't be used for other
vm_flags. In Android I've seen several "interesting" attempts to
update vm_flags under a read-lock (specifically in the page-fault
path) and had to explain why that's a bad idea.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-31 23:12                   ` Suren Baghdasaryan
@ 2025-11-03  9:34                     ` Lorenzo Stoakes
  0 siblings, 0 replies; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-11-03  9:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Pedro Falcato, Andrew Morton, Jonathan Corbet,
	David Hildenbrand, Liam R . Howlett, Mike Rapoport, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Fri, Oct 31, 2025 at 04:12:51PM -0700, Suren Baghdasaryan wrote:
>
> Overall, I think it should be possible to set this flag atomically
> under VMA read-lock. However, if you introduce new vm_flags
> manipulation functions, please make sure they can't be used for other
> vm_flags. In Android I've seen several "interesting" attempts to
> update vm_flags under a read-lock (specifically in the page-fault
> path) and had to explain why that's a bad idea.

Yeah agreed, so the idea would be to absolutely ring-fence any flag we do
this with entirely. Probably a VM_WARN_ON_ONCE() for anybody trying it with
other flags so bots can catch anybody being naughty.
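
Something like this perhaps - an untested sketch, helper/macro names entirely
made up, purely to illustrate the ring-fencing:

/* The only flag we permit to be set atomically under a read lock. */
#define VM_ATOMIC_SET_MASK	VM_MAYBE_GUARD

static inline void vm_flags_set_atomic(struct vm_area_struct *vma,
					vm_flags_t flags)
{
	/* Bots will catch anybody being naughty with other flags. */
	VM_WARN_ON_ONCE(flags & ~VM_ATOMIC_SET_MASK);
	/* Caller must hold at least the VMA read lock. */
	vma_assert_locked(vma);
	/* Only single-bit flags are permitted, so ilog2() is fine here. */
	set_bit(ilog2(flags), &ACCESS_PRIVATE(vma, __vm_flags));
}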

That kind of 'interesting' stuff is another reason I prefer to really limit
what drivers can do btw ;)

Will have a respin with this idea applied relatively soon hopefully.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-10-30 18:47             ` Vlastimil Babka
  2025-10-30 19:47               ` Lorenzo Stoakes
@ 2025-11-05 19:48               ` Lorenzo Stoakes
  2025-11-06  7:34                 ` Vlastimil Babka
  1 sibling, 1 reply; 28+ messages in thread
From: Lorenzo Stoakes @ 2025-11-05 19:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
> On 10/30/25 19:31, Vlastimil Babka wrote:
> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
> >> On Thu, Oct 30, 2025 at 04:31:56PM +0000, Pedro Falcato wrote:
> >>> On Thu, Oct 30, 2025 at 04:23:58PM +0000, Lorenzo Stoakes wrote:
> >>> > On Thu, Oct 30, 2025 at 04:16:20PM +0000, Pedro Falcato wrote:
> >>> > > On Wed, Oct 29, 2025 at 04:50:31PM +0000, Lorenzo Stoakes wrote:
> >>> > > > Currently, if a user needs to determine if guard regions are present in a
> >>> > > > range, they have to scan all VMAs (or have knowledge of which ones might
> >>> > > > have guard regions).
> >>> > > >
> >>> > > > Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
> >>> > > > pagemap") and the related commit a516403787e0 ("fs/proc: extend the
> >>> > > > PAGEMAP_SCAN ioctl to report guard regions"), users can use either
> >>> > > > /proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
> >>> > > > operation at a virtual address level.
> >>> > > >
> >>> > > > This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
> >>> > > > that guard regions exist in ranges.
> >>> > > >
> >>> > > > This patch remedies the situation by establishing a new VMA flag,
> >>> > > > VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
> >>> > > > uncertain because we cannot reasonably determine whether a
> >>> > > > MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
> >>> > > > additionally VMAs may change across merge/split).
> >>> > > >
> >>> > > > We utilise 0x800 for this flag which makes it available to 32-bit
> >>> > > > architectures also, a flag that was previously used by VM_DENYWRITE, which
> >>> > > > was removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't
> >>> > > > been reused yet.
> >>> > > >
> >>> > > > The MADV_GUARD_INSTALL madvise() operation now must take an mmap write
> >>> > > > lock (and also VMA write lock) whereas previously it did not, but this
> >>> > > > seems a reasonable overhead.
> >>> > >
> >>> > > Do you though? Could it be possible to simply atomically set the flag with
> >>> > > the read lock held? This would make it so we can't split the VMA (and tightly
> >>> >
> >>> > VMA flags are not accessed atomically so no I don't think we can do that in any
> >>> > workable way.
> >>> >
> >>>
> >>> FWIW I think you could work it as an atomic flag and treat those races as benign
> >>> (this one, at least).
> >>
> >> It's not benign as we need to ensure that page tables are correctly propagated
> >> on fork.
> >
> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
> > setting? I think the places that could race with us to cause RMW use vma
> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
> > the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
> > could think of something to deal with this special case...
>
> During discussion with Pedro off-list I realized fork takes mmap lock for
> write on the old mm, so if we kept taking mmap sem for read, then vma lock
> for read in addition (which should be cheap enough, also we'd only need it
> in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
> that would cover all non-benign races?
>
>

So thinking about this again, taking mmap read lock will exclude any VMA write
locks (which must hold mmap write lock), so no need to additionally take VMA
read lock.

Also as mentioned later in thread, the invocation of vma_needs_copy() is
performed under VMA write lock (and thus the mmap write lock) so no need to read
atomically there either.

As per Pedro, we can treat other races as benign.

Merge/Split etc. are done under VMA write lock so there's no race that matters
there.

The only other place we even care about this flag is /proc/$pid/smaps, but
there it'd be benign - you'd simply not observe the flag being set at the
point at which a concurrent guard region install is happening - something
that you could race with _anyway_.

As Pedro says, the remaining cases where the flags are read would be benign,
as readers should never observe a meaningful torn read (the flags being a
bitmap), and this flag only matters on fork and smaps.

So I think just having something that sets atomically with an allow list is
fine, but making that very strict so literally just this flag is allowed
(currently).
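
Concretely the MADV_GUARD_INSTALL side would then reduce to something like the
below (sketch only - vm_flags_set_atomic() is a made-up name for the strictly
allow-listed helper):

	/* Under mmap read lock; the flag only ever transitions 0 -> 1. */
	if (!(vma->vm_flags & VM_MAYBE_GUARD))
		vm_flags_set_atomic(vma, VM_MAYBE_GUARD);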

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions
  2025-11-05 19:48               ` Lorenzo Stoakes
@ 2025-11-06  7:34                 ` Vlastimil Babka
  0 siblings, 0 replies; 28+ messages in thread
From: Vlastimil Babka @ 2025-11-06  7:34 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pedro Falcato, Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	linux-kernel, linux-fsdevel, linux-doc, linux-mm,
	linux-trace-kernel, linux-kselftest, Andrei Vagin, Barry Song

On 11/5/25 20:48, Lorenzo Stoakes wrote:
> On Thu, Oct 30, 2025 at 07:47:34PM +0100, Vlastimil Babka wrote:
>> On 10/30/25 19:31, Vlastimil Babka wrote:
>> > On 10/30/25 17:43, Lorenzo Stoakes wrote:
>> >
>> > Could we use MADVISE_VMA_READ_LOCK mode (would be actually an improvement
>> > over the current MADVISE_MMAP_READ_LOCK), together with the atomic flag
>> > setting? I think the places that could race with us to cause RMW use vma
>> > write lock so that would be excluded. Fork AFAICS unfortunately doesn't (for
>> > the oldmm) and it probably wouldn't make sense to start doing it. Maybe we
>> > could think of something to deal with this special case...
>>
>> During discussion with Pedro off-list I realized fork takes mmap lock for
>> write on the old mm, so if we kept taking mmap sem for read, then vma lock
>> for read in addition (which should be cheap enough, also we'd only need it
>> in case VM_MAYBE_GUARD is not yet set), and set the flag atomically, perhaps
>> that would cover all non-benign races?
>>
>>
> 
> So thinking about this again, taking mmap read lock will exclude any VMA write
> locks (which must hold mmap write lock), so no need to additionally take VMA
> read lock.

Right. Of course it would be nice if we could later also replace the mmap
read lock with the vma read lock. That should hopefully be feasible because
MADV_DONTNEED can do it and guard ranges perform a similar kind of operation
(with more complex page table handling necessary, IIRC, but that should
hopefully still be fine with the vma read lock).

It's obviously not in scope of the series here, just something to keep in mind
so that this next step stays compatible with anything done here.

> Also as mentioned later in thread, the invocation of vma_needs_copy() is
> performed under VMA write lock (and thus the mmap write lock) so no need to read
> atomically there either.
> 
> As per Pedro, we can treat other races as benign.
> 
> Merge/Split etc. are done under VMA write lock so there's no race that matters
> there.
> 
> The only other place we even care about this flag is /proc/$pid/smaps, but
> there it'd be benign - you'd simply not observe the flag being set at the
> point at which a concurrent guard region install is happening - something
> that you could race with _anyway_.

Right.

> As Pedro says, the remaining cases where the flags are read would be benign,
> as readers should never observe a meaningful torn read (the flags being a
> bitmap), and this flag only matters on fork and smaps.
> 
> So I think just having something that sets atomically with an allow list is
> fine, but making that very strict so literally just this flag is allowed
> (currently).
Ack.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-11-06  7:34 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-29 16:50 [PATCH 0/3] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
2025-10-29 16:50 ` [PATCH 1/3] mm: introduce VM_MAYBE_GUARD and make visible for guard regions Lorenzo Stoakes
2025-10-29 19:50   ` Randy Dunlap
2025-10-30  8:13     ` Lorenzo Stoakes
2025-10-30  1:05   ` Suren Baghdasaryan
2025-10-30  8:22     ` Lorenzo Stoakes
2025-10-30 16:16   ` Pedro Falcato
2025-10-30 16:23     ` Lorenzo Stoakes
2025-10-30 16:31       ` Pedro Falcato
2025-10-30 16:43         ` Lorenzo Stoakes
2025-10-30 18:31           ` Vlastimil Babka
2025-10-30 18:47             ` Vlastimil Babka
2025-10-30 19:47               ` Lorenzo Stoakes
2025-10-30 21:48                 ` Vlastimil Babka
2025-10-31 23:12                   ` Suren Baghdasaryan
2025-11-03  9:34                     ` Lorenzo Stoakes
2025-11-05 19:48               ` Lorenzo Stoakes
2025-11-06  7:34                 ` Vlastimil Babka
2025-10-30 19:16             ` Lorenzo Stoakes
2025-10-30 19:37               ` Lorenzo Stoakes
2025-10-29 16:50 ` [PATCH 2/3] mm: implement sticky, copy on fork VMA flags Lorenzo Stoakes
2025-10-30  4:35   ` Suren Baghdasaryan
2025-10-30  8:25     ` Lorenzo Stoakes
2025-10-30 16:25   ` Pedro Falcato
2025-10-30 16:34     ` Lorenzo Stoakes
2025-10-29 16:50 ` [PATCH 3/3] selftests/mm/guard-regions: add smaps visibility test Lorenzo Stoakes
2025-10-30  4:40   ` Suren Baghdasaryan
2025-10-30  8:25     ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).