linux-mm.kvack.org archive mirror
* [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky
@ 2025-11-07 16:11 Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 1/8] mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps Lorenzo Stoakes
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.

This makes the feature less useful, as it isn't apparent which VMAs may have
these entries present, especially when performing operations which walk
through memory regions, such as those performed by CRIU.

This series addresses this issue by introducing the VM_MAYBE_GUARD flag,
which fulfils this role, and updating the smaps logic to display an entry
for VMAs which have it set.

The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.

It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.

To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is flags which:

a. If set in one VMA and not in another, still permit those VMAs to be
   merged (if otherwise compatible).

b. When the VMAs are merged, cause the resultant VMA to have the flag set.

The VMA logic is updated to propagate these flags correctly.
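
To illustrate the intended semantics, a simplified sketch (the real logic
lives in mm/vma.c and also accounts for VM_SOFTDIRTY and the other merge
criteria):

  /* Sketch: a, b are the vm_flags of two otherwise-compatible VMAs. */
  static bool flags_permit_merge(vm_flags_t a, vm_flags_t b)
  {
          /* (a) A differing sticky flag does not block the merge. */
          return !((a ^ b) & ~VM_STICKY);
  }

  static vm_flags_t merged_vm_flags(vm_flags_t a, vm_flags_t b)
  {
          /* (b) The merged VMA retains any sticky flag either VMA had. */
          return a | (b & VM_STICKY);
  }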

Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously, guard region
installation established an anon_vma object for file-backed mappings solely
so that vma_needs_copy() would correctly propagate guard region mappings to
child processes.

We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves this
issue.

Additionally, we add the ability for allow-listed VMA flags to be
atomically writable with only mmap/VMA read locks held.

The only flag allowed so far is VM_MAYBE_GUARD, for which we carefully
ensure that permitting atomic updates does not introduce any races.

This allows us to maintain guard region installation as a read-locked
operation and not endure the overhead of obtaining a write lock here.

Finally, we introduce extensive VMA userland tests to assert that the sticky
VMA logic behaves correctly, as well as guard region self-tests to assert
that smaps visibility is correctly implemented.


v3:
* Propagated tags thanks Vlastimil & Pedro! :)
* Fixed doc nit as per Pedro.
* Added vma_flag_test_atomic() in preparation for fixing
  retract_page_tables() (see below). We make this not require any locks, as
  we serialise on the page table lock in retract_page_tables().
* Split the atomic flag enablement and actually setting the flag for guard
  install into two separate commits so we clearly separate the various VMA
  flag implementation details and us enabling this feature.
* Mentioned setting anon_vma for anonymous mappings in commit message as
  per Vlastimil.
* Fixed an issue with retract_page_tables() whereby madvise(...,
  MADV_COLLAPSE) relies upon file-backed VMAs not being collapsed due to
  the UFFD WP VMA flag being set or the VMA having vma->anon_vma set
  (i.e. being a MAP_PRIVATE file-backed VMA). This was updated to also
  check for VM_MAYBE_GUARD.
* Introduced MADV_COLLAPSE self test to assert that the behaviour is
  correct. I first reproduced the issue locally and then adapted the test
  to assert that this no longer occurs.
* Mentioned KCSAN permissiveness in commit message as per Pedro.
* Mentioned mmap/VMA read lock excluding mmap/VMA write lock and thus
  avoiding meaningful RMW races in commit message as per Vlastimil.
* Mentioned previous unconditional vma->anon_vma installation on guard
  region installation as per Vlastimil.
* Avoided having merging compromised by reordering patches such that the
  sticky VMA functionality is implemented prior to VM_MAYBE_GUARD being
  utilised upon guard region installation, rendering Vlastimil's request to
  mention this in a commit message unnecessary.
* Separated out sticky and copy on fork patches as per Pedro.
* Added VM_PFNMAP, VM_MIXEDMAP, VM_UFFD_WP to VM_COPY_ON_FORK to make
  things more consistent and clean.
* Added mention of why generally VM_STICKY should be VM_COPY_ON_FORK in
  copy on fork patch.

v2:
* Separated out userland VMA tests for sticky behaviour as per Suren.
* Added the concept of atomic writable VMA flags as per Pedro and Vlastimil.
* Made VM_MAYBE_GUARD an atomic writable flag so we don't have to take a VMA
  write lock in madvise() as per Pedro and Vlastimil.
https://lore.kernel.org/all/cover.1762422915.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/cover.1761756437.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (8):
  mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
  mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  mm: implement sticky VMA flags
  mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one
  mm: set the VM_MAYBE_GUARD flag on guard region install
  tools/testing/vma: add VMA sticky userland tests
  tools/testing/selftests/mm: add MADV_COLLAPSE test case
  tools/testing/selftests/mm: add smaps visibility guard region test

 Documentation/filesystems/proc.rst         |   5 +-
 fs/proc/task_mmu.c                         |   1 +
 include/linux/mm.h                         | 102 ++++++++++++
 include/trace/events/mmflags.h             |   1 +
 mm/khugepaged.c                            |  72 +++++---
 mm/madvise.c                               |  22 ++-
 mm/memory.c                                |  14 +-
 mm/vma.c                                   |  22 +--
 tools/testing/selftests/mm/guard-regions.c | 185 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 tools/testing/vma/vma.c                    |  89 ++++++++--
 tools/testing/vma/vma_internal.h           |  56 +++++++
 13 files changed, 511 insertions(+), 64 deletions(-)

--
2.51.0



* [PATCH v3 1/8] mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).

Since commit 8e2f2aeb8b48 ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787e0 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.

This is not ideal, and it gives no visibility at the /proc/$pid/smaps level
that guard regions exist in a given range.

This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).

We utilise 0x800 for this flag, which makes it available to 32-bit
architectures also. This bit was previously used by VM_DENYWRITE, which was
removed in commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") and hasn't been
reused since.

We also update the smaps logic and documentation to identify these VMAs.
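
For example (an illustrative sketch only - it assumes userspace headers
which define MADV_GUARD_INSTALL, and the exact VmFlags output will vary):

  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          char *ptr = mmap(NULL, 3 * psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          /* Install a guard region in the middle page. */
          madvise(ptr + psz, psz, MADV_GUARD_INSTALL);

          /*
           * The smaps entry for this VMA now includes "gu" in its
           * VmFlags field, e.g.:
           *
           *     VmFlags: rd wr mr mw me ac gu
           */
          printf("inspect /proc/%d/smaps\n", getpid());
          pause(); /* keep the mapping alive while smaps is read */
  }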

Another major use of this functionality is identifying that we ought to copy
page tables on fork.

We do not actually implement usage of this flag in mm/madvise.c yet as we
need to allow some VMA flags to be applied atomically under mmap/VMA read
lock in order to avoid the need to acquire a write lock for this purpose.

Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 Documentation/filesystems/proc.rst | 5 +++--
 fs/proc/task_mmu.c                 | 1 +
 include/linux/mm.h                 | 3 +++
 include/trace/events/mmflags.h     | 1 +
 mm/memory.c                        | 4 ++++
 tools/testing/vma/vma_internal.h   | 1 +
 6 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 0b86a8022fa1..8256e857e2d7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -553,7 +553,7 @@ otherwise.
 kernel flags associated with the particular virtual memory area in two letter
 encoded manner. The codes are the following:
 
-    ==    =======================================
+    ==    =============================================================
     rd    readable
     wr    writeable
     ex    executable
@@ -591,7 +591,8 @@ encoded manner. The codes are the following:
     sl    sealed
     lf    lock on fault pages
     dp    always lazily freeable mapping
-    ==    =======================================
+    gu    maybe contains guard regions (if not set, definitely doesn't)
+    ==    =============================================================
 
 Note that there is no guarantee that every flag and associated mnemonic will
 be present in all further kernel releases. Things get changed, the flags may
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a9894aefbca..a420dcf9ffbb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1147,6 +1147,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
+		[ilog2(VM_MAYBE_GUARD)]	= "gu",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e5ca5287e21..2a5516bff75a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -271,6 +271,8 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif
 
+#define VM_MAYBE_GUARD_BIT 11
+
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
  * When changing, update also include/trace/events/mmflags.h
@@ -296,6 +298,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_MAYBE_GUARD	BIT(VM_MAYBE_GUARD_BIT)	/* The VMA maybe contains guard regions. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index aa441f593e9a..a6e5a44c9b42 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -213,6 +213,7 @@ IF_HAVE_PG_ARCH_3(arch_3)
 	{VM_UFFD_MISSING,		"uffd_missing"	},		\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_PFNMAP,			"pfnmap"	},		\
+	{VM_MAYBE_GUARD,		"maybe_guard"	},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
diff --git a/mm/memory.c b/mm/memory.c
index 046579a6ec2f..334732ab6733 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1480,6 +1480,10 @@ vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	if (src_vma->anon_vma)
 		return true;
 
+	/* Guard regions have modified page tables that require copying. */
+	if (src_vma->vm_flags & VM_MAYBE_GUARD)
+		return true;
+
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.  Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index c68d382dac81..46acb4df45de 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -56,6 +56,7 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_MAYEXEC	0x00000040
 #define VM_GROWSDOWN	0x00000100
 #define VM_PFNMAP	0x00000400
+#define VM_MAYBE_GUARD	0x00000800
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
-- 
2.51.0




* [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 1/8] mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-10 15:51   ` Vlastimil Babka
                     ` (2 more replies)
  2025-11-07 16:11 ` [PATCH v3 3/8] mm: implement sticky VMA flags Lorenzo Stoakes
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

This patch adds the ability to atomically set VMA flags with only the mmap
read/VMA read lock held.

As this could be hugely problematic for VMA flags in general given that all
other accesses are non-atomic and serialised by the mmap/VMA locks, we
implement this with a strict allow-list - that is, only designated flags
are allowed to do this.

We make VM_MAYBE_GUARD one of these flags.
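
A sketch of the intended usage and the allow-list guarantee (illustrative
only, using the definitions introduced below):

  /* With only the mmap/VMA read lock held: */
  vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);    /* allow-listed - OK */

  /*
   * A bit outside VM_ATOMIC_SET_ALLOWED trips the WARN_ON_ONCE() and the
   * flag is left unset:
   */
  vma_flag_set_atomic(vma, ilog2(VM_LOCKED));      /* WARNs, no-op */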

Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a5516bff75a..699566c21ff7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask represents all the VMA flag bits used by mlock */
 #define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
 
+/* These flags can be updated atomically via VMA/mmap read lock. */
+#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
+
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
 # define VM_ARCH_CLEAR	VM_NONE
@@ -860,6 +863,47 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
 	__vm_flags_mod(vma, set, clear);
 }
 
+static inline bool __vma_flag_atomic_valid(struct vm_area_struct *vma,
+				       int bit)
+{
+	const vm_flags_t mask = BIT(bit);
+
+	/* Only specific flags are permitted */
+	if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
+		return false;
+
+	return true;
+}
+
+/*
+ * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
+ * valid flags are allowed to do this.
+ */
+static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
+{
+	/* mmap read lock/VMA read lock must be held. */
+	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
+		vma_assert_locked(vma);
+
+	if (__vma_flag_atomic_valid(vma, bit))
+		set_bit(bit, &vma->__vm_flags);
+}
+
+/*
+ * Test for VMA flag atomically. Requires no locks. Only specific valid flags
+ * are allowed to do this.
+ *
+ * This is necessarily racy, so callers must ensure that serialisation is
+ * achieved through some other means, or that races are permissible.
+ */
+static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
+{
+	if (__vma_flag_atomic_valid(vma, bit))
+		return test_bit(bit, &vma->__vm_flags);
+
+	return false;
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
 	vma->vm_ops = NULL;
-- 
2.51.0




* [PATCH v3 3/8] mm: implement sticky VMA flags
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 1/8] mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 4/8] mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one Lorenzo Stoakes
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

It is useful to be able to designate that certain flags are 'sticky', that
is, if two VMAs are merged one with a flag of this nature and one without,
the merged VMA sets this flag.

As a result we ignore these flags for the purposes of determining VMA flag
differences between VMAs being considered for merge.

This patch therefore updates the VMA merge logic to perform this action,
with flags possessing this property being described in the VM_STICKY
bitmap.

Those flags which ought to be ignored for the purposes of VMA merge are
described in the VM_IGNORE_MERGE bitmap, which the VMA merge logic is also
updated to use.

As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE, as it
already had this behaviour, alongside VM_STICKY, since sticky flags by
implication must not disallow merging.

Ultimately it seems that we should make VM_SOFTDIRTY a sticky flag in its
own right, but this change is out of scope for this series.

The only flag designated sticky is VM_MAYBE_GUARD, so as a result of this
change, once the flag is set upon guard region installation, VMAs containing
guard regions will no longer have their merge behaviour impacted and can be
freely merged with other VMAs which do not have VM_MAYBE_GUARD set.

We also update the VMA userland tests to account for the changes.

Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h               | 29 +++++++++++++++++++++++++++++
 mm/vma.c                         | 22 ++++++++++++----------
 tools/testing/vma/vma_internal.h | 29 +++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 699566c21ff7..6c1c459e9acb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -527,6 +527,35 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 #define VM_FLAGS_CLEAR	(ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
 
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one
+ * VMA possesses them but the other does not, should nonetheless be applied
+ * to the merged VMA:
+ *
+ * VM_MAYBE_GUARD - If a VMA may have guard regions in place it implies that
+ *                  mapped page tables may contain metadata not described by the
+ *                  VMA and thus any merged VMA may also contain this metadata,
+ *                  and thus we must make this flag sticky.
+ */
+#define VM_STICKY VM_MAYBE_GUARD
+
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
+ *
+ *    VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
+ *                which should propagate to all VMAs, but the other does not,
+ *                the merge should still proceed with the merge logic applying
+ *                sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/mm/vma.c b/mm/vma.c
index 0c5e391fe2e2..6cb082bc5e29 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -89,15 +89,7 @@ static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_nex
 
 	if (!mpol_equal(vmg->policy, vma_policy(vma)))
 		return false;
-	/*
-	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
-	 * match the flags but dirty bit -- the caller should mark
-	 * merged VMA as dirty. If dirty bit won't be excluded from
-	 * comparison, we increase pressure on the memory system forcing
-	 * the kernel to generate new VMAs when old one could be
-	 * extended instead.
-	 */
-	if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_SOFTDIRTY)
+	if ((vma->vm_flags ^ vmg->vm_flags) & ~VM_IGNORE_MERGE)
 		return false;
 	if (vma->vm_file != vmg->file)
 		return false;
@@ -808,6 +800,7 @@ static bool can_merge_remove_vma(struct vm_area_struct *vma)
 static __must_check struct vm_area_struct *vma_merge_existing_range(
 		struct vma_merge_struct *vmg)
 {
+	vm_flags_t sticky_flags = vmg->vm_flags & VM_STICKY;
 	struct vm_area_struct *middle = vmg->middle;
 	struct vm_area_struct *prev = vmg->prev;
 	struct vm_area_struct *next;
@@ -900,11 +893,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (merge_right) {
 		vma_start_write(next);
 		vmg->target = next;
+		sticky_flags |= (next->vm_flags & VM_STICKY);
 	}
 
 	if (merge_left) {
 		vma_start_write(prev);
 		vmg->target = prev;
+		sticky_flags |= (prev->vm_flags & VM_STICKY);
 	}
 
 	if (merge_both) {
@@ -974,6 +969,7 @@ static __must_check struct vm_area_struct *vma_merge_existing_range(
 	if (err || commit_merge(vmg))
 		goto abort;
 
+	vm_flags_set(vmg->target, sticky_flags);
 	khugepaged_enter_vma(vmg->target, vmg->vm_flags);
 	vmg->state = VMA_MERGE_SUCCESS;
 	return vmg->target;
@@ -1124,6 +1120,10 @@ int vma_expand(struct vma_merge_struct *vmg)
 	bool remove_next = false;
 	struct vm_area_struct *target = vmg->target;
 	struct vm_area_struct *next = vmg->next;
+	vm_flags_t sticky_flags;
+
+	sticky_flags = vmg->vm_flags & VM_STICKY;
+	sticky_flags |= target->vm_flags & VM_STICKY;
 
 	VM_WARN_ON_VMG(!target, vmg);
 
@@ -1133,6 +1133,7 @@ int vma_expand(struct vma_merge_struct *vmg)
 	if (next && (target != next) && (vmg->end == next->vm_end)) {
 		int ret;
 
+		sticky_flags |= next->vm_flags & VM_STICKY;
 		remove_next = true;
 		/* This should already have been checked by this point. */
 		VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
@@ -1159,6 +1160,7 @@ int vma_expand(struct vma_merge_struct *vmg)
 	if (commit_merge(vmg))
 		goto nomem;
 
+	vm_flags_set(target, sticky_flags);
 	return 0;
 
 nomem:
@@ -1902,7 +1904,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
 	return a->vm_end == b->vm_start &&
 		mpol_equal(vma_policy(a), vma_policy(b)) &&
 		a->vm_file == b->vm_file &&
-		!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_SOFTDIRTY)) &&
+		!((a->vm_flags ^ b->vm_flags) & ~(VM_ACCESS_FLAGS | VM_IGNORE_MERGE)) &&
 		b->vm_pgoff == a->vm_pgoff + ((b->vm_start - a->vm_start) >> PAGE_SHIFT);
 }
 
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 46acb4df45de..a54990aa3009 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -117,6 +117,35 @@ extern unsigned long dac_mmap_min_addr;
 #define VM_SEALED	VM_NONE
 #endif
 
+/*
+ * Flags which should be 'sticky' on merge - that is, flags which, when one
+ * VMA possesses them but the other does not, should nonetheless be applied
+ * to the merged VMA:
+ *
+ * VM_MAYBE_GUARD - If a VMA may have guard regions in place it implies that
+ *                  mapped page tables may contain metadata not described by the
+ *                  VMA and thus any merged VMA may also contain this metadata,
+ *                  and thus we must make this flag sticky.
+ */
+#define VM_STICKY VM_MAYBE_GUARD
+
+/*
+ * VMA flags we ignore for the purposes of merge, i.e. one VMA possessing one
+ * of these flags and the other not does not preclude a merge.
+ *
+ * VM_SOFTDIRTY - Should not prevent from VMA merging, if we match the flags but
+ *                dirty bit -- the caller should mark merged VMA as dirty. If
+ *                dirty bit won't be excluded from comparison, we increase
+ *                pressure on the memory system forcing the kernel to generate
+ *                new VMAs when old one could be extended instead.
+ *
+ *    VM_STICKY - If one VMA has flags which must be 'sticky', that is ones
+ *                which should propagate to all VMAs, but the other does not,
+ *                the merge should still proceed with the merge logic applying
+ *                sticky flags to the final VMA.
+ */
+#define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
+
 #define FIRST_USER_ADDRESS	0UL
 #define USER_PGTABLES_CEILING	0UL
 
-- 
2.51.0




* [PATCH v3 4/8] mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-11-07 16:11 ` [PATCH v3 3/8] mm: implement sticky VMA flags Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install Lorenzo Stoakes
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Gather all the VMA flags whose presence implies that page tables must be
copied on fork into a single bitmap - VM_COPY_ON_FORK - and use this rather
than specifying individual flags in vma_needs_copy().

We also add VM_MAYBE_GUARD to this list, as it being set on a VMA implies
that there may be metadata contained in the page tables (that is, guard
markers) which will not and cannot be propagated upon fork.

This was already being done manually in vma_needs_copy(), but this makes it
very explicit, alongside VM_PFNMAP, VM_MIXEDMAP and VM_UFFD_WP, all of which
imply the same.

Note that VM_STICKY flags ought generally to be marked VM_COPY_ON_FORK too -
equally, a flag being VM_STICKY indicates that the VMA contains metadata that
is not reproduced by faulting pages in - i.e. that the VMA metadata does not
fully describe the VMA alone - and thus we must propagate whatever metadata
there is on fork.

However, for maximum flexibility, we do not make this necessarily the case
here.
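
As an illustration of the user-visible behaviour this preserves (a sketch
assuming userspace headers which define MADV_GUARD_INSTALL - with this
series the same now holds for file-backed mappings without requiring an
anon_vma):

  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          madvise(p, psz, MADV_GUARD_INSTALL);

          if (fork() == 0) {
                  /*
                   * The guard marker lives in the page tables, which
                   * VM_COPY_ON_FORK ensures are copied: this access
                   * faults (SIGSEGV) in the child just as in the parent.
                   */
                  return p[0];
          }
          return 0;
  }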

Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h               | 26 ++++++++++++++++++++++++++
 mm/memory.c                      | 18 ++++--------------
 tools/testing/vma/vma_internal.h | 26 ++++++++++++++++++++++++++
 3 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6c1c459e9acb..7946d01e88ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -556,6 +556,32 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
 
+/*
+ * Flags which should result in page tables being copied on fork. These are
+ * flags which indicate that the VMA's page tables contain state which cannot
+ * be reconstituted upon page fault, so necessitate page table copying on fork:
+ *
+ * VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
+ *                           reasonably reconstructed on page fault.
+ *
+ *              VM_UFFD_WP - Encodes metadata about an installed uffd
+ *                           write protect handler, which cannot be
+ *                           reconstructed on page fault.
+ *
+ *
+ *                           We always copy pgtables when dst_vma has uffd-wp
+ *                           enabled even if it's file-backed (e.g. shmem),
+ *                           because when uffd-wp is enabled the pgtable
+ *                           contains uffd-wp protection information which
+ *                           cannot be retrieved from the page cache; skipping
+ *                           the copy would lose that information.
+ *          VM_MAYBE_GUARD - Could contain page guard region markers which
+ *                           by design are a property of the page tables
+ *                           only and thus cannot be reconstructed on page
+ *                           fault.
+ */
+#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_GUARD)
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff --git a/mm/memory.c b/mm/memory.c
index 334732ab6733..5828cfe9679f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1465,25 +1465,15 @@ copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 static bool
 vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
+	if (src_vma->vm_flags & VM_COPY_ON_FORK)
+		return true;
 	/*
-	 * Always copy pgtables when dst_vma has uffd-wp enabled even if it's
-	 * file-backed (e.g. shmem). Because when uffd-wp is enabled, pgtable
-	 * contains uffd-wp protection information, that's something we can't
-	 * retrieve from page cache, and skip copying will lose those info.
+	 * The presence of an anon_vma indicates an anonymous VMA has page
+	 * tables which naturally cannot be reconstituted on page fault.
 	 */
-	if (userfaultfd_wp(dst_vma))
-		return true;
-
-	if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
-		return true;
-
 	if (src_vma->anon_vma)
 		return true;
 
-	/* Guard regions have modified page tables that require copying. */
-	if (src_vma->vm_flags & VM_MAYBE_GUARD)
-		return true;
-
 	/*
 	 * Don't copy ptes where a page fault will fill them correctly.  Fork
 	 * becomes much lighter when there are big shared or private readonly
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index a54990aa3009..9a0b2abb1a58 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -146,6 +146,32 @@ extern unsigned long dac_mmap_min_addr;
  */
 #define VM_IGNORE_MERGE (VM_SOFTDIRTY | VM_STICKY)
 
+/*
+ * Flags which should result in page tables being copied on fork. These are
+ * flags which indicate that the VMA's page tables contain state which cannot
+ * be reconstituted upon page fault, so necessitate page table copying on fork:
+ *
+ * VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot be
+ *                           reasonably reconstructed on page fault.
+ *
+ *              VM_UFFD_WP - Encodes metadata about an installed uffd
+ *                           write protect handler, which cannot be
+ *                           reconstructed on page fault.
+ *
+ *                           We always copy pgtables when dst_vma has uffd-wp
+ *                           enabled even if it's file-backed (e.g. shmem),
+ *                           because when uffd-wp is enabled the pgtable
+ *                           contains uffd-wp protection information which
+ *                           cannot be retrieved from the page cache; skipping
+ *                           the copy would lose that information.
+ *
+ *          VM_MAYBE_GUARD - Could contain page guard region markers which
+ *                           by design are a property of the page tables
+ *                           only and thus cannot be reconstructed on page
+ *                           fault.
+ */
+#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_GUARD)
+
 #define FIRST_USER_ADDRESS	0UL
 #define USER_PGTABLES_CEILING	0UL
 
-- 
2.51.0




* [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-11-07 16:11 ` [PATCH v3 4/8] mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-10 16:17   ` Vlastimil Babka
  2025-11-10 17:43   ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 6/8] tools/testing/vma: add VMA sticky userland tests Lorenzo Stoakes
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Now that we have established the VM_MAYBE_GUARD flag and added the capacity
to set it atomically, do so upon MADV_GUARD_INSTALL.

The places where this flag is currently used and matters are:

* VMA merge - performed under mmap/VMA write lock, therefore excluding
  racing writes.

* /proc/$pid/smaps - can race the write; however, this isn't meaningful, as
  the flag write is performed at the point of the guard region being
  established, and thus an smaps reader can't reasonably expect to avoid
  races. Due to atomicity, a reader will observe either the flag being set
  or not. Therefore consistency will be maintained.

In all other cases the flag being set is irrelevant and atomicity
guarantees other flags will be read correctly.

Note that non-atomic updates of unrelated flags do not cause an issue with
this flag being set atomically, as writes of other flags are performed
under mmap/VMA write lock, and these atomic writes are performed under
mmap/VMA read lock, which excludes the write, avoiding RMW races.

Note that we do not encounter issues with KCSAN by adjusting this flag
atomically, as we are only updating a single bit in the flag bitmap and
therefore we do not need to annotate these changes.

We intentionally set this flag in advance of actually updating the page
tables, to ensure that any racing atomic read of this flag will only return
false prior to page tables being updated, to allow for serialisation via
page table locks.

Note that we set vma->anon_vma for anonymous mappings. This is because the
expectation for anonymous mappings is that an anon_vma is established
should they possess any page table mappings. This is also consistent with
what we were doing prior to this patch (unconditionally setting anon_vma on
guard region installation).

We also need to update retract_page_tables() to ensure that madvise(...,
MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges containing
guard regions.

This was previously guarded by anon_vma being set to catch MAP_PRIVATE
cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
this flag instead.

We utilise vma_flag_test_atomic() to do so - we first perform an optimistic
check, then after the PTE page table lock is held, we can check again
safely, as upon guard marker install the flag is set atomically prior to
the page table lock being taken to actually apply it.

So if the initial optimistic check does not observe the flag, then either:

* Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
  being set - guard marker installation will be blocked until page table
  retraction is complete.

OR:

* Guard marker installation acquires page table lock after setting
  VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
  optimistic check, blocking page table retraction until the guard regions
  are installed - the second VM_MAYBE_GUARD check will prevent page table
  retraction.

Either way we're safe.
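
Schematically (an illustrative interleaving, not code from this series):

  /*
   * madvise(MADV_GUARD_INSTALL)        retract_page_tables()
   * (mmap/VMA read lock held)          (may hold no mmap/VMA lock)
   * ---------------------------        ---------------------------
   * set VM_MAYBE_GUARD atomically
   *                                    optimistic check sees the flag
   *                                    -> skip retraction
   * take PTE page table lock
   * install guard markers
   * release PTE page table lock
   *
   * If instead the optimistic check races and misses the flag, the
   * re-check under the PTE page table lock observes it, because the flag
   * is always set before that lock is taken to install markers.
   */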

We refactor the retraction checks into a single
file_backed_vma_is_retractable() helper; there doesn't seem to be any reason
for the checks to have been separated as they were before.

Note that VM_MAYBE_GUARD being set atomically remains correct as
vma_needs_copy() is invoked with the mmap and VMA write locks held,
excluding any race with madvise_guard_install().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/khugepaged.c    | 72 ++++++++++++++++++++++++++++++----------------
 mm/madvise.c       | 22 ++++++++------
 2 files changed, 62 insertions(+), 32 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a08673b0d8b..c75afeac4bbb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1711,6 +1711,43 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	return result;
 }
 
+/* Can we retract page tables for this file-backed VMA? */
+static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
+{
+	/*
+	 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
+	 * got written to. These VMAs are likely not worth removing
+	 * page tables from, as PMD-mapping is likely to be split later.
+	 */
+	if (READ_ONCE(vma->anon_vma))
+		return false;
+
+	/*
+	 * When a vma is registered with uffd-wp, we cannot recycle
+	 * the page table because there may be pte markers installed.
+	 * Other vmas can still have the same file mapped hugely, but
+	 * skip this one: it will always be mapped in small page size
+	 * for uffd-wp registered ranges.
+	 */
+	if (userfaultfd_wp(vma))
+		return false;
+
+	/*
+	 * If the VMA contains guard regions then we can't collapse it.
+	 *
+	 * This is set atomically on guard marker installation under mmap/VMA
+	 * read lock, and here we may not hold any VMA or mmap lock at all.
+	 *
+	 * This is therefore serialised on the PTE page table lock, which is
+	 * obtained on guard region installation after the flag is set, so this
+	 * check being performed under this lock excludes races.
+	 */
+	if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
+		return false;
+
+	return true;
+}
+
 static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma;
@@ -1725,14 +1762,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *ptl;
 		bool success = false;
 
-		/*
-		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-		 * got written to. These VMAs are likely not worth removing
-		 * page tables from, as PMD-mapping is likely to be split later.
-		 */
-		if (READ_ONCE(vma->anon_vma))
-			continue;
-
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
 		    vma->vm_end < addr + HPAGE_PMD_SIZE)
@@ -1744,14 +1773,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 
 		if (hpage_collapse_test_exit(mm))
 			continue;
-		/*
-		 * When a vma is registered with uffd-wp, we cannot recycle
-		 * the page table because there may be pte markers installed.
-		 * Other vmas can still have the same file mapped hugely, but
-		 * skip this one: it will always be mapped in small page size
-		 * for uffd-wp registered ranges.
-		 */
-		if (userfaultfd_wp(vma))
+
+		if (!file_backed_vma_is_retractable(vma))
 			continue;
 
 		/* PTEs were notified when unmapped; but now for the PMD? */
@@ -1778,15 +1801,16 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
 		/*
-		 * Huge page lock is still held, so normally the page table
-		 * must remain empty; and we have already skipped anon_vma
-		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
-		 * held, it is still possible for a racing userfaultfd_ioctl()
-		 * to have inserted ptes or markers.  Now that we hold ptlock,
-		 * repeating the anon_vma check protects from one category,
-		 * and repeating the userfaultfd_wp() check from another.
+		 * Huge page lock is still held, so normally the page table must
+		 * remain empty; and we have already skipped anon_vma and
+		 * userfaultfd_wp() vmas.  But since the mmap_lock is not held,
+		 * it is still possible for a racing userfaultfd_ioctl() or
+		 * madvise() to have inserted ptes or markers.  Now that we hold
+		 * ptlock, repeating the anon_vma check protects from one
+		 * category, and repeating the userfaultfd_wp() check from
+		 * another.
 		 */
-		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
+		if (likely(file_backed_vma_is_retractable(vma))) {
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			pmdp_get_lockless_sync();
 			success = true;
diff --git a/mm/madvise.c b/mm/madvise.c
index 67bdfcb315b3..de918b107cfc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1139,15 +1139,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		return -EINVAL;
 
 	/*
-	 * If we install guard markers, then the range is no longer
-	 * empty from a page table perspective and therefore it's
-	 * appropriate to have an anon_vma.
-	 *
-	 * This ensures that on fork, we copy page tables correctly.
+	 * Set atomically under read lock. All pertinent readers will need to
+	 * acquire an mmap/VMA write lock to read it. All remaining readers may
+	 * or may not see the flag set, but we don't care.
+	 */
+	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
+
+	/*
+	 * If anonymous and we are establishing page tables the VMA ought to
+	 * have an anon_vma associated with it.
 	 */
-	err = anon_vma_prepare(vma);
-	if (err)
-		return err;
+	if (vma_is_anonymous(vma)) {
+		err = anon_vma_prepare(vma);
+		if (err)
+			return err;
+	}
 
 	/*
 	 * Optimistically try to install the guard marker pages first. If any
-- 
2.51.0




* [PATCH v3 6/8] tools/testing/vma: add VMA sticky userland tests
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-11-07 16:11 ` [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 7/8] tools/testing/selftests/mm: add MADV_COLLAPSE test case Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 8/8] tools/testing/selftests/mm: add smaps visibility guard region test Lorenzo Stoakes
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Modify the existing 'merge new' and 'merge existing' userland VMA tests to
assert that sticky VMA flags behave as expected.

We do so by generating every possible permutation of the manipulated VMAs
being sticky/not sticky and asserting that VMA flags with this property are
retained upon merge.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/vma/vma.c | 89 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 79 insertions(+), 10 deletions(-)

diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 656e1c75b711..ee9d3547c421 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -48,6 +48,8 @@ static struct anon_vma dummy_anon_vma;
 #define ASSERT_EQ(_val1, _val2) ASSERT_TRUE((_val1) == (_val2))
 #define ASSERT_NE(_val1, _val2) ASSERT_TRUE((_val1) != (_val2))
 
+#define IS_SET(_val, _flags) (((_val) & (_flags)) == (_flags))
+
 static struct task_struct __current;
 
 struct task_struct *get_current(void)
@@ -441,7 +443,7 @@ static bool test_simple_shrink(void)
 	return true;
 }
 
-static bool test_merge_new(void)
+static bool __test_merge_new(bool is_sticky, bool a_is_sticky, bool b_is_sticky, bool c_is_sticky)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
 	struct mm_struct mm = {};
@@ -469,23 +471,32 @@ static bool test_merge_new(void)
 	struct vm_area_struct *vma, *vma_a, *vma_b, *vma_c, *vma_d;
 	bool merged;
 
+	if (is_sticky)
+		vm_flags |= VM_STICKY;
+
 	/*
 	 * 0123456789abc
 	 * AA B       CC
 	 */
 	vma_a = alloc_and_link_vma(&mm, 0, 0x2000, 0, vm_flags);
 	ASSERT_NE(vma_a, NULL);
+	if (a_is_sticky)
+		vm_flags_set(vma_a, VM_STICKY);
 	/* We give each VMA a single avc so we can test anon_vma duplication. */
 	INIT_LIST_HEAD(&vma_a->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_a.same_vma, &vma_a->anon_vma_chain);
 
 	vma_b = alloc_and_link_vma(&mm, 0x3000, 0x4000, 3, vm_flags);
 	ASSERT_NE(vma_b, NULL);
+	if (b_is_sticky)
+		vm_flags_set(vma_b, VM_STICKY);
 	INIT_LIST_HEAD(&vma_b->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_b.same_vma, &vma_b->anon_vma_chain);
 
 	vma_c = alloc_and_link_vma(&mm, 0xb000, 0xc000, 0xb, vm_flags);
 	ASSERT_NE(vma_c, NULL);
+	if (c_is_sticky)
+		vm_flags_set(vma_c, VM_STICKY);
 	INIT_LIST_HEAD(&vma_c->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_c.same_vma, &vma_c->anon_vma_chain);
 
@@ -520,6 +531,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky || a_is_sticky || b_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to PREVIOUS VMA.
@@ -537,6 +550,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky || a_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to NEXT VMA.
@@ -556,6 +571,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 3);
+	if (is_sticky) /* D uses is_sticky. */
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge BOTH sides.
@@ -574,6 +591,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (is_sticky || a_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge to NEXT VMA.
@@ -592,6 +611,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (is_sticky || c_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Merge BOTH sides.
@@ -609,6 +630,8 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 1);
+	if (is_sticky || a_is_sticky || c_is_sticky)
+		ASSERT_TRUE(IS_SET(vma->vm_flags, VM_STICKY));
 
 	/*
 	 * Final state.
@@ -637,6 +660,20 @@ static bool test_merge_new(void)
 	return true;
 }
 
+static bool test_merge_new(void)
+{
+	int i, j, k, l;
+
+	/* Generate every possible permutation of sticky flags. */
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < 2; j++)
+			for (k = 0; k < 2; k++)
+				for (l = 0; l < 2; l++)
+					ASSERT_TRUE(__test_merge_new(i, j, k, l));
+
+	return true;
+}
+
 static bool test_vma_merge_special_flags(void)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
@@ -973,9 +1010,11 @@ static bool test_vma_merge_new_with_close(void)
 	return true;
 }
 
-static bool test_merge_existing(void)
+static bool __test_merge_existing(bool prev_is_sticky, bool middle_is_sticky, bool next_is_sticky)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
+	vm_flags_t prev_flags = vm_flags;
+	vm_flags_t next_flags = vm_flags;
 	struct mm_struct mm = {};
 	VMA_ITERATOR(vmi, &mm, 0);
 	struct vm_area_struct *vma, *vma_prev, *vma_next;
@@ -988,6 +1027,13 @@ static bool test_merge_existing(void)
 	};
 	struct anon_vma_chain avc = {};
 
+	if (prev_is_sticky)
+		prev_flags |= VM_STICKY;
+	if (middle_is_sticky)
+		vm_flags |= VM_STICKY;
+	if (next_is_sticky)
+		next_flags |= VM_STICKY;
+
 	/*
 	 * Merge right case - partial span.
 	 *
@@ -1000,7 +1046,7 @@ static bool test_merge_existing(void)
 	 */
 	vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
 	vma->vm_ops = &vm_ops; /* This should have no impact. */
-	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
 	vma_next->vm_ops = &vm_ops; /* This should have no impact. */
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x6000, 3, vm_flags, &dummy_anon_vma);
 	vmg.middle = vma;
@@ -1018,6 +1064,8 @@ static bool test_merge_existing(void)
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_TRUE(vma_write_started(vma_next));
 	ASSERT_EQ(mm.map_count, 2);
+	if (middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1033,7 +1081,7 @@ static bool test_merge_existing(void)
 	 *   NNNNNNN
 	 */
 	vma = alloc_and_link_vma(&mm, 0x2000, 0x6000, 2, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x6000, 0x9000, 6, next_flags);
 	vma_next->vm_ops = &vm_ops; /* This should have no impact. */
 	vmg_set_range_anon_vma(&vmg, 0x2000, 0x6000, 2, vm_flags, &dummy_anon_vma);
 	vmg.middle = vma;
@@ -1046,6 +1094,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_next));
 	ASSERT_EQ(mm.map_count, 1);
+	if (middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_next->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1060,7 +1110,7 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPV
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
 	vma->vm_ops = &vm_ops; /* This should have no impact. */
@@ -1080,6 +1130,8 @@ static bool test_merge_existing(void)
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_EQ(mm.map_count, 2);
+	if (prev_is_sticky || middle_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1094,7 +1146,7 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPP
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
@@ -1109,6 +1161,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_EQ(mm.map_count, 1);
+	if (prev_is_sticky || middle_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1123,10 +1177,10 @@ static bool test_merge_existing(void)
 	 * 0123456789
 	 * PPPPPPPPPP
 	 */
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma_prev->vm_ops = &vm_ops; /* This should have no impact. */
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x7000, 3, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x7000, 0x9000, 7, next_flags);
 	vmg_set_range_anon_vma(&vmg, 0x3000, 0x7000, 3, vm_flags, &dummy_anon_vma);
 	vmg.prev = vma_prev;
 	vmg.middle = vma;
@@ -1139,6 +1193,8 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_EQ(mm.map_count, 1);
+	if (prev_is_sticky || middle_is_sticky || next_is_sticky)
+		ASSERT_TRUE(IS_SET(vma_prev->vm_flags, VM_STICKY));
 
 	/* Clear down and reset. We should have deleted prev and next. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1158,9 +1214,9 @@ static bool test_merge_existing(void)
 	 * PPPVVVVVNNN
 	 */
 
-	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, vm_flags);
+	vma_prev = alloc_and_link_vma(&mm, 0, 0x3000, 0, prev_flags);
 	vma = alloc_and_link_vma(&mm, 0x3000, 0x8000, 3, vm_flags);
-	vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, vm_flags);
+	vma_next = alloc_and_link_vma(&mm, 0x8000, 0xa000, 8, next_flags);
 
 	vmg_set_range(&vmg, 0x4000, 0x5000, 4, vm_flags);
 	vmg.prev = vma;
@@ -1203,6 +1259,19 @@ static bool test_merge_existing(void)
 	return true;
 }
 
+static bool test_merge_existing(void)
+{
+	int i, j, k;
+
+	/* Generate every possible permutation of sticky flags. */
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < 2; j++)
+			for (k = 0; k < 2; k++)
+				ASSERT_TRUE(__test_merge_existing(i, j, k));
+
+	return true;
+}
+
 static bool test_anon_vma_non_mergeable(void)
 {
 	vm_flags_t vm_flags = VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE;
-- 
2.51.0




* [PATCH v3 7/8] tools/testing/selftests/mm: add MADV_COLLAPSE test case
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-11-07 16:11 ` [PATCH v3 6/8] tools/testing/vma: add VMA sticky userland tests Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  2025-11-07 16:11 ` [PATCH v3 8/8] tools/testing/selftests/mm: add smaps visibility guard region test Lorenzo Stoakes
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

To ensure the retract_page_tables() logic functions correctly with the
introduction of VM_MAYBE_GUARD, add a test to assert that madvise collapse
fails when guard regions are established in the collapsed range in all
cases.

Unfortunately we cannot differentiate between e.g.
CONFIG_READ_ONLY_THP_FOR_FS not being set vs. a file-backed VMA having
collapse correctly disallowed - madvise() fails either way - so in each
instance the assertion will pass here.

We add an additional check that guard regions are preserved across the
collapse in case a bug causes it to succeed, which will give us more data
to debug with should this occur in future.
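
In outline, the crux of the test is as follows (a sketch only - the full
version below also re-opens the local file read-only so that
CONFIG_READ_ONLY_THP_FOR_FS can apply):

	/* Install a guard region in every other page... */
	ASSERT_EQ(madvise(ptr_page, page_size, MADV_GUARD_INSTALL), 0);
	/* ...then attempt to collapse the range - this must fail. */
	EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);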

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/guard-regions.c | 65 ++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index 8dd81c0a4a5a..c549bcd6160b 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -2138,4 +2138,69 @@ TEST_F(guard_regions, pagemap_scan)
 	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
 }
 
+TEST_F(guard_regions, collapse)
+{
+	const unsigned long page_size = self->page_size;
+	const unsigned long size = 2 * HPAGE_SIZE;
+	const unsigned long num_pages = size / page_size;
+	char *ptr;
+	int i;
+
+	/* Need the file to be the correct size for non-anon tests. */
+	if (variant->backing != ANON_BACKED)
+		ASSERT_EQ(ftruncate(self->fd, size), 0);
+
+	/*
+	 * For the local-file backed case we must close and re-open the backing
+	 * file as read-only for CONFIG_READ_ONLY_THP_FOR_FS to work.
+	 */
+	if (variant->backing == LOCAL_FILE_BACKED) {
+		ASSERT_EQ(close(self->fd), 0);
+
+		self->fd = open(self->path, O_RDONLY);
+		ASSERT_GE(self->fd, 0);
+	}
+
+	ptr = mmap_(self, variant, NULL, size, PROT_READ, 0, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* Prevent being faulted-in as huge. */
+	ASSERT_EQ(madvise(ptr, size, MADV_NOHUGEPAGE), 0);
+	/* Fault in. */
+	ASSERT_EQ(madvise(ptr, size, MADV_POPULATE_READ), 0);
+
+	/* Install a guard region in every other page. */
+	for (i = 0; i < num_pages; i += 2) {
+		char *ptr_page = &ptr[i * page_size];
+
+		ASSERT_EQ(madvise(ptr_page, page_size, MADV_GUARD_INSTALL), 0);
+		/* Accesses should now fail. */
+		ASSERT_FALSE(try_read_buf(ptr_page));
+	}
+
+	/* Allow huge page throughout region. */
+	ASSERT_EQ(madvise(ptr, size, MADV_HUGEPAGE), 0);
+
+	/*
+	 * Now try to collapse the entire region. This should fail in all cases.
+	 *
+	 * The madvise() call will also fail if CONFIG_READ_ONLY_THP_FOR_FS is
+	 * not set for the local file case, but we can't differentiate whether
+	 * this occurred or if the collapse was rightly rejected.
+	 */
+	EXPECT_NE(madvise(ptr, size, MADV_COLLAPSE), 0);
+
+	/*
+	 * If we introduce a bug that causes the collapse to succeed, gather
+	 * data on whether guard regions are at least preserved. The test will
+	 * fail at this point in any case.
+	 */
+	for (i = 0; i < num_pages; i += 2) {
+		char *ptr_page = &ptr[i * page_size];
+
+		/* Accesses should still fail. */
+		ASSERT_FALSE(try_read_buf(ptr_page));
+	}
+}
+
 TEST_HARNESS_MAIN
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 8/8] tools/testing/selftests/mm: add smaps visibility guard region test
  2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2025-11-07 16:11 ` [PATCH v3 7/8] tools/testing/selftests/mm: add MADV_COLLAPSE test case Lorenzo Stoakes
@ 2025-11-07 16:11 ` Lorenzo Stoakes
  7 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Assert that we observe guard regions appearing in /proc/$pid/smaps as
expected, including when splits and merges are performed (with the expected
sticky behaviour).

Also add handling for file systems which don't sanely handle mmap() VMA
merging, so we don't incorrectly encounter a test failure in this
situation.
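
For reference, the guard flag is reported as 'gu' in the VmFlags line of
the relevant /proc/$pid/smaps entry, along these lines (illustrative
output - the other flags shown will vary by mapping):

	VmFlags: rd wr mr mw me ac gu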

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/selftests/mm/guard-regions.c | 120 +++++++++++++++++++++
 tools/testing/selftests/mm/vm_util.c       |   5 +
 tools/testing/selftests/mm/vm_util.h       |   1 +
 3 files changed, 126 insertions(+)

diff --git a/tools/testing/selftests/mm/guard-regions.c b/tools/testing/selftests/mm/guard-regions.c
index c549bcd6160b..795bf3f39f44 100644
--- a/tools/testing/selftests/mm/guard-regions.c
+++ b/tools/testing/selftests/mm/guard-regions.c
@@ -94,6 +94,7 @@ static void *mmap_(FIXTURE_DATA(guard_regions) * self,
 	case ANON_BACKED:
 		flags |= MAP_PRIVATE | MAP_ANON;
 		fd = -1;
+		offset = 0;
 		break;
 	case SHMEM_BACKED:
 	case LOCAL_FILE_BACKED:
@@ -260,6 +261,54 @@ static bool is_buf_eq(char *buf, size_t size, char chr)
 	return true;
 }
 
+/*
+ * Some file systems have issues with merging because they change
+ * merge-sensitive parameters in the .mmap callback, and until .mmap_prepare
+ * is implemented everywhere this results in an unexpected failure to merge
+ * (e.g. overlayfs).
+ *
+ * Perform a simple test to see if the local file system suffers from this;
+ * if it does then we can skip test logic that assumes local file system
+ * merging is sane.
+ */
+static bool local_fs_has_sane_mmap(FIXTURE_DATA(guard_regions) * self,
+				   const FIXTURE_VARIANT(guard_regions) * variant)
+{
+	const unsigned long page_size = self->page_size;
+	char *ptr, *ptr2;
+	struct procmap_fd procmap;
+
+	if (variant->backing != LOCAL_FILE_BACKED)
+		return true;
+
+	/* Map 10 pages. */
+	ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+	if (ptr == MAP_FAILED)
+		return false;
+	/* Unmap the middle. */
+	munmap(&ptr[5 * page_size], page_size);
+
+	/* Map again. */
+	ptr2 = mmap_(self, variant, &ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
+		     MAP_FIXED, 5 * page_size);
+
+	if (ptr2 == MAP_FAILED)
+		return false;
+
+	/* Now make sure they all merged. */
+	if (open_self_procmap(&procmap) != 0)
+		return false;
+	if (!find_vma_procmap(&procmap, ptr))
+		return false;
+	if (procmap.query.vma_start != (unsigned long)ptr)
+		return false;
+	if (procmap.query.vma_end != (unsigned long)ptr + 10 * page_size)
+		return false;
+	close_procmap(&procmap);
+
+	return true;
+}
+
 FIXTURE_SETUP(guard_regions)
 {
 	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
@@ -2203,4 +2252,75 @@ TEST_F(guard_regions, collapse)
 	}
 }
 
+TEST_F(guard_regions, smaps)
+{
+	const unsigned long page_size = self->page_size;
+	struct procmap_fd procmap;
+	char *ptr, *ptr2;
+	int i;
+
+	/* Map a region. */
+	ptr = mmap_(self, variant, NULL, 10 * page_size, PROT_READ | PROT_WRITE, 0, 0);
+	ASSERT_NE(ptr, MAP_FAILED);
+
+	/* We shouldn't yet see a guard flag. */
+	ASSERT_FALSE(check_vmflag_guard(ptr));
+
+	/* Install a single guard region. */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_INSTALL), 0);
+
+	/* Now we should see a guard flag. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+
+	/*
+	 * Removing the guard region should not change things because we simply
+	 * cannot accurately track whether a given VMA has had all of its guard
+	 * regions removed.
+	 */
+	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_REMOVE), 0);
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+
+	/* Install guard regions throughout. */
+	for (i = 0; i < 10; i++) {
+		ASSERT_EQ(madvise(&ptr[i * page_size], page_size, MADV_GUARD_INSTALL), 0);
+		/* We should always see the guard region flag. */
+		ASSERT_TRUE(check_vmflag_guard(ptr));
+	}
+
+	/* Split into two VMAs. */
+	ASSERT_EQ(munmap(&ptr[4 * page_size], page_size), 0);
+
+	/* Both VMAs should have the guard flag set. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+	ASSERT_TRUE(check_vmflag_guard(&ptr[5 * page_size]));
+
+	/*
+	 * If the local file system is unable to merge VMAs due to having
+	 * unusual characteristics, there is no point in asserting merge
+	 * behaviour.
+	 */
+	if (!local_fs_has_sane_mmap(self, variant)) {
+		TH_LOG("local filesystem does not support sane merging, skipping merge test");
+		return;
+	}
+
+	/* Map a fresh VMA between the two split VMAs. */
+	ptr2 = mmap_(self, variant, &ptr[4 * page_size], page_size,
+		     PROT_READ | PROT_WRITE, MAP_FIXED, 4 * page_size);
+	ASSERT_NE(ptr2, MAP_FAILED);
+
+	/*
+	 * Check the procmap to ensure that this VMA merged with the adjacent
+	 * two. The guard region flag is 'sticky' so should not preclude
+	 * merging.
+	 */
+	ASSERT_EQ(open_self_procmap(&procmap), 0);
+	ASSERT_TRUE(find_vma_procmap(&procmap, ptr));
+	ASSERT_EQ(procmap.query.vma_start, (unsigned long)ptr);
+	ASSERT_EQ(procmap.query.vma_end, (unsigned long)ptr + 10 * page_size);
+	ASSERT_EQ(close_procmap(&procmap), 0);
+	/* And, of course, this VMA should have the guard flag set. */
+	ASSERT_TRUE(check_vmflag_guard(ptr));
+}
+
 TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index e33cda301dad..605cb58ea5c3 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -449,6 +449,11 @@ bool check_vmflag_pfnmap(void *addr)
 	return check_vmflag(addr, "pf");
 }
 
+bool check_vmflag_guard(void *addr)
+{
+	return check_vmflag(addr, "gu");
+}
+
 bool softdirty_supported(void)
 {
 	char *addr;
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 26c30fdc0241..a8abdf414d46 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -98,6 +98,7 @@ int uffd_register_with_ioctls(int uffd, void *addr, uint64_t len,
 unsigned long get_free_hugepages(void);
 bool check_vmflag_io(void *addr);
 bool check_vmflag_pfnmap(void *addr);
+bool check_vmflag_guard(void *addr);
 int open_procmap(pid_t pid, struct procmap_fd *procmap_out);
 int query_procmap(struct procmap_fd *procmap);
 bool find_vma_procmap(struct procmap_fd *procmap, void *address);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
@ 2025-11-10 15:51   ` Vlastimil Babka
  2025-11-10 17:34     ` Lorenzo Stoakes
  2025-11-10 17:36   ` Lorenzo Stoakes
  2025-11-10 17:59   ` Lorenzo Stoakes
  2 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-11-10 15:51 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, Pedro Falcato,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, linux-kernel, linux-fsdevel, linux-doc,
	linux-mm, linux-trace-kernel, linux-kselftest, Andrei Vagin

On 11/7/25 17:11, Lorenzo Stoakes wrote:
> This patch adds the ability to atomically set VMA flags with only the mmap
> read/VMA read lock held.
> 
> As this could be hugely problematic for VMA flags in general given that all
> other accesses are non-atomic and serialised by the mmap/VMA locks, we
> implement this with a strict allow-list - that is, only designated flags
> are allowed to do this.
> 
> We make VM_MAYBE_GUARD one of these flags.
> 
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/mm.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2a5516bff75a..699566c21ff7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
>  /* This mask represents all the VMA flag bits used by mlock */
>  #define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
>  
> +/* These flags can be updated atomically via VMA/mmap read lock. */
> +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
> +
>  /* Arch-specific flags to clear when updating VM flags on protection change */
>  #ifndef VM_ARCH_CLEAR
>  # define VM_ARCH_CLEAR	VM_NONE
> @@ -860,6 +863,45 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
>  	__vm_flags_mod(vma, set, clear);
>  }
>  
> +static inline bool __vma_flag_atomic_valid(struct vm_area_struct *vma,
> +				       int bit)
> +{
> +	const vm_flags_t mask = BIT(bit);
> +
> +	/* Only specific flags are permitted */
> +	if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
> +		return false;
> +
> +	return true;
> +}
> +
> +/*
> + * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
> + * valid flags are allowed to do this.
> + */
> +static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
> +{
> +	/* mmap read lock/VMA read lock must be held. */
> +	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> +		vma_assert_locked(vma);
> +
> +	if (__vma_flag_atomic_valid(vma, bit))
> +		set_bit(bit, &vma->__vm_flags);
> +}
> +
> +/*
> + * Test for VMA flag atomically. Requires no locks. Only specific valid flags
> + * are allowed to do this.
> + *
> + * This is necessarily racy, so callers must ensure that serialisation is
> + * achieved through some other means, or that races are permissible.
> + */
> +static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
> +{
> +	if (__vma_flag_atomic_valid(vma, bit))
> +		return test_bit(bit, &vma->__vm_flags);
> +}

Hm clang is unhappy here.

./include/linux/mm.h:932:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type]
  932 | }
      | ^
1 error generated.

> I don't have CONFIG_WERROR enabled though, so I'm not sure why it's not just a
warning, as the function is unused until patch 5/8 which adds a "return
false" here. So it's just a potential bisection annoyance with clang.

Andrew, could you move that hunk from patch 5/8 to this patch? Thanks.

> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>  	vma->vm_ops = NULL;



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install
  2025-11-07 16:11 ` [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install Lorenzo Stoakes
@ 2025-11-10 16:17   ` Vlastimil Babka
  2025-11-10 17:37     ` Lorenzo Stoakes
  2025-11-10 17:43   ` Lorenzo Stoakes
  1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2025-11-10 16:17 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jann Horn, Pedro Falcato,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, linux-kernel, linux-fsdevel, linux-doc,
	linux-mm, linux-trace-kernel, linux-kselftest, Andrei Vagin

On 11/7/25 17:11, Lorenzo Stoakes wrote:
> Now we have established the VM_MAYBE_GUARD flag and added the capacity to
> set it atomically, do so upon MADV_GUARD_INSTALL.
> 
> The places where this flag is used currently and matter are:
> 
> * VMA merge - performed under mmap/VMA write lock, therefore excluding
>   racing writes.
> 
> * /proc/$pid/smaps - can race the write, however this isn't meaningful as
>   the flag write is performed at the point of the guard region being
>   established, and thus an smaps reader can't reasonably expect to avoid
>   races. Due to atomicity, a reader will observe either the flag being set
>   or not. Therefore consistency will be maintained.
> 
> In all other cases the flag being set is irrelevant and atomicity
> guarantees other flags will be read correctly.
> 
> Note that non-atomic updates of unrelated flags do not cause an issue with
> this flag being set atomically, as writes of other flags are performed
> under mmap/VMA write lock, and these atomic writes are performed under
> mmap/VMA read lock, which excludes the write, avoiding RMW races.
> 
> Note that we do not encounter issues with KCSAN by adjusting this flag
> atomically, as we are only updating a single bit in the flag bitmap and
> therefore we do not need to annotate these changes.
> 
> We intentionally set this flag in advance of actually updating the page
> tables, to ensure that any racing atomic read of this flag will only return
> false prior to page tables being updated, to allow for serialisation via
> page table locks.
> 
> Note that we set vma->anon_vma for anonymous mappings. This is because the
> expectation for anonymous mappings is that an anon_vma is established
> should they possess any page table mappings. This is also consistent with
> what we were doing prior to this patch (unconditionally setting anon_vma on
> guard region installation).
> 
> We also need to update retract_page_tables() to ensure that madvise(...,
> MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges containing
> guard regions.
> 
> This was previously guarded by anon_vma being set to catch MAP_PRIVATE
> cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
> this flag instead.
> 
> We utilise vma_flag_test_atomic() to do so - we first perform an optimistic
> check, then after the PTE page table lock is held, we can check again
> safely, as upon guard marker install the flag is set atomically prior to
> the page table lock being taken to actually apply it.
> 
> So if the initial check fails either:
> 
> * Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
>   being set - guard marker installation will be blocked until page table
>   retraction is complete.
> 
> OR:
> 
> * Guard marker installation acquires page table lock after setting
>   VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
>   optimistic check, blocking page table retraction until the guard regions
>   are installed - the second VM_MAYBE_GUARD check will prevent page table
>   retraction.
> 
> Either way we're safe.
> 
> We refactor the retraction checks into a single
> file_backed_vma_is_retractable(); there doesn't seem to be any reason that
> the checks were separated as before.
> 
> Note that VM_MAYBE_GUARD being set atomically remains correct as
> vma_needs_copy() is invoked with the mmap and VMA write locks held,
> excluding any race with madvise_guard_install().
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Small nit below:

> @@ -1778,15 +1801,16 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>  
>  		/*
> -		 * Huge page lock is still held, so normally the page table
> -		 * must remain empty; and we have already skipped anon_vma
> -		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> -		 * held, it is still possible for a racing userfaultfd_ioctl()
> -		 * to have inserted ptes or markers.  Now that we hold ptlock,
> -		 * repeating the anon_vma check protects from one category,
> -		 * and repeating the userfaultfd_wp() check from another.
> +		 * Huge page lock is still held, so normally the page table must
> +		 * remain empty; and we have already skipped anon_vma and
> +		 * userfaultfd_wp() vmas.  But since the mmap_lock is not held,
> +		 * it is still possible for a racing userfaultfd_ioctl() or
> +		 * madvise() to have inserted ptes or markers.  Now that we hold
> +		 * ptlock, repeating the anon_vma check protects from one
> +		 * category, and repeating the userfaultfd_wp() check from
> +		 * another.

The last part of the comment is unchanged and mentions anon_vma check and
userfaultfd_wp() check which were there explicitly originally, but now it's
a file_backed_vma_is_retractable() check that also includes the guard region
> check, so maybe it could be updated?

>  		 */
> -		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> +		if (likely(file_backed_vma_is_retractable(vma))) {
>  			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
>  			pmdp_get_lockless_sync();
>  			success = true;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 67bdfcb315b3..de918b107cfc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1139,15 +1139,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
>  		return -EINVAL;
>  
>  	/*
> -	 * If we install guard markers, then the range is no longer
> -	 * empty from a page table perspective and therefore it's
> -	 * appropriate to have an anon_vma.
> -	 *
> -	 * This ensures that on fork, we copy page tables correctly.
> +	 * Set atomically under read lock. All pertinent readers will need to
> +	 * acquire an mmap/VMA write lock to read it. All remaining readers may
> +	 * or may not see the flag set, but we don't care.
> +	 */
> +	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
> +
> +	/*
> +	 * If anonymous and we are establishing page tables the VMA ought to
> +	 * have an anon_vma associated with it.
>  	 */
> -	err = anon_vma_prepare(vma);
> -	if (err)
> -		return err;
> +	if (vma_is_anonymous(vma)) {
> +		err = anon_vma_prepare(vma);
> +		if (err)
> +			return err;
> +	}
>  
>  	/*
>  	 * Optimistically try to install the guard marker pages first. If any



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-10 15:51   ` Vlastimil Babka
@ 2025-11-10 17:34     ` Lorenzo Stoakes
  0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

On Mon, Nov 10, 2025 at 04:51:27PM +0100, Vlastimil Babka wrote:
> On 11/7/25 17:11, Lorenzo Stoakes wrote:
> > This patch adds the ability to atomically set VMA flags with only the mmap
> > read/VMA read lock held.
> >
> > As this could be hugely problematic for VMA flags in general given that all
> > other accesses are non-atomic and serialised by the mmap/VMA locks, we
> > implement this with a strict allow-list - that is, only designated flags
> > are allowed to do this.
> >
> > We make VM_MAYBE_GUARD one of these flags.
> >
> > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  include/linux/mm.h | 42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 42 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 2a5516bff75a..699566c21ff7 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -518,6 +518,9 @@ extern unsigned int kobjsize(const void *objp);
> >  /* This mask represents all the VMA flag bits used by mlock */
> >  #define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
> >
> > +/* These flags can be updated atomically via VMA/mmap read lock. */
> > +#define VM_ATOMIC_SET_ALLOWED VM_MAYBE_GUARD
> > +
> >  /* Arch-specific flags to clear when updating VM flags on protection change */
> >  #ifndef VM_ARCH_CLEAR
> >  # define VM_ARCH_CLEAR	VM_NONE
> > @@ -860,6 +863,45 @@ static inline void vm_flags_mod(struct vm_area_struct *vma,
> >  	__vm_flags_mod(vma, set, clear);
> >  }
> >
> > +static inline bool __vma_flag_atomic_valid(struct vm_area_struct *vma,
> > +				       int bit)
> > +{
> > +	const vm_flags_t mask = BIT(bit);
> > +
> > +	/* Only specific flags are permitted */
> > +	if (WARN_ON_ONCE(!(mask & VM_ATOMIC_SET_ALLOWED)))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +/*
> > + * Set VMA flag atomically. Requires only VMA/mmap read lock. Only specific
> > + * valid flags are allowed to do this.
> > + */
> > +static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
> > +{
> > +	/* mmap read lock/VMA read lock must be held. */
> > +	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> > +		vma_assert_locked(vma);
> > +
> > +	if (__vma_flag_atomic_valid(vma, bit))
> > +		set_bit(bit, &vma->__vm_flags);
> > +}
> > +
> > +/*
> > + * Test for VMA flag atomically. Requires no locks. Only specific valid flags
> > + * are allowed to do this.
> > + *
> > + * This is necessarily racy, so callers must ensure that serialisation is
> > + * achieved through some other means, or that races are permissible.
> > + */
> > +static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
> > +{
> > +	if (__vma_flag_atomic_valid(vma, bit))
> > +		return test_bit(bit, &vma->__vm_flags);
> > +}
>
> Hm clang is unhappy here.
>
> ./include/linux/mm.h:932:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type]
>   932 | }
>       | ^
> 1 error generated.

Yeah fun that gcc doesn't highlight this, god knows why not.

I thought I had fixed this (I remember it coming up in testing) but clearly I
fixed it at the wrong commit.

>
> I don't have CONFIG_WERROR enabled though, so I'm not sure why it's not just a
> warning, as the function is unused until patch 5/8 which adds a "return
> false" here. So it's just a potential bisection annoyance with clang.
>
> Andrew, could you move that hunk from patch 5/8 to this patch? Thanks.

I don't think this is the right solution.

Let's just add a return false. Will send a fix-patch.

>
> > +
> >  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> >  {
> >  	vma->vm_ops = NULL;
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
  2025-11-10 15:51   ` Vlastimil Babka
@ 2025-11-10 17:36   ` Lorenzo Stoakes
  2025-11-10 17:49     ` Lorenzo Stoakes
  2025-11-10 17:59   ` Lorenzo Stoakes
  2 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Hi Andrew,

Please apply this trivial fix-patch.

Thanks, Lorenzo

----8<----

From e73da6d99f6e32c959c7a852a90f03c9c76816c6 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 10 Nov 2025 17:35:11 +0000
Subject: [PATCH] fixup

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 699566c21ff7..e94005f2b985 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -900,6 +900,8 @@ static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
 {
 	if (__vma_flag_atomic_valid(vma, bit))
 		return test_bit(bit, &vma->__vm_flags);
+
+	return false;
 }

 static inline void vma_set_anonymous(struct vm_area_struct *vma)
--
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install
  2025-11-10 16:17   ` Vlastimil Babka
@ 2025-11-10 17:37     ` Lorenzo Stoakes
  0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Jonathan Corbet, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

On Mon, Nov 10, 2025 at 05:17:13PM +0100, Vlastimil Babka wrote:
> On 11/7/25 17:11, Lorenzo Stoakes wrote:
> > Now we have established the VM_MAYBE_GUARD flag and added the capacity to
> > set it atomically, do so upon MADV_GUARD_INSTALL.
> >
> > The places where this flag is used currently and matter are:
> >
> > * VMA merge - performed under mmap/VMA write lock, therefore excluding
> >   racing writes.
> >
> > * /proc/$pid/smaps - can race the write, however this isn't meaningful as
> >   the flag write is performed at the point of the guard region being
> >   established, and thus an smaps reader can't reasonably expect to avoid
> >   races. Due to atomicity, a reader will observe either the flag being set
> >   or not. Therefore consistency will be maintained.
> >
> > In all other cases the flag being set is irrelevant and atomicity
> > guarantees other flags will be read correctly.
> >
> > Note that non-atomic updates of unrelated flags do not cause an issue with
> > this flag being set atomically, as writes of other flags are performed
> > under mmap/VMA write lock, and these atomic writes are performed under
> > mmap/VMA read lock, which excludes the write, avoiding RMW races.
> >
> > Note that we do not encounter issues with KCSAN by adjusting this flag
> > atomically, as we are only updating a single bit in the flag bitmap and
> > therefore we do not need to annotate these changes.
> >
> > We intentionally set this flag in advance of actually updating the page
> > tables, to ensure that any racing atomic read of this flag will only return
> > false prior to page tables being updated, to allow for serialisation via
> > page table locks.
> >
> > Note that we set vma->anon_vma for anonymous mappings. This is because the
> > expectation for anonymous mappings is that an anon_vma is established
> > should they possess any page table mappings. This is also consistent with
> > what we were doing prior to this patch (unconditionally setting anon_vma on
> > guard region installation).
> >
> > We also need to update retract_page_tables() to ensure that madvise(...,
> > MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges containing
> > guard regions.
> >
> > This was previously guarded by anon_vma being set to catch MAP_PRIVATE
> > cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
> > this flag instead.
> >
> > We utilise vma_flag_test_atomic() to do so - we first perform an optimistic
> > check, then after the PTE page table lock is held, we can check again
> > safely, as upon guard marker install the flag is set atomically prior to
> > the page table lock being taken to actually apply it.
> >
> > So if the initial check fails either:
> >
> > * Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
> >   being set - guard marker installation will be blocked until page table
> >   retraction is complete.
> >
> > OR:
> >
> > * Guard marker installation acquires page table lock after setting
> >   VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
> >   optimistic check, blocking page table retraction until the guard regions
> >   are installed - the second VM_MAYBE_GUARD check will prevent page table
> >   retraction.
> >
> > Either way we're safe.
> >
> > We refactor the retraction checks into a single
> > file_backed_vma_is_retractable(); there doesn't seem to be any reason that
> > the checks were separated as before.
> >
> > Note that VM_MAYBE_GUARD being set atomically remains correct as
> > vma_needs_copy() is invoked with the mmap and VMA write locks held,
> > excluding any race with madvise_guard_install().
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks

>
> Small nit below:
>
> > @@ -1778,15 +1801,16 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> >  			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> >
> >  		/*
> > -		 * Huge page lock is still held, so normally the page table
> > -		 * must remain empty; and we have already skipped anon_vma
> > -		 * and userfaultfd_wp() vmas.  But since the mmap_lock is not
> > -		 * held, it is still possible for a racing userfaultfd_ioctl()
> > -		 * to have inserted ptes or markers.  Now that we hold ptlock,
> > -		 * repeating the anon_vma check protects from one category,
> > -		 * and repeating the userfaultfd_wp() check from another.
> > +		 * Huge page lock is still held, so normally the page table must
> > +		 * remain empty; and we have already skipped anon_vma and
> > +		 * userfaultfd_wp() vmas.  But since the mmap_lock is not held,
> > +		 * it is still possible for a racing userfaultfd_ioctl() or
> > +		 * madvise() to have inserted ptes or markers.  Now that we hold
> > +		 * ptlock, repeating the anon_vma check protects from one
> > +		 * category, and repeating the userfaultfd_wp() check from
> > +		 * another.
>
> The last part of the comment is unchanged and mentions anon_vma check and
> userfaultfd_wp() check which were there explicitly originally, but now it's
> a file_backed_vma_is_retractable() check that also includes the guard region
> check, so maybe it could be updated?

OK will send fix-patch.

>
> >  		 */
> > -		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> > +		if (likely(file_backed_vma_is_retractable(vma))) {
> >  			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> >  			pmdp_get_lockless_sync();
> >  			success = true;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 67bdfcb315b3..de918b107cfc 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1139,15 +1139,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
> >  		return -EINVAL;
> >
> >  	/*
> > -	 * If we install guard markers, then the range is no longer
> > -	 * empty from a page table perspective and therefore it's
> > -	 * appropriate to have an anon_vma.
> > -	 *
> > -	 * This ensures that on fork, we copy page tables correctly.
> > +	 * Set atomically under read lock. All pertinent readers will need to
> > +	 * acquire an mmap/VMA write lock to read it. All remaining readers may
> > +	 * or may not see the flag set, but we don't care.
> > +	 */
> > +	vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
> > +
> > +	/*
> > +	 * If anonymous and we are establishing page tables the VMA ought to
> > +	 * have an anon_vma associated with it.
> >  	 */
> > -	err = anon_vma_prepare(vma);
> > -	if (err)
> > -		return err;
> > +	if (vma_is_anonymous(vma)) {
> > +		err = anon_vma_prepare(vma);
> > +		if (err)
> > +			return err;
> > +	}
> >
> >  	/*
> >  	 * Optimistically try to install the guard marker pages first. If any
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install
  2025-11-07 16:11 ` [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install Lorenzo Stoakes
  2025-11-10 16:17   ` Vlastimil Babka
@ 2025-11-10 17:43   ` Lorenzo Stoakes
  1 sibling, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Hi Andrew,

Please apply this trivial comment fixup fix-patch.

Thanks, Lorenzo

----8<----
From ed78447f613dc9ef16ce9d7a7e43de993379d9f5 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 10 Nov 2025 17:41:15 +0000
Subject: [PATCH] fixup

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/khugepaged.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c75afeac4bbb..742b47e0fb75 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1806,9 +1806,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		* userfaultfd_wp() vmas.  But since the mmap_lock is not held,
 		* it is still possible for a racing userfaultfd_ioctl() or
 		* madvise() to have inserted ptes or markers.  Now that we hold
-		* ptlock, repeating the anon_vma check protects from one
-		* category, and repeating the userfaultfd_wp() check from
-		* another.
+		* ptlock, repeating the retractable checks protects us from
+		* races against the prior checks.
 		*/
 		if (likely(file_backed_vma_is_retractable(vma))) {
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
--
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-10 17:36   ` Lorenzo Stoakes
@ 2025-11-10 17:49     ` Lorenzo Stoakes
  0 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Andrew - actually please ignore this, let me send another that'll fold this in
and make sparse happy too.

Cheers, Lorenzo

On Mon, Nov 10, 2025 at 05:36:29PM +0000, Lorenzo Stoakes wrote:
> Hi Andrew,
>
> Please apply this trivial fix-patch.
>
> Thanks, Lorenzo
>
> ----8<----
>
> From e73da6d99f6e32c959c7a852a90f03c9c76816c6 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Mon, 10 Nov 2025 17:35:11 +0000
> Subject: [PATCH] fixup
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/mm.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 699566c21ff7..e94005f2b985 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -900,6 +900,8 @@ static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
>  {
>  	if (__vma_flag_atomic_valid(vma, bit))
>  		return test_bit(bit, &vma->__vm_flags);
> +
> +	return false;
>  }
>
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
> --
> 2.51.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such
  2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
  2025-11-10 15:51   ` Vlastimil Babka
  2025-11-10 17:36   ` Lorenzo Stoakes
@ 2025-11-10 17:59   ` Lorenzo Stoakes
  2 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jann Horn,
	Pedro Falcato, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, linux-kernel, linux-fsdevel,
	linux-doc, linux-mm, linux-trace-kernel, linux-kselftest,
	Andrei Vagin

Hi Andrew,

OK, take 2 here :) Please apply this fix-patch, which both makes sparse happy
and avoids a clang compilation error on this commit.

Cheers, Lorenzo

----8<----
From 553fb3f0fc9f3c351bddf956b00d1dfaa2a32920 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 10 Nov 2025 17:35:11 +0000
Subject: [PATCH] fixup

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 699566c21ff7..a9b8f6205204 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -886,7 +886,7 @@ static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
 		vma_assert_locked(vma);

 	if (__vma_flag_atomic_valid(vma, bit))
-		set_bit(bit, &vma->__vm_flags);
+		set_bit(bit, &ACCESS_PRIVATE(vma, __vm_flags));
 }

 /*
@@ -899,7 +899,9 @@ static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
 static inline bool vma_flag_test_atomic(struct vm_area_struct *vma, int bit)
 {
 	if (__vma_flag_atomic_valid(vma, bit))
-		return test_bit(bit, &vma->__vm_flags);
+		return test_bit(bit, &vma->vm_flags);
+
+	return false;
 }

 static inline void vma_set_anonymous(struct vm_area_struct *vma)
--
2.51.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-11-10 17:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-07 16:11 [PATCH v3 0/8] introduce VM_MAYBE_GUARD and make it sticky Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 1/8] mm: introduce VM_MAYBE_GUARD and make visible in /proc/$pid/smaps Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 2/8] mm: add atomic VMA flags and set VM_MAYBE_GUARD as such Lorenzo Stoakes
2025-11-10 15:51   ` Vlastimil Babka
2025-11-10 17:34     ` Lorenzo Stoakes
2025-11-10 17:36   ` Lorenzo Stoakes
2025-11-10 17:49     ` Lorenzo Stoakes
2025-11-10 17:59   ` Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 3/8] mm: implement sticky VMA flags Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 4/8] mm: introduce copy-on-fork VMAs and make VM_MAYBE_GUARD one Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 5/8] mm: set the VM_MAYBE_GUARD flag on guard region install Lorenzo Stoakes
2025-11-10 16:17   ` Vlastimil Babka
2025-11-10 17:37     ` Lorenzo Stoakes
2025-11-10 17:43   ` Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 6/8] tools/testing/vma: add VMA sticky userland tests Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 7/8] tools/testing/selftests/mm: add MADV_COLLAPSE test case Lorenzo Stoakes
2025-11-07 16:11 ` [PATCH v3 8/8] tools/testing/selftests/mm: add smaps visibility guard region test Lorenzo Stoakes
