linux-mm.kvack.org archive mirror
* [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case
@ 2025-07-16 17:38 Lorenzo Stoakes
  2025-07-16 17:38 ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is
treated differently from every other VMA flag; it really doesn't make sense
to do this, so we start by making it consistent with everything else.

Next we place the madvise logic where it belongs - in mm/madvise.c. It
really makes no sense to abstract this elsewhere. In doing so, we go to
great lengths to explain very clearly the previously very confusing logic
as to what sealed mappings are impacted here.

In doing so, we fix an existing logical oversight - previously we permitted
an madvise() discard operation for a sealed, read-only MAP_PRIVATE
file-backed mapping.

However this is incorrect. To see why consider:

1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

At this point you have data that, once sealed, a user cannot alter, but a
discard operation can unrecoverably remove. This contradicts the semantics
of mseal(), so should not be permitted.
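
To make this concrete, a minimal userspace sketch of the above sequence
(illustrative only - no error handling, a memfd stands in for any regular
file, and __NR_mseal requires recent kernel headers as libc may not provide
a wrapper):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        int fd = memfd_create("example", 0);   /* stands in for any file */
        char *p;

        ftruncate(fd, psz);
        p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        p[0] = 'x';                     /* steps 1 + 2: the write CoWs the page into anon memory */
        mprotect(p, psz, PROT_READ);    /* step 3 */
        syscall(__NR_mseal, p, psz, 0); /* step 4: raw syscall, no libc wrapper assumed */

        /* Previously this discard succeeded and dropped the CoW'd data
         * despite the seal; with this series it fails with EPERM. */
        return madvise(p, psz, MADV_DONTNEED);
    }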

We then abstract out and explain the 'are there any gaps in this range
in the mm?' check being performed as a prerequisite to mseal being
performed.

Finally, we simplify the actual mseal logic which is really quite
straightforward.


v3:
* Propagated more tags, thanks everyone!
* Updated 5/5 to assign curr_start in a smarter way as per Liam. Adjust
  code to more sensibly handle already-sealed case at the same time.
* Updated 4/5 to not move range_contains_unmapped() for better diff.
* Renamed can_modify_vma() to vma_is_sealed() and inverted the logic - this
  is far clearer than the nebulous 'can modify VMA'.

v2:
* Propagated tags, thanks everyone!
* Updated can_madvise_modify() to a more logical order re: the checks
  performed, as per David.
* Replaced vma_is_anonymous() check (which was, in the original code, a
  vma->vm_file or vma->vm_ops check) with a vma->vm_flags & VM_SHARED
  check - to explicitly check for shared mappings vs private to preclude
  MAP_PRIVATE file-backed mappings, as per David.
* Made range_contains_unmapped() static and placed in mm/mseal.c to avoid
  encouraging any other internal users towards this rather silly pattern,
  as per Pedro and Liam.
https://lore.kernel.org/all/cover.1752586090.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/cover.1752497324.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (5):
  mm/mseal: always define VM_SEALED
  mm/mseal: update madvise() logic
  mm/mseal: small cleanups
  mm/mseal: Simplify and rename VMA gap check
  mm/mseal: rework mseal apply logic

 include/linux/mm.h                      |   6 +-
 mm/madvise.c                            |  63 +++++++++-
 mm/mprotect.c                           |   2 +-
 mm/mremap.c                             |   2 +-
 mm/mseal.c                              | 157 +++++-------------------
 mm/vma.c                                |   4 +-
 mm/vma.h                                |  27 +---
 tools/testing/selftests/mm/mseal_test.c |   3 +-
 tools/testing/vma/vma_internal.h        |   6 +-
 9 files changed, 107 insertions(+), 163 deletions(-)

--
2.50.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 1/5] mm/mseal: always define VM_SEALED
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
@ 2025-07-16 17:38 ` Lorenzo Stoakes
  2025-07-24 18:34   ` Jeff Xu
  2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

There is no reason to treat VM_SEALED in a special way: in every other case
in which a VMA flag is unavailable due to configuration, we simply assign
that flag to VM_NONE, so make VM_SEALED consistent with all other VMA
flags in this respect.

Additionally, use the next available bit for VM_SEALED, 42, rather than
arbitrarily putting it at 63 and update the declaration to match all other
VMA flags.
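
With the flag defined as VM_NONE when the architecture doesn't support it,
generic code can test it unconditionally, e.g. (an illustrative sketch, not
a hunk from this patch):

    if (vma->vm_flags & VM_SEALED)  /* always false on 32-bit, where VM_SEALED == VM_NONE */
        return -EPERM;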

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h               | 6 ++++--
 tools/testing/vma/vma_internal.h | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 805108d7bbc3..fbf2a1f7ffc6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -414,8 +414,10 @@ extern unsigned int kobjsize(const void *objp);
 #endif
 
 #ifdef CONFIG_64BIT
-/* VM is sealed, in vm_flags */
-#define VM_SEALED	_BITUL(63)
+#define VM_SEALED_BIT	42
+#define VM_SEALED	BIT(VM_SEALED_BIT)
+#else
+#define VM_SEALED	VM_NONE
 #endif
 
 /* Bits set in the VMA until the stack is in its final location */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 991022e9e0d3..0fe52fd6782b 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -108,8 +108,10 @@ extern unsigned long dac_mmap_min_addr;
 #define CAP_IPC_LOCK         14
 
 #ifdef CONFIG_64BIT
-/* VM is sealed, in vm_flags */
-#define VM_SEALED	_BITUL(63)
+#define VM_SEALED_BIT	42
+#define VM_SEALED	BIT(VM_SEALED_BIT)
+#else
+#define VM_SEALED	VM_NONE
 #endif
 
 #define FIRST_USER_ADDRESS	0UL
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
  2025-07-16 17:38 ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
@ 2025-07-16 17:38 ` Lorenzo Stoakes
  2025-07-24 18:39   ` Jeff Xu
                     ` (2 more replies)
  2025-07-16 17:38 ` [PATCH v3 3/5] mm/mseal: small cleanups Lorenzo Stoakes
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

The madvise() logic is inexplicably performed in mm/mseal.c - this ought
to be located in mm/madvise.c.

Additionally, can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), makes for very confusing logic.

Put a static function in mm/madvise.c instead - can_madvise_modify() -
that spells out exactly what's happening.  Also explicitly check for an
anon VMA.

Also add commentary to explain what's going on.

Essentially - we disallow discarding of data in mseal()'d mappings in
instances where the user couldn't otherwise write to that data.

Shared mappings are always backed, so no discard will actually truly
discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
mappings are the ones we are interested in.

We make a change to the logic here to correct a mistake - we must disallow
discard of read-only MAP_PRIVATE file-backed mappings, which previously we
were not.

The justification for this change is to account for the case where:

1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

If we were to now allow discard of this data, it would mean mseal() would
not prevent the unrecoverable discarding of data and would thus violate
the semantics of sealed VMAs.

Finally, update the mseal tests, which were asserting previously that a
read-only MAP_PRIVATE file-backed mapping could be discarded.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/madvise.c                            | 63 ++++++++++++++++++++++++-
 mm/mseal.c                              | 49 -------------------
 mm/vma.h                                |  7 ---
 tools/testing/selftests/mm/mseal_test.c |  3 +-
 4 files changed, 63 insertions(+), 59 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 2bf80989d5b6..dc3d8497b0f4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
 #include <linux/string.h>
 #include <linux/uio.h>
 #include <linux/ksm.h>
@@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 			       &guard_remove_walk_ops, NULL);
 }
 
+#ifdef CONFIG_64BIT
+/* Does the madvise operation result in discarding of mapped data? */
+static bool is_discard(int behavior)
+{
+	switch (behavior) {
+	case MADV_FREE:
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+	case MADV_REMOVE:
+	case MADV_DONTFORK:
+	case MADV_WIPEONFORK:
+	case MADV_GUARD_INSTALL:
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
+ * circumstances - discarding of data from read-only anonymous SEALED mappings.
+ *
+ * This is because users cannot trivially discard data from these VMAs, and may
+ * only do so via an appropriate madvise() call.
+ */
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	struct vm_area_struct *vma = madv_behavior->vma;
+
+	/* If the VMA isn't sealed we're good. */
+	if (can_modify_vma(vma))
+		return true;
+
+	/* For a sealed VMA, we only care about discard operations. */
+	if (!is_discard(madv_behavior->behavior))
+		return true;
+
+	/*
+	 * But shared mappings are fine, as dirty pages will be written to a
+	 * backing store regardless of discard.
+	 */
+	if (vma->vm_flags & VM_SHARED)
+		return true;
+
+	/* If the user could write to the mapping anyway, then this is fine. */
+	if ((vma->vm_flags & VM_WRITE) &&
+	    arch_vma_access_permitted(vma, /* write= */ true,
+			/* execute= */ false, /* foreign= */ false))
+		return true;
+
+	/* Otherwise, we are not permitted to perform this operation. */
+	return false;
+}
+#else
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+	return true;
+}
+#endif
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 	struct madvise_behavior_range *range = &madv_behavior->range;
 	int error;
 
-	if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
+	if (unlikely(!can_madvise_modify(madv_behavior)))
 		return -EPERM;
 
 	switch (behavior) {
diff --git a/mm/mseal.c b/mm/mseal.c
index c27197ac04e8..1308e88ab184 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -11,7 +11,6 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
-#include <linux/mmu_context.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include "internal.h"
@@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
 	vm_flags_set(vma, VM_SEALED);
 }
 
-static bool is_madv_discard(int behavior)
-{
-	switch (behavior) {
-	case MADV_FREE:
-	case MADV_DONTNEED:
-	case MADV_DONTNEED_LOCKED:
-	case MADV_REMOVE:
-	case MADV_DONTFORK:
-	case MADV_WIPEONFORK:
-	case MADV_GUARD_INSTALL:
-		return true;
-	}
-
-	return false;
-}
-
-static bool is_ro_anon(struct vm_area_struct *vma)
-{
-	/* check anonymous mapping. */
-	if (vma->vm_file || vma->vm_flags & VM_SHARED)
-		return false;
-
-	/*
-	 * check for non-writable:
-	 * PROT=RO or PKRU is not writeable.
-	 */
-	if (!(vma->vm_flags & VM_WRITE) ||
-		!arch_vma_access_permitted(vma, true, false, false))
-		return true;
-
-	return false;
-}
-
-/*
- * Check if a vma is allowed to be modified by madvise.
- */
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	if (!is_madv_discard(behavior))
-		return true;
-
-	if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
-		return false;
-
-	/* Allow by default. */
-	return true;
-}
-
 static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
 		unsigned long end, vm_flags_t newflags)
diff --git a/mm/vma.h b/mm/vma.h
index acdcc515c459..85db5e880fcc 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }
 
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
-
 #else
 
 static inline bool can_modify_vma(struct vm_area_struct *vma)
@@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
 	return true;
 }
 
-static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
-	return true;
-}
-
 #endif
 
 #if defined(CONFIG_STACK_GROWSUP)
diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
index 005f29c86484..34c042da4de2 100644
--- a/tools/testing/selftests/mm/mseal_test.c
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
 	unsigned long size = 4 * page_size;
 	int ret;
 	int fd;
-	unsigned long mapflags = MAP_PRIVATE;
 
 	fd = memfd_create("test", 0);
 	FAIL_TEST_IF_FALSE(fd > 0);
@@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
 	ret = fallocate(fd, 0, 0, size);
 	FAIL_TEST_IF_FALSE(!ret);
 
-	ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
+	ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
 	FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
 
 	if (seal) {
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 3/5] mm/mseal: small cleanups
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
  2025-07-16 17:38 ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
  2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
@ 2025-07-16 17:38 ` Lorenzo Stoakes
  2025-07-24 18:40   ` Jeff Xu
  2025-07-16 17:38 ` [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check Lorenzo Stoakes
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

Drop the wholly unnecessary set_vma_sealed() helper, which is used only
once, and place VMA_ITERATOR() declarations in the correct place.

Retain vma_is_sealed(), and use it instead of the confusingly named
can_modify_vma(), so it's abundantly clear what's being tested, rather than
a nebulous sense of 'can the VMA be modified'.

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/madvise.c  |  2 +-
 mm/mprotect.c |  2 +-
 mm/mremap.c   |  2 +-
 mm/mseal.c    |  9 +--------
 mm/vma.c      |  4 ++--
 mm/vma.h      | 20 ++------------------
 6 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index dc3d8497b0f4..da6e0e7c00b5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1286,7 +1286,7 @@ static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
 	struct vm_area_struct *vma = madv_behavior->vma;
 
 	/* If the VMA isn't sealed we're good. */
-	if (can_modify_vma(vma))
+	if (!vma_is_sealed(vma))
 		return true;
 
 	/* For a sealed VMA, we only care about discard operations. */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88709c01177b..807939177065 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -605,7 +605,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	unsigned long charged = 0;
 	int error;
 
-	if (!can_modify_vma(vma))
+	if (vma_is_sealed(vma))
 		return -EPERM;
 
 	if (newflags == oldflags) {
diff --git a/mm/mremap.c b/mm/mremap.c
index 5b7fe8f36074..8e93eca86721 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1649,7 +1649,7 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
 		return -EFAULT;
 
 	/* If mseal()'d, mremap() is prohibited. */
-	if (!can_modify_vma(vma))
+	if (vma_is_sealed(vma))
 		return -EPERM;
 
 	/* Align to hugetlb page size, if required. */
diff --git a/mm/mseal.c b/mm/mseal.c
index 1308e88ab184..adbcc65e9660 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -15,11 +15,6 @@
 #include <linux/sched.h>
 #include "internal.h"
 
-static inline void set_vma_sealed(struct vm_area_struct *vma)
-{
-	vm_flags_set(vma, VM_SEALED);
-}
-
 static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
 		unsigned long end, vm_flags_t newflags)
@@ -36,7 +31,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	set_vma_sealed(vma);
+	vm_flags_set(vma, VM_SEALED);
 out:
 	*prev = vma;
 	return ret;
@@ -53,7 +48,6 @@ static int check_mm_seal(unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *vma;
 	unsigned long nstart = start;
-
 	VMA_ITERATOR(vmi, current->mm, start);
 
 	/* going through each vma to check. */
@@ -78,7 +72,6 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
 {
 	unsigned long nstart;
 	struct vm_area_struct *vma, *prev;
-
 	VMA_ITERATOR(vmi, current->mm, start);
 
 	vma = vma_iter_load(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index fc502b741dcf..75fd2759964b 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1351,7 +1351,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		}
 
 		/* Don't bother splitting the VMA if we can't unmap it anyway */
-		if (!can_modify_vma(vms->vma)) {
+		if (vma_is_sealed(vms->vma)) {
 			error = -EPERM;
 			goto start_split_failed;
 		}
@@ -1371,7 +1371,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	for_each_vma_range(*(vms->vmi), next, vms->end) {
 		long nrpages;
 
-		if (!can_modify_vma(next)) {
+		if (vma_is_sealed(next)) {
 			error = -EPERM;
 			goto modify_vma_failed;
 		}
diff --git a/mm/vma.h b/mm/vma.h
index 85db5e880fcc..b123a9cdedb0 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -559,31 +559,15 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
 }
 
 #ifdef CONFIG_64BIT
-
 static inline bool vma_is_sealed(struct vm_area_struct *vma)
 {
 	return (vma->vm_flags & VM_SEALED);
 }
-
-/*
- * check if a vma is sealed for modification.
- * return true, if modification is allowed.
- */
-static inline bool can_modify_vma(struct vm_area_struct *vma)
-{
-	if (unlikely(vma_is_sealed(vma)))
-		return false;
-
-	return true;
-}
-
 #else
-
-static inline bool can_modify_vma(struct vm_area_struct *vma)
+static inline bool vma_is_sealed(struct vm_area_struct *vma)
 {
-	return true;
+	return false;
 }
-
 #endif
 
 #if defined(CONFIG_STACK_GROWSUP)
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-07-16 17:38 ` [PATCH v3 3/5] mm/mseal: small cleanups Lorenzo Stoakes
@ 2025-07-16 17:38 ` Lorenzo Stoakes
  2025-07-24 18:40   ` Jeff Xu
  2025-07-16 17:38 ` [PATCH v3 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

The check_mm_seal() function is doing something general - checking whether
a range contains only VMAs (or rather that it does NOT contain any
unmapped regions).

So rename this function to range_contains_unmapped().

Additionally, simplify the logic: we are simply checking whether any VMA in
the range starts beyond the previous vma->vm_end, or whether the final
vma->vm_end falls short of the end parameter.

This check is rather dubious, so it is sensible to keep it local to
mm/mseal.c as at a later stage it may be removed, and we don't want any
other mm code to perform such a check.

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/mseal.c | 36 +++++++++++-------------------------
 1 file changed, 11 insertions(+), 25 deletions(-)

diff --git a/mm/mseal.c b/mm/mseal.c
index adbcc65e9660..61c07b1369cb 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -37,32 +37,22 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return ret;
 }
 
-/*
- * Check for do_mseal:
- * 1> start is part of a valid vma.
- * 2> end is part of a valid vma.
- * 3> No gap (unallocated address) between start and end.
- * 4> map is sealable.
- */
-static int check_mm_seal(unsigned long start, unsigned long end)
+/* Does the [start, end) range contain any unmapped memory? */
+static bool range_contains_unmapped(struct mm_struct *mm,
+		unsigned long start, unsigned long end)
 {
 	struct vm_area_struct *vma;
-	unsigned long nstart = start;
+	unsigned long prev_end = start;
 	VMA_ITERATOR(vmi, current->mm, start);
 
-	/* going through each vma to check. */
 	for_each_vma_range(vmi, vma, end) {
-		if (vma->vm_start > nstart)
-			/* unallocated memory found. */
-			return -ENOMEM;
-
-		if (vma->vm_end >= end)
-			return 0;
+		if (vma->vm_start > prev_end)
+			return true;
 
-		nstart = vma->vm_end;
+		prev_end = vma->vm_end;
 	}
 
-	return -ENOMEM;
+	return prev_end < end;
 }
 
 /*
@@ -184,14 +174,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	/*
-	 * First pass, this helps to avoid
-	 * partial sealing in case of error in input address range,
-	 * e.g. ENOMEM error.
-	 */
-	ret = check_mm_seal(start, end);
-	if (ret)
+	if (range_contains_unmapped(mm, start, end)) {
+		ret = -ENOMEM;
 		goto out;
+	}
 
 	/*
 	 * Second pass, this should success, unless there are errors
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 5/5] mm/mseal: rework mseal apply logic
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-07-16 17:38 ` [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check Lorenzo Stoakes
@ 2025-07-16 17:38 ` Lorenzo Stoakes
  2025-07-24 18:41   ` Jeff Xu
  2025-07-24 18:32 ` [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Jeff Xu
  2025-07-25  6:40 ` Lorenzo Stoakes
  6 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 17:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

The logic can be simplified - firstly by renaming the inconsistently named
apply_mm_seal() to mseal_apply().

We then wrap mseal_fixup() into the main loop as the logic is simple enough
to not require it; equally, it isn't a hugely pleasant pattern in mprotect()
etc., so it's not something we want to perpetuate.

We eliminate the need for invoking vma_iter_end() on each loop by directly
determining if the VMA was merged - the only thing we need to concern
ourselves with is whether the start/end of the (gapless) range are offset
into VMAs.

This refactoring also avoids the rather horrid 'pass pointer to prev
around' pattern used in mprotect() et al.

No functional change intended.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/mseal.c | 67 ++++++++++++++++--------------------------------------
 1 file changed, 20 insertions(+), 47 deletions(-)

diff --git a/mm/mseal.c b/mm/mseal.c
index 61c07b1369cb..0ab12e09792a 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -15,28 +15,6 @@
 #include <linux/sched.h>
 #include "internal.h"
 
-static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		struct vm_area_struct **prev, unsigned long start,
-		unsigned long end, vm_flags_t newflags)
-{
-	int ret = 0;
-	vm_flags_t oldflags = vma->vm_flags;
-
-	if (newflags == oldflags)
-		goto out;
-
-	vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags);
-	if (IS_ERR(vma)) {
-		ret = PTR_ERR(vma);
-		goto out;
-	}
-
-	vm_flags_set(vma, VM_SEALED);
-out:
-	*prev = vma;
-	return ret;
-}
-
 /* Does the [start, end) range contain any unmapped memory? */
 static bool range_contains_unmapped(struct mm_struct *mm,
 		unsigned long start, unsigned long end)
@@ -55,38 +33,33 @@ static bool range_contains_unmapped(struct mm_struct *mm,
 	return prev_end < end;
 }
 
-/*
- * Apply sealing.
- */
-static int apply_mm_seal(unsigned long start, unsigned long end)
+static int mseal_apply(struct mm_struct *mm,
+		unsigned long start, unsigned long end)
 {
-	unsigned long nstart;
 	struct vm_area_struct *vma, *prev;
-	VMA_ITERATOR(vmi, current->mm, start);
+	unsigned long curr_start = start;
+	VMA_ITERATOR(vmi, mm, start);
 
+	/* We know there are no gaps so this will be non-NULL. */
 	vma = vma_iter_load(&vmi);
-	/*
-	 * Note: check_mm_seal should already checked ENOMEM case.
-	 * so vma should not be null, same for the other ENOMEM cases.
-	 */
 	prev = vma_prev(&vmi);
 	if (start > vma->vm_start)
 		prev = vma;
 
-	nstart = start;
 	for_each_vma_range(vmi, vma, end) {
-		int error;
-		unsigned long tmp;
-		vm_flags_t newflags;
-
-		newflags = vma->vm_flags | VM_SEALED;
-		tmp = vma->vm_end;
-		if (tmp > end)
-			tmp = end;
-		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
-		if (error)
-			return error;
-		nstart = vma_iter_end(&vmi);
+		unsigned long curr_end = MIN(vma->vm_end, end);
+
+		if (!(vma->vm_flags & VM_SEALED)) {
+			vma = vma_modify_flags(&vmi, prev, vma,
+					curr_start, curr_end,
+					vma->vm_flags | VM_SEALED);
+			if (IS_ERR(vma))
+				return PTR_ERR(vma);
+			vm_flags_set(vma, VM_SEALED);
+		}
+
+		prev = vma;
+		curr_start = curr_end;
 	}
 
 	return 0;
@@ -185,10 +158,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
 	 * reaching the max supported VMAs, however, those cases shall
 	 * be rare.
 	 */
-	ret = apply_mm_seal(start, end);
+	ret = mseal_apply(mm, start, end);
 
 out:
-	mmap_write_unlock(current->mm);
+	mmap_write_unlock(mm);
 	return ret;
 }
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-07-16 17:38 ` [PATCH v3 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
@ 2025-07-24 18:32 ` Jeff Xu
  2025-07-24 19:10   ` Lorenzo Stoakes
  2025-07-25  6:40 ` Lorenzo Stoakes
  6 siblings, 1 reply; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

Thanks for including me in this thread. I've just returned from
vacation and am catching up on my emails. I'll respond to each patch
separately in the following emails.

Could you consider adding mm/mseal.c to the HARDENING section of
MAINTAINERS? Please include Kees and linux-hardening in future emails
about mseal - Kees has been helping me with mseal since the beginning.
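
For instance, something along these lines (a sketch only - the section
title is quoted from memory and the surrounding entries are elided):

    KERNEL HARDENING (not covered by other areas)
    ...
    F:	mm/mseal.c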

Thanks and regards,
-Jeff

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is
> treated differently from every other VMA flag; it really doesn't make sense
> to do this, so we start by making it consistent with everything else.
>
> Next we place the madvise logic where it belongs - in mm/madvise.c. It
> really makes no sense to abstract this elsewhere. In doing so, we go to
> great lengths to explain very clearly the previously very confusing logic
> as to what sealed mappings are impacted here.
>
> In doing so, we fix an existing logical oversight - previously we permitted
> an madvise() discard operation for a sealed, read-only MAP_PRIVATE
> file-backed mapping.
>
> However this is incorrect. To see why consider:
>
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> At this point you have data that, once sealed, a user cannot alter, but a
> discard operation can unrecoverably remove. This contradicts the semantics
> of mseal(), so should not be permitted.
>
> We then abstract out and explain the 'are there any gaps in this range
> in the mm?' check being performed as a prerequisite to mseal being
> performed.
>
> Finally, we simplify the actual mseal logic which is really quite
> straightforward.
>
>
> v3:
> * Propagated more tags, thanks everyone!
> * Updated 5/5 to assign curr_start in a smarter way as per Liam. Adjust
>   code to more sensibly handle already-sealed case at the same time.
> * Updated 4/5 to not move range_contains_unmapped() for better diff.
> * Renamed can_modify_vma() to vma_is_sealed() and inverted the logic - this
>   is far clearer than the nebulous 'can modify VMA'.
>
> v2:
> * Propagated tags, thanks everyone!
> * Updated can_madvise_modify() to a more logical order re: the checks
>   performed, as per David.
> * Replaced vma_is_anonymous() check (which was, in the original code, a
>   vma->vm_file or vma->vm_ops check) with a vma->vm_flags & VM_SHARED
>   check - to explicitly check for shared mappings vs private to preclude
>   MAP_PRIVATE file-backed mappings, as per David.
> * Made range_contains_unmapped() static and placed in mm/mseal.c to avoid
>   encouraging any other internal users towards this rather silly pattern,
>   as per Pedro and Liam.
> https://lore.kernel.org/all/cover.1752586090.git.lorenzo.stoakes@oracle.com/
>
> v1:
> https://lore.kernel.org/all/cover.1752497324.git.lorenzo.stoakes@oracle.com/
>
> Lorenzo Stoakes (5):
>   mm/mseal: always define VM_SEALED
>   mm/mseal: update madvise() logic
>   mm/mseal: small cleanups
>   mm/mseal: Simplify and rename VMA gap check
>   mm/mseal: rework mseal apply logic
>
>  include/linux/mm.h                      |   6 +-
>  mm/madvise.c                            |  63 +++++++++-
>  mm/mprotect.c                           |   2 +-
>  mm/mremap.c                             |   2 +-
>  mm/mseal.c                              | 157 +++++-------------------
>  mm/vma.c                                |   4 +-
>  mm/vma.h                                |  27 +---
>  tools/testing/selftests/mm/mseal_test.c |   3 +-
>  tools/testing/vma/vma_internal.h        |   6 +-
>  9 files changed, 107 insertions(+), 163 deletions(-)
>
> --
> 2.50.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/5] mm/mseal: always define VM_SEALED
  2025-07-16 17:38 ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
@ 2025-07-24 18:34   ` Jeff Xu
  2025-07-24 18:44     ` Lorenzo Stoakes
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:34 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> There is no reason to treat VM_SEALED in a special way: in every other case
> in which a VMA flag is unavailable due to configuration, we simply assign
> that flag to VM_NONE, so make VM_SEALED consistent with all other VMA
> flags in this respect.
>
Originally, I wanted to avoid using VM_NONE for VM_SEALED to catch
compiler errors in 32-bit builds. This would serve as a safeguard
against inadvertently using VM_SEALED in code paths shared between
32-bit and 64-bit architectures.

Take an example of show_smap_vma_flags [1] :

#ifdef CONFIG_64BIT
[ilog2(VM_SEALED)] = "sl",
#endif

If a developer forgets to add #ifdef CONFIG_64BIT, the build will fail
for 32-bit systems. With VM_SEALED redefined as VM_NONE, the problem
will go unnoticed.

This coding practice is more defensive and helps catch errors early
on. It seems that you're working on introducing VM_SEALED for 32-bit
systems. If that happens, we won't need this safeguard anymore. But
until then, is it OK to keep it in place?  That said, if VM_SEALED
support for 32-bit is coming really soon, we can merge this change;
however, perhaps you could update the description to explain why we're
removing this safeguard, i.e. that 32-bit support will soon be in
place.

Link: https://elixir.bootlin.com/linux/v6.15.7/source/fs/proc/task_mmu.c#L1001
[1]

Thanks and regards,
-Jeff
> Additionally, use the next available bit for VM_SEALED, 42, rather than
> arbitrarily putting it at 63 and update the declaration to match all other
> VMA flags.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h               | 6 ++++--
>  tools/testing/vma/vma_internal.h | 6 ++++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 805108d7bbc3..fbf2a1f7ffc6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -414,8 +414,10 @@ extern unsigned int kobjsize(const void *objp);
>  #endif
>
>  #ifdef CONFIG_64BIT
> -/* VM is sealed, in vm_flags */
> -#define VM_SEALED      _BITUL(63)
> +#define VM_SEALED_BIT  42
> +#define VM_SEALED      BIT(VM_SEALED_BIT)
> +#else
> +#define VM_SEALED      VM_NONE
>  #endif
>
>  /* Bits set in the VMA until the stack is in its final location */
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 991022e9e0d3..0fe52fd6782b 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -108,8 +108,10 @@ extern unsigned long dac_mmap_min_addr;
>  #define CAP_IPC_LOCK         14
>
>  #ifdef CONFIG_64BIT
> -/* VM is sealed, in vm_flags */
> -#define VM_SEALED      _BITUL(63)
> +#define VM_SEALED_BIT  42
> +#define VM_SEALED      BIT(VM_SEALED_BIT)
> +#else
> +#define VM_SEALED      VM_NONE
>  #endif
>
>  #define FIRST_USER_ADDRESS     0UL
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
@ 2025-07-24 18:39   ` Jeff Xu
  2025-07-24 18:56     ` David Hildenbrand
  2025-07-24 19:07     ` Lorenzo Stoakes
  2025-07-24 21:15   ` Kees Cook
  2025-07-24 22:12   ` David Hildenbrand
  2 siblings, 2 replies; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

This change has two parts: a non-functional refactoring that moves a
function from mseal.c to madvise.c, and a functional change to the
behavior of madvise under mseal. Would you consider splitting the
change into two patches? That seems to be common practice when
supplying patches on lkml.

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
>
This is the non-functional refactoring part - moving a function from
mseal.c to madvise.c.

There are two main reasons why I initially wanted to keep all
mseal-related functions in mseal.c:

1. Keeping all mseal-related logic in mseal.c makes it easier for
developers to find all the areas impacted by mseal.
2. mseal is not supported on 32-bit, and mseal.c is excluded from 32-bit
builds (see the Makefile). This would prevent accidentally using mseal
in code paths shared between 32-bit and 64-bit architectures. It also
avoids adding #ifdef in the .c file, which is recommended by the mm
coding standard (I have received comments about avoiding #ifdef in .c
files in the past).

Point 2 can go away if 32-bit mseal support is coming soon, as per my
other comments on patch 1/5.

Point 1 remains. However, I want to focus on the functional change
part of this patch, rather than the other aspects.

> Additionally, can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), makes for very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
>
> Also add commentary to explain what's going on.
>
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
>
> Shared mappings are always backed, so no discard will actually truly
> discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> mappings are the ones we are interested in.
>
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
>
> The justification for this change is to account for the case where:
>
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> If we were to now allow discard of this data, it would mean mseal() would
> not prevent the unrecoverable discarding of data and would thus violate
> the semantics of sealed VMAs.
>
This is the functional change and the most important area that I would
like to discuss in this patch series.

Since Jann Horn first highlighted [1] the problematic behavior of
destructive madvise for anonymous mappings during the initial discussions
of mseal(), the proper solution has been open to discussion and
exploration. Linus Torvalds has expressed interest [2] in fixing this
within madvise itself, without requiring mseal; I have copied his comment
here for ease of reference:

“Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.”

After mseal was merged, Pedro Falcato raised a valid concern regarding
file-backed private mappings. The issue is that these mappings can be
written to, and subsequently changed to read-only, which is precisely
the problem this patch aims to fix. I attempted to address this in
[3]. However, that patch was rejected due to the introduction of a new
vm_flags bit, which was a valid rejection as it wasn't the ideal solution.
Nevertheless, it sparked an interesting discussion, with me raising the
point that userspace might use this feature to free up RAM for
file-backed private mappings that are never written to, and input on
this topic from Vlastimil Babka [4] concerning MADV_COLD/MADV_PAGEOUT.

A detail about userspace calling those madvise operations on file-backed
private mappings: previously, I added extra logging in the kernel and
observed many instances of those calls during Android phone startup,
although I haven't delved into why user space makes them, e.g. whether
it is an individual app or the Android platform doing so.

Incidentally, while recently studying the selinux code, particularly the
execmod permission [5], I learned that selinux utilizes vma->anon_vma
to identify file-backed mappings that have been written to.  Jann
pointed out to me that the kernel creates a COW mapping when a private
file-backed mapping is written to. Now I'm wondering if this could be
the answer to our problem. We could try having destructive madvise
check vma->anon_vma and reject the call if it is set. I haven't had a
chance to test this theory yet, though.
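
A very rough sketch of the idea, just to make the shape concrete (untested,
and not something this series implements):

    /*
     * A written-to MAP_PRIVATE file-backed VMA will have an anon_vma
     * attached, so a discard check could treat it like the read-only
     * anonymous case.  Note that anon_vma stays attached even after the
     * CoW'd pages are gone again, so this may be conservative.
     */
    static bool vma_has_private_file_cow(struct vm_area_struct *vma)
    {
        return vma->vm_file && !(vma->vm_flags & VM_SHARED) &&
               vma->anon_vma;
    }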

To summarize all the discussion points so far:
1. It's questionable for madvise to allow destructive operations on
read-only anonymous mappings, regardless of mseal state.
2. We could potentially fix point 1 within madvise itself, without
involving mseal, as Linus desires.
3. Android userspace uses destructive madvise to free up RAM, but I
need to take a closer look at the patterns and usage to understand why
they do that.
4. We could ask applications to switch to non-destructive madvise,
like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
switch the kernel to use non-destructive madvise implicitly for
destructive madvise in suitable situations.
5. We could investigate further based on vma->anon_vma, as sketched above.

Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[1]
Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
[2]
Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
[3]
Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
[4]
Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
[5]

> Finally, update the mseal tests, which were asserting previously that a
> read-only MAP_PRIVATE file-backed mapping could be discarded.
>
The test you are updating is not intended for the scenario this patch
aims to fix, i.e.:
1. A MAP_PRIVATE R/W file-backed mapping is established.
2. The mapping is written to, which backs it with anonymous memory.
3. The mapping is mprotect()'d read-only.
4. The mapping is mseal()'d.

The test case doesn't include steps 1, 2 and 3 - is it possible to
keep the existing one and create a new test case? But I don't think
that's the main discussion point. We can revisit test cases once we've
fully discussed the design.
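
For what it's worth, a rough sketch of what such a new test could look like
(untested; the harness macros and sys_mseal() helper names are quoted from
mseal_test.c from memory and may need adjusting):

    static void test_seal_discard_written_private_filebacked(bool seal)
    {
        unsigned long page_size = getpagesize();
        unsigned long size = 4 * page_size;
        void *ptr;
        int ret;
        int fd;

        fd = memfd_create("test", 0);
        FAIL_TEST_IF_FALSE(fd > 0);

        ret = fallocate(fd, 0, 0, size);
        FAIL_TEST_IF_FALSE(!ret);

        /* 1. MAP_PRIVATE R/W file-backed mapping. */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);

        /* 2. Write to it, so it is backed by anonymous (CoW) memory. */
        memset(ptr, 1, size);

        /* 3. Make it read-only. */
        ret = mprotect(ptr, size, PROT_READ);
        FAIL_TEST_IF_FALSE(!ret);

        /* 4. Seal it (when testing the sealed case). */
        if (seal) {
            ret = sys_mseal(ptr, size);
            FAIL_TEST_IF_FALSE(!ret);
        }

        /* With the fix, discard must fail (EPERM) once sealed. */
        ret = madvise(ptr, size, MADV_DONTNEED);
        if (seal)
            FAIL_TEST_IF_FALSE(ret < 0);
        else
            FAIL_TEST_IF_FALSE(!ret);

        REPORT_TEST_PASS();
    }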

Thanks and regards,
-Jeff
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/madvise.c                            | 63 ++++++++++++++++++++++++-
>  mm/mseal.c                              | 49 -------------------
>  mm/vma.h                                |  7 ---
>  tools/testing/selftests/mm/mseal_test.c |  3 +-
>  4 files changed, 63 insertions(+), 59 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 2bf80989d5b6..dc3d8497b0f4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched.h>
>  #include <linux/sched/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
>  #include <linux/string.h>
>  #include <linux/uio.h>
>  #include <linux/ksm.h>
> @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
>                                &guard_remove_walk_ops, NULL);
>  }
>
> +#ifdef CONFIG_64BIT
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> +       switch (behavior) {
> +       case MADV_FREE:
> +       case MADV_DONTNEED:
> +       case MADV_DONTNEED_LOCKED:
> +       case MADV_REMOVE:
> +       case MADV_DONTFORK:
> +       case MADV_WIPEONFORK:
> +       case MADV_GUARD_INSTALL:
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivially discard data from these VMAs, and may
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       struct vm_area_struct *vma = madv_behavior->vma;
> +
> +       /* If the VMA isn't sealed we're good. */
> +       if (can_modify_vma(vma))
> +               return true;
> +
> +       /* For a sealed VMA, we only care about discard operations. */
> +       if (!is_discard(madv_behavior->behavior))
> +               return true;
> +
> +       /*
> +        * But shared mappings are fine, as dirty pages will be written to a
> +        * backing store regardless of discard.
> +        */
> +       if (vma->vm_flags & VM_SHARED)
> +               return true;
> +
> +       /* If the user could write to the mapping anyway, then this is fine. */
> +       if ((vma->vm_flags & VM_WRITE) &&
> +           arch_vma_access_permitted(vma, /* write= */ true,
> +                       /* execute= */ false, /* foreign= */ false))
> +               return true;
> +
> +       /* Otherwise, we are not permitted to perform this operation. */
> +       return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> +       return true;
> +}
> +#endif
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>         struct madvise_behavior_range *range = &madv_behavior->range;
>         int error;
>
> -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> +       if (unlikely(!can_madvise_modify(madv_behavior)))
>                 return -EPERM;
>
>         switch (behavior) {
> diff --git a/mm/mseal.c b/mm/mseal.c
> index c27197ac04e8..1308e88ab184 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -11,7 +11,6 @@
>  #include <linux/mman.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> -#include <linux/mmu_context.h>
>  #include <linux/syscalls.h>
>  #include <linux/sched.h>
>  #include "internal.h"
> @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
>         vm_flags_set(vma, VM_SEALED);
>  }
>
> -static bool is_madv_discard(int behavior)
> -{
> -       switch (behavior) {
> -       case MADV_FREE:
> -       case MADV_DONTNEED:
> -       case MADV_DONTNEED_LOCKED:
> -       case MADV_REMOVE:
> -       case MADV_DONTFORK:
> -       case MADV_WIPEONFORK:
> -       case MADV_GUARD_INSTALL:
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
> -static bool is_ro_anon(struct vm_area_struct *vma)
> -{
> -       /* check anonymous mapping. */
> -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> -               return false;
> -
> -       /*
> -        * check for non-writable:
> -        * PROT=RO or PKRU is not writeable.
> -        */
> -       if (!(vma->vm_flags & VM_WRITE) ||
> -               !arch_vma_access_permitted(vma, true, false, false))
> -               return true;
> -
> -       return false;
> -}
> -
> -/*
> - * Check if a vma is allowed to be modified by madvise.
> - */
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       if (!is_madv_discard(behavior))
> -               return true;
> -
> -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> -               return false;
> -
> -       /* Allow by default. */
> -       return true;
> -}
> -
>  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 struct vm_area_struct **prev, unsigned long start,
>                 unsigned long end, vm_flags_t newflags)
> diff --git a/mm/vma.h b/mm/vma.h
> index acdcc515c459..85db5e880fcc 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> -
>  #else
>
>  static inline bool can_modify_vma(struct vm_area_struct *vma)
> @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
>         return true;
>  }
>
> -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> -       return true;
> -}
> -
>  #endif
>
>  #if defined(CONFIG_STACK_GROWSUP)
> diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> index 005f29c86484..34c042da4de2 100644
> --- a/tools/testing/selftests/mm/mseal_test.c
> +++ b/tools/testing/selftests/mm/mseal_test.c
> @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         unsigned long size = 4 * page_size;
>         int ret;
>         int fd;
> -       unsigned long mapflags = MAP_PRIVATE;
>
>         fd = memfd_create("test", 0);
>         FAIL_TEST_IF_FALSE(fd > 0);
> @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
>         ret = fallocate(fd, 0, 0, size);
>         FAIL_TEST_IF_FALSE(!ret);
>
> -       ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> +       ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
>         FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
>
>         if (seal) {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/5] mm/mseal: small cleanups
  2025-07-16 17:38 ` [PATCH v3 3/5] mm/mseal: small cleanups Lorenzo Stoakes
@ 2025-07-24 18:40   ` Jeff Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Drop the wholly unnecessary set_vma_sealed() helper, which is used only
> once, and place VMA_ITERATOR() declarations in the correct place.
>
> Retain vma_is_sealed(), and use it instead of the confusingly named
> can_modify_vma(), so it's abundantly clear what's being tested, rather than
> a nebulous sense of 'can the VMA be modified'.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>

Thanks and regards
-Jeff
> ---
>  mm/madvise.c  |  2 +-
>  mm/mprotect.c |  2 +-
>  mm/mremap.c   |  2 +-
>  mm/mseal.c    |  9 +--------
>  mm/vma.c      |  4 ++--
>  mm/vma.h      | 20 ++------------------
>  6 files changed, 8 insertions(+), 31 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc3d8497b0f4..da6e0e7c00b5 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1286,7 +1286,7 @@ static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
>         struct vm_area_struct *vma = madv_behavior->vma;
>
>         /* If the VMA isn't sealed we're good. */
> -       if (can_modify_vma(vma))
> +       if (!vma_is_sealed(vma))
>                 return true;
>
>         /* For a sealed VMA, we only care about discard operations. */
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88709c01177b..807939177065 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -605,7 +605,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
>         unsigned long charged = 0;
>         int error;
>
> -       if (!can_modify_vma(vma))
> +       if (vma_is_sealed(vma))
>                 return -EPERM;
>
>         if (newflags == oldflags) {
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 5b7fe8f36074..8e93eca86721 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1649,7 +1649,7 @@ static int check_prep_vma(struct vma_remap_struct *vrm)
>                 return -EFAULT;
>
>         /* If mseal()'d, mremap() is prohibited. */
> -       if (!can_modify_vma(vma))
> +       if (vma_is_sealed(vma))
>                 return -EPERM;
>
>         /* Align to hugetlb page size, if required. */
> diff --git a/mm/mseal.c b/mm/mseal.c
> index 1308e88ab184..adbcc65e9660 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -15,11 +15,6 @@
>  #include <linux/sched.h>
>  #include "internal.h"
>
> -static inline void set_vma_sealed(struct vm_area_struct *vma)
> -{
> -       vm_flags_set(vma, VM_SEALED);
> -}
> -
>  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 struct vm_area_struct **prev, unsigned long start,
>                 unsigned long end, vm_flags_t newflags)
> @@ -36,7 +31,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>                 goto out;
>         }
>
> -       set_vma_sealed(vma);
> +       vm_flags_set(vma, VM_SEALED);
>  out:
>         *prev = vma;
>         return ret;
> @@ -53,7 +48,6 @@ static int check_mm_seal(unsigned long start, unsigned long end)
>  {
>         struct vm_area_struct *vma;
>         unsigned long nstart = start;
> -
>         VMA_ITERATOR(vmi, current->mm, start);
>
>         /* going through each vma to check. */
> @@ -78,7 +72,6 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
>  {
>         unsigned long nstart;
>         struct vm_area_struct *vma, *prev;
> -
>         VMA_ITERATOR(vmi, current->mm, start);
>
>         vma = vma_iter_load(&vmi);
> diff --git a/mm/vma.c b/mm/vma.c
> index fc502b741dcf..75fd2759964b 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1351,7 +1351,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>                 }
>
>                 /* Don't bother splitting the VMA if we can't unmap it anyway */
> -               if (!can_modify_vma(vms->vma)) {
> +               if (vma_is_sealed(vms->vma)) {
>                         error = -EPERM;
>                         goto start_split_failed;
>                 }
> @@ -1371,7 +1371,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>         for_each_vma_range(*(vms->vmi), next, vms->end) {
>                 long nrpages;
>
> -               if (!can_modify_vma(next)) {
> +               if (vma_is_sealed(next)) {
>                         error = -EPERM;
>                         goto modify_vma_failed;
>                 }
> diff --git a/mm/vma.h b/mm/vma.h
> index 85db5e880fcc..b123a9cdedb0 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -559,31 +559,15 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
>  }
>
>  #ifdef CONFIG_64BIT
> -
>  static inline bool vma_is_sealed(struct vm_area_struct *vma)
>  {
>         return (vma->vm_flags & VM_SEALED);
>  }
> -
> -/*
> - * check if a vma is sealed for modification.
> - * return true, if modification is allowed.
> - */
> -static inline bool can_modify_vma(struct vm_area_struct *vma)
> -{
> -       if (unlikely(vma_is_sealed(vma)))
> -               return false;
> -
> -       return true;
> -}
> -
>  #else
> -
> -static inline bool can_modify_vma(struct vm_area_struct *vma)
> +static inline bool vma_is_sealed(struct vm_area_struct *vma)
>  {
> -       return true;
> +       return false;
>  }
> -
>  #endif
>
>  #if defined(CONFIG_STACK_GROWSUP)
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check
  2025-07-16 17:38 ` [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check Lorenzo Stoakes
@ 2025-07-24 18:40   ` Jeff Xu
  2025-07-25  5:33     ` Lorenzo Stoakes
  0 siblings, 1 reply; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The check_mm_seal() function is doing something general - checking whether
> a range contains only VMAs (or rather that it does NOT contain any
> unmapped regions).
>
> So rename this function to range_contains_unmapped().
>
> Additionally simplify the logic, we are simply checking whether the last
> vma->vm_end has either a VMA starting after it or ends before the end
> parameter.
>
> This check is rather dubious, so it is sensible to keep it local to
> mm/mseal.c as at a later stage it may be removed, and we don't want any
> other mm code to perform such a check.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/mseal.c | 36 +++++++++++-------------------------
>  1 file changed, 11 insertions(+), 25 deletions(-)
>
> diff --git a/mm/mseal.c b/mm/mseal.c
> index adbcc65e9660..61c07b1369cb 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -37,32 +37,22 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
>         return ret;
>  }
>
> -/*
> - * Check for do_mseal:
> - * 1> start is part of a valid vma.
> - * 2> end is part of a valid vma.
> - * 3> No gap (unallocated address) between start and end.
> - * 4> map is sealable.
> - */
> -static int check_mm_seal(unsigned long start, unsigned long end)
Is it possible to leave the check_mm_seal() function together with its
header comments? My original reason was to have a contract that
documents the exact entry check for mseal(). That way, no matter how
the code is refactored in the future, as long as the contract remains
true, I won't need to worry about behavior changes for mseal(). This
could be helpful if you move range_contains_unmapped into vma.c in the
future.

Note: "4> map is sealable." can be removed,  which is obsolete, we no
longer use sealable flags.

Thanks and regards,
-Jeff
> +/* Does the [start, end) range contain any unmapped memory? */
> +static bool range_contains_unmapped(struct mm_struct *mm,
> +               unsigned long start, unsigned long end)
>  {
>         struct vm_area_struct *vma;
> -       unsigned long nstart = start;
> +       unsigned long prev_end = start;
>         VMA_ITERATOR(vmi, current->mm, start);
>
> -       /* going through each vma to check. */
>         for_each_vma_range(vmi, vma, end) {
> -               if (vma->vm_start > nstart)
> -                       /* unallocated memory found. */
> -                       return -ENOMEM;
> -
> -               if (vma->vm_end >= end)
> -                       return 0;
> +               if (vma->vm_start > prev_end)
> +                       return true;
>
> -               nstart = vma->vm_end;
> +               prev_end = vma->vm_end;
>         }
>
> -       return -ENOMEM;
> +       return prev_end < end;
>  }
>
>  /*
> @@ -184,14 +174,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
>         if (mmap_write_lock_killable(mm))
>                 return -EINTR;
>
> -       /*
> -        * First pass, this helps to avoid
> -        * partial sealing in case of error in input address range,
> -        * e.g. ENOMEM error.
> -        */
> -       ret = check_mm_seal(start, end);
> -       if (ret)
> +       if (range_contains_unmapped(mm, start, end)) {
> +               ret = -ENOMEM;
>                 goto out;
> +       }
>
>         /*
>          * Second pass, this should success, unless there are errors
> --
> 2.50.1
>



* Re: [PATCH v3 5/5] mm/mseal: rework mseal apply logic
  2025-07-16 17:38 ` [PATCH v3 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
@ 2025-07-24 18:41   ` Jeff Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Jeff Xu @ 2025-07-24 18:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

Hi Lorenzo,

On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The logic can be simplified - firstly by renaming the inconsistently named
> apply_mm_seal() to mseal_apply().
>
> We then wrap mseal_fixup() into the main loop as the logic is simple enough
> to not require it, equally it isn't a hugely pleasant pattern in mprotect()
> etc.  so it's not something we want to perpetuate.
>
> We eliminate the need for invoking vma_iter_end() on each loop by directly
> determining if the VMA was merged - the only thing we need concern
> ourselves with is whether the start/end of the (gapless) range are offset
> into VMAs.
>
> This refactoring also avoids the rather horrid 'pass pointer to prev
> around' pattern used in mprotect() et al.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Xu <jeffxu@chromium.org>

Thanks and regards,
-Jeff
> ---
>  mm/mseal.c | 67 ++++++++++++++++--------------------------------------
>  1 file changed, 20 insertions(+), 47 deletions(-)
>
> diff --git a/mm/mseal.c b/mm/mseal.c
> index 61c07b1369cb..0ab12e09792a 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -15,28 +15,6 @@
>  #include <linux/sched.h>
>  #include "internal.h"
>
> -static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> -               struct vm_area_struct **prev, unsigned long start,
> -               unsigned long end, vm_flags_t newflags)
> -{
> -       int ret = 0;
> -       vm_flags_t oldflags = vma->vm_flags;
> -
> -       if (newflags == oldflags)
> -               goto out;
> -
> -       vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags);
> -       if (IS_ERR(vma)) {
> -               ret = PTR_ERR(vma);
> -               goto out;
> -       }
> -
> -       vm_flags_set(vma, VM_SEALED);
> -out:
> -       *prev = vma;
> -       return ret;
> -}
> -
>  /* Does the [start, end) range contain any unmapped memory? */
>  static bool range_contains_unmapped(struct mm_struct *mm,
>                 unsigned long start, unsigned long end)
> @@ -55,38 +33,33 @@ static bool range_contains_unmapped(struct mm_struct *mm,
>         return prev_end < end;
>  }
>
> -/*
> - * Apply sealing.
> - */
> -static int apply_mm_seal(unsigned long start, unsigned long end)
> +static int mseal_apply(struct mm_struct *mm,
> +               unsigned long start, unsigned long end)
>  {
> -       unsigned long nstart;
>         struct vm_area_struct *vma, *prev;
> -       VMA_ITERATOR(vmi, current->mm, start);
> +       unsigned long curr_start = start;
> +       VMA_ITERATOR(vmi, mm, start);
>
> +       /* We know there are no gaps so this will be non-NULL. */
>         vma = vma_iter_load(&vmi);
> -       /*
> -        * Note: check_mm_seal should already checked ENOMEM case.
> -        * so vma should not be null, same for the other ENOMEM cases.
> -        */
>         prev = vma_prev(&vmi);
>         if (start > vma->vm_start)
>                 prev = vma;
>
> -       nstart = start;
>         for_each_vma_range(vmi, vma, end) {
> -               int error;
> -               unsigned long tmp;
> -               vm_flags_t newflags;
> -
> -               newflags = vma->vm_flags | VM_SEALED;
> -               tmp = vma->vm_end;
> -               if (tmp > end)
> -                       tmp = end;
> -               error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
> -               if (error)
> -                       return error;
> -               nstart = vma_iter_end(&vmi);
> +               unsigned long curr_end = MIN(vma->vm_end, end);
> +
> +               if (!(vma->vm_flags & VM_SEALED)) {
> +                       vma = vma_modify_flags(&vmi, prev, vma,
> +                                       curr_start, curr_end,
> +                                       vma->vm_flags | VM_SEALED);
> +                       if (IS_ERR(vma))
> +                               return PTR_ERR(vma);
> +                       vm_flags_set(vma, VM_SEALED);
> +               }
> +
> +               prev = vma;
> +               curr_start = curr_end;
>         }
>
>         return 0;
> @@ -185,10 +158,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
>          * reaching the max supported VMAs, however, those cases shall
>          * be rare.
>          */
> -       ret = apply_mm_seal(start, end);
> +       ret = mseal_apply(mm, start, end);
>
>  out:
> -       mmap_write_unlock(current->mm);
> +       mmap_write_unlock(mm);
>         return ret;
>  }
>
> --
> 2.50.1
>



* Re: [PATCH v3 1/5] mm/mseal: always define VM_SEALED
  2025-07-24 18:34   ` Jeff Xu
@ 2025-07-24 18:44     ` Lorenzo Stoakes
  0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 18:44 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

On Thu, Jul 24, 2025 at 11:34:31AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > There is no reason to treat VM_SEALED in a special way, in each other case
> > in which a VMA flag is unavailable due to configuration, we simply assign
> > that flag to VM_NONE, so make VM_SEALED consistent with all other VMA
> > flags in this respect.
> >
> Originally, I wanted to avoid using VM_NONE for VM_SEALED to catch
> compiler errors in 32-bit builds. This would serve as a safeguard
> against inadvertently using VM_SEALED in code paths shared between
> 32-bit and 64-bit architectures.

I understand why you did it, and I fundamentally disagree.

This would make this the only VMA flag in the entire kernel having this
special treatment for something with very little implementation code, it
simply makes no sense.

The commit message makes this clear.

>
> Take an example of show_smap_vma_flags [1] :
>
> #ifdef CONFIG_64BIT
> [ilog2(VM_SEALED)] = "sl",
> #endif
>
> If a developer forgets to add #ifdef CONFIG_64BIT, the build will fail

This is a really silly thing to worry about.

> for 32-bit systems. With VM_SEALED redefined as VM_NONE, the problem
> will go unnoticed.

What problem exactly?

>
> This coding practice is more defensive and helps catch errors early

It's silly.

> on. It seems that you're working on introducing VM_SEALED for 32-bit
> systems. If that happens, we won't need this safeguard anymore. But
> until then, is it OK to keep it in place?  That said, if VM_SEALED
> support for 32-bit is coming really soon, we can merge this change,
> however, perhaps you could update the description to explain why we're
> removing this safeguard, i.e. due to 32-bit support will soon be in
> place.

No I won't, because that's not why I did it.



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 18:39   ` Jeff Xu
@ 2025-07-24 18:56     ` David Hildenbrand
  2025-07-24 22:18       ` David Hildenbrand
  2025-07-24 19:07     ` Lorenzo Stoakes
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 18:56 UTC (permalink / raw)
  To: Jeff Xu, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Kees Cook, linux-hardening

> 
> To summarize all the discussion points so far:
> 1. It's questionable behavior for madvise to allow destructive
> behavior for read-only anonymous mappings, regardless of mseal state.
> 2. We could potentially fix point 1 within madvise itself, without
> involving mseal, as Linus desires.

IIUC: disallow madvise(MADV_DONTNEED) without PROT_WRITE.

I am 99.99999% sure that that would break user case, unfortunately.

> 3. Android userspace uses destructive madvise to free up RAM, but I
> need to take a closer look at the patterns and usage to understand why
> they do that.

I am shocked that you question why they would use MADV_DONTNEED instead 
of ...

> 4. We could ask applications to switch to non-destructive madvise,
> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> switch the kernel to use non-destructive madvise implicitly for
> destructive madvise in suitable situations.

... MADV_COLD / MADV_PAGEOUT.

I am also shocked that you think asking apps to switch would not make us 
break user space.

> 5. We could investigate more based on vma->anon_vma

Or we do what sealing is supposed to do.

With the hope that this sealing fix here would not break user space.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 18:39   ` Jeff Xu
  2025-07-24 18:56     ` David Hildenbrand
@ 2025-07-24 19:07     ` Lorenzo Stoakes
  2025-07-24 21:53       ` David Hildenbrand
  1 sibling, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 19:07 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

On Thu, Jul 24, 2025 at 11:39:05AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> This change has two parts: a non-functional refactoring that moves a
> function from mseal.c to madvise.c, and a functional change to the
> behavior of madvise under mseal. Would you consider splitting the
> change into two parts? That seems to be a common practice when
> supplying patches on lkml.

No I won't do that.

It's a very small change, and it was intentionally bundled so we correct
the oddly implemented version while moving.

>
> On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> > to be located in mm/madvise.c.
> >
> This is part one of a non-functional refactoring work to move a
> function from mseal.c to madvise.c.
>
> There are two main reasons why I initially wanted to keep all
> mseal-related functions in mseal.c:
>
> 1 Keeping all mseal related logic in mseal.c makes it easier for
> developers to find all the impacted areas of mseal.

That's very silly, sorry but no.

You're putting stuff entirely specific to madvise() away from madvise(),
to bundle things up for no really good reason.

Again, you have a desire to do things that are at odds with how everything
else in mm is done.

> 2 mseal is not supported on 32-bit, and mseal.c is excluded from the
> 32-bit build (see the makefile). This would prevent accidentally using
> mseal in code paths shared between 32-bit and 64-bit architectures. It
> also avoids adding #ifdef in the .c file, which is recommended by the mm
> coding standard (I received comments about avoiding #ifdef in .c in
> the past).

You're doing something strictly worse by putting madvise() stuff to bitrot
in another file.

It's always a trade-off.

There's no set 'coding standard', there's what maintainers accept.

>
> Point 2 can go away if 32-bit mseal support is coming soon, the same as
> my other comments for patch 1/5.

But that's not the reason, as commit message states.

>
> Point 1 remains. However, I want to focus on the functional change
> part of this patch, rather than the other aspects.

No, point 1 is dismissed, as is point 2.

>
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() -
> > that spells out exactly what's happening.  Also explicitly check for an
> > anon VMA.
> >
> > Also add commentary to explain what's going on.
> >
> > Essentially - we disallow discarding of data in mseal()'d mappings in
> > instances where the user couldn't otherwise write to that data.
> >
> > Shared mappings are always backed, so no discard will actually truly
> > discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> > mappings are the ones we are interested in.
> >
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> >
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and would thus violate
> > the semantics of sealed VMAs.
> >
> This is the functional change and the most important area that I would
> like to discuss in this patch series.

OK.

>
> Since Jann Horn first highlighted [1] the problematic behavior of
> destructive madvise for anonymous mapping during initial discussions
> of mseal(), the proper solution has been open to discussion and
> exploration. Linus Torvalds has expressed interest [2] in fixing this
> within madvise itself, without requiring mseal, and I copied it here
> for ease of reference:
>
> “Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.”
>
> After mseal was merged, Pedro Falcato raised a valid concern regarding
> file-backed private mappings. The issue is that these mappings can be
> written to, and subsequently changed to read-only, which is precisely
> the problem this patch aims to fix. I attempted to address this in
> [3]. However, that patch was rejected due to the introduction of a new
> vm_flags, which was a valid rejection as it wasn't the ideal solution.
> Nevertheless, it sparked an interesting discussion, with me raising a
> point that userspace might use this feature to free up RAM for
> file-backed private mappings that are never written to, and input about
> this topic from Vlastimil Babka [4] is about MADV_COLD/MADV_PAGEOUT.

OK not sure what point you're making yet?

>
> A detail about userspace calling those madvise for file-backed private
> mapping. Previously, I added extra logging in the kernel, and observed
> many instances of those calls  during Android phone startup, although
> I haven’t delved into the reason behind why user space calls those,
> e.g. if it is from an individual app or Android platform.

Hang on, sorry, are you saying that you intentionally allowed destruction
of mseal()'d VMAs to serve android?

I hope I'm misunderstanding you here.

Either way I don't know what your point is? Don't mseal them if you want to
perform destructive operations on them?

You have to argue as to why this change is not valid in _upstream_ linux.

>
> Incidentally, recently while I was studying selinux code, particularly
> execmod permission [5], I learned that selinux utilizes vma->anon_vma
> to identify file-backed mappings that have been written to.  Jann
> pointed out to me that the kernel creates a COW mapping when a private
> file-backed mapping is written. Now I'm wondering if this could be the
> answer to our problem. We could try having destructive madvise check
> vma->anon_vma and reject the call if it's true. I haven't had a chance
> to test this theory yet, though.

Umm what? Why? What are you talking about?

A MAP_PRIVATE mapping will not have VM_SHARED set. This is why I changed
the check to this.

I really don't understand what point you're trying to make here.

The check I've provided works correctly to disallow the previously
_incorrectly allowed_ MAP_PRIVATE mapping of a file-backed mapping.

This was something you missed, and is now corrected.

>
> To summarize all the discussion points so far:
> 1. It's questionable behavior for madvise to allow destructive
> behavior for read-only anonymous mappings, regardless of mseal state.

Umm, ok. Well maybe, but it's essentially uAPI at this point. This feels
irrelevant to this series.

> 2. We could potentially fix point 1 within madvise itself, without
> involving mseal, as Linus desires.

No, we will not do that.

> 3. Android userspace uses destructive madvise to free up RAM, but I
> need to take a closer look at the patterns and usage to understand why
> they do that.

OK so what?

> 4. We could ask applications to switch to non-destructive madvise,
> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> switch the kernel to use non-destructive madvise implicitly for
> destructive madvise in suitable situations.

Umm what? I don't understand your point.

> 5. We could investigate more based on vma->anon_vma

I have no idea what you mean by this. I am an rmap maintainer and have
worked extensively with anon_vma, what's the point exactly?

>
> Link: https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
> [1]
> Link: https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
> [2]
> Link: https://lore.kernel.org/all/20241017005105.3047458-2-jeffxu@chromium.org/
> [3]
> Link: https://lore.kernel.org/all/8f68ad82-2f60-49f8-b150-0cf183c9cc71@suse.cz/
> [4]
> Link: https://elixir.bootlin.com/linux/v6.15.7/source/security/selinux/hooks.c#L3878
> [5]
>
> > Finally, update the mseal tests, which were asserting previously that a
> > read-only MAP_PRIVATE file-backed mapping could be discarded.
> >
> The test you are updating is not intended for the scenario this patch
> is aimed to fix: i.e. the scenario:
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
>
> The test case doesn't include steps 1, 2, and 3; is it possible to
> keep the existing one and create a new test case? But I don't think
> that's the main discussion point. We can revisit test cases once we've
> fully discussed the design.

I do not know why you put this here. Can you please put review along side
the code you're reviewing.

You're making my life unnecessarily hard here.

But yes, you're right, I messed up the test here, I'll send a fix-patch.

Incidentally, it seems like the test _explicitly_ asserted that this
> behaviour was the opposite of what it should be, which makes me think again
this might be intentional... Concerning.

>
> Thanks and regards,
> -Jeff
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Reviewed-by: Pedro Falcato <pfalcato@suse.de>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/madvise.c                            | 63 ++++++++++++++++++++++++-
> >  mm/mseal.c                              | 49 -------------------
> >  mm/vma.h                                |  7 ---
> >  tools/testing/selftests/mm/mseal_test.c |  3 +-
> >  4 files changed, 63 insertions(+), 59 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 2bf80989d5b6..dc3d8497b0f4 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/sched/mm.h>
> >  #include <linux/mm_inline.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/string.h>
> >  #include <linux/uio.h>
> >  #include <linux/ksm.h>
> > @@ -1255,6 +1256,66 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> >                                &guard_remove_walk_ops, NULL);
> >  }
> >
> > +#ifdef CONFIG_64BIT
> > +/* Does the madvise operation result in discarding of mapped data? */
> > +static bool is_discard(int behavior)
> > +{
> > +       switch (behavior) {
> > +       case MADV_FREE:
> > +       case MADV_DONTNEED:
> > +       case MADV_DONTNEED_LOCKED:
> > +       case MADV_REMOVE:
> > +       case MADV_DONTFORK:
> > +       case MADV_WIPEONFORK:
> > +       case MADV_GUARD_INSTALL:
> > +               return true;
> > +       }
> > +
> > +       return false;
> > +}
> > +
> > +/*
> > + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> > + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> > + *
> > + * This is because users cannot trivally discard data from these VMAs, and may
> > + * only do so via an appropriate madvise() call.
> > + */
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > +       struct vm_area_struct *vma = madv_behavior->vma;
> > +
> > +       /* If the VMA isn't sealed we're good. */
> > +       if (can_modify_vma(vma))
> > +               return true;
> > +
> > +       /* For a sealed VMA, we only care about discard operations. */
> > +       if (!is_discard(madv_behavior->behavior))
> > +               return true;
> > +
> > +       /*
> > +        * But shared mappings are fine, as dirty pages will be written to a
> > +        * backing store regardless of discard.
> > +        */
> > +       if (vma->vm_flags & VM_SHARED)
> > +               return true;
> > +
> > +       /* If the user could write to the mapping anyway, then this is fine. */
> > +       if ((vma->vm_flags & VM_WRITE) &&
> > +           arch_vma_access_permitted(vma, /* write= */ true,
> > +                       /* execute= */ false, /* foreign= */ false))
> > +               return true;
> > +
> > +       /* Otherwise, we are not permitted to perform this operation. */
> > +       return false;
> > +}
> > +#else
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > +       return true;
> > +}
> > +#endif
> > +
> >  /*
> >   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
> >   * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1268,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> >         struct madvise_behavior_range *range = &madv_behavior->range;
> >         int error;
> >
> > -       if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> > +       if (unlikely(!can_madvise_modify(madv_behavior)))
> >                 return -EPERM;
> >
> >         switch (behavior) {
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index c27197ac04e8..1308e88ab184 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -11,7 +11,6 @@
> >  #include <linux/mman.h>
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> > -#include <linux/mmu_context.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/sched.h>
> >  #include "internal.h"
> > @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
> >         vm_flags_set(vma, VM_SEALED);
> >  }
> >
> > -static bool is_madv_discard(int behavior)
> > -{
> > -       switch (behavior) {
> > -       case MADV_FREE:
> > -       case MADV_DONTNEED:
> > -       case MADV_DONTNEED_LOCKED:
> > -       case MADV_REMOVE:
> > -       case MADV_DONTFORK:
> > -       case MADV_WIPEONFORK:
> > -       case MADV_GUARD_INSTALL:
> > -               return true;
> > -       }
> > -
> > -       return false;
> > -}
> > -
> > -static bool is_ro_anon(struct vm_area_struct *vma)
> > -{
> > -       /* check anonymous mapping. */
> > -       if (vma->vm_file || vma->vm_flags & VM_SHARED)
> > -               return false;
> > -
> > -       /*
> > -        * check for non-writable:
> > -        * PROT=RO or PKRU is not writeable.
> > -        */
> > -       if (!(vma->vm_flags & VM_WRITE) ||
> > -               !arch_vma_access_permitted(vma, true, false, false))
> > -               return true;
> > -
> > -       return false;
> > -}
> > -
> > -/*
> > - * Check if a vma is allowed to be modified by madvise.
> > - */
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > -       if (!is_madv_discard(behavior))
> > -               return true;
> > -
> > -       if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> > -               return false;
> > -
> > -       /* Allow by default. */
> > -       return true;
> > -}
> > -
> >  static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >                 struct vm_area_struct **prev, unsigned long start,
> >                 unsigned long end, vm_flags_t newflags)
> > diff --git a/mm/vma.h b/mm/vma.h
> > index acdcc515c459..85db5e880fcc 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -577,8 +577,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> >         return true;
> >  }
> >
> > -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> > -
> >  #else
> >
> >  static inline bool can_modify_vma(struct vm_area_struct *vma)
> > @@ -586,11 +584,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> >         return true;
> >  }
> >
> > -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> > -{
> > -       return true;
> > -}
> > -
> >  #endif
> >
> >  #if defined(CONFIG_STACK_GROWSUP)
> > diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> > index 005f29c86484..34c042da4de2 100644
> > --- a/tools/testing/selftests/mm/mseal_test.c
> > +++ b/tools/testing/selftests/mm/mseal_test.c
> > @@ -1712,7 +1712,6 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> >         unsigned long size = 4 * page_size;
> >         int ret;
> >         int fd;
> > -       unsigned long mapflags = MAP_PRIVATE;
> >
> >         fd = memfd_create("test", 0);
> >         FAIL_TEST_IF_FALSE(fd > 0);
> > @@ -1720,7 +1719,7 @@ static void test_seal_discard_ro_anon_on_filebacked(bool seal)
> >         ret = fallocate(fd, 0, 0, size);
> >         FAIL_TEST_IF_FALSE(!ret);
> >
> > -       ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
> > +       ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
> >         FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
> >
> >         if (seal) {
> > --
> > 2.50.1
> >



* Re: [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case
  2025-07-24 18:32 ` [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Jeff Xu
@ 2025-07-24 19:10   ` Lorenzo Stoakes
  0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-24 19:10 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

On Thu, Jul 24, 2025 at 11:32:26AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> Thanks for including me in this thread. I've just returned from
> vacation and am catching up on my emails. I'll respond to each patch
> separately in the following emails.

You're welcome, I promised I would always cc you, and I keep my promises as
best I can.

It's unfortunate that you're sending this review on more or less the last
day of the cycle, but there we are.

>
> Could you consider adding mm/mseal.c to the HARDENING section of
> MAINTAINERS? Please include Kees and linux-hardening in future emails
> about mseal - Kees has been helping me with mseal since the beginning.

No, because we might move any of this logic elsewhere and I consider it
fundamental to VMA logic.

I am more than happy to include Kees as well on any emails regarding
this. But it does not serve VMA logic to arbitrarily keep things in a
certain file to satisfy MAINTAINERS.



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
  2025-07-24 18:39   ` Jeff Xu
@ 2025-07-24 21:15   ` Kees Cook
  2025-07-24 21:32     ` David Hildenbrand
                       ` (2 more replies)
  2025-07-24 22:12   ` David Hildenbrand
  2 siblings, 3 replies; 38+ messages in thread
From: Kees Cook @ 2025-07-24 21:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Jeff Xu

On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
> The justification for this change is to account for the case where:
> 
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.
> 
> If we were to now allow discard of this data, it would mean mseal() would
> not prevent the unrecoverable discarding of data and would thus violate
> the semantics of sealed VMAs.

I want to make sure I'm understanding this right:

Was the old behavior to allow discard? (If so, that seems like it wasn't
doing what Linus asked for[1], but it's not clear to me if that was
the behavior Chrome wanted.) The test doesn't appear to validate which
contents end up being visible after the discard, only whether or not
madvise() succeeds.

As an aside, why should discard work in this case even without step 4?
Wouldn't setting "read-only" imply you don't want the memory to change
out from under you? I guess I'm not clear on the semantics: how do memory
protection bits map to madvise actions like this?

-Kees

[1] https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

-- 
Kees Cook



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:15   ` Kees Cook
@ 2025-07-24 21:32     ` David Hildenbrand
  2025-07-24 21:41       ` David Hildenbrand
  2025-07-25  5:49     ` Lorenzo Stoakes
  2025-07-25 16:21     ` Jeff Xu
  2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 21:32 UTC (permalink / raw)
  To: Kees Cook, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

> As an aside, why should discard work in this case even without step 4?
> Wouldn't setting "read-only" imply you don't want the memory to change
> out from under you? I guess I'm not clear on the semantics: how do memory
> protection bits map to madvise actions like this?

They generally don't affect MADV_DONTNEED behavior. The only documented 
(man page) reason for EPERM in the man page is related to MADV_HWPOISON.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:32     ` David Hildenbrand
@ 2025-07-24 21:41       ` David Hildenbrand
  2025-07-24 22:29         ` Kees Cook
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 21:41 UTC (permalink / raw)
  To: Kees Cook, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On 24.07.25 23:32, David Hildenbrand wrote:
>> As an aside, why should discard work in this case even without step 4?
>> Wouldn't setting "read-only" imply you don't want the memory to change
>> out from under you? I guess I'm not clear on the semantics: how do memory
>> protection bits map to madvise actions like this?
> 
> They generally don't affect MADV_DONTNEED behavior. The only documented
> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
> 

(Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires 
corresponding permissions)

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 19:07     ` Lorenzo Stoakes
@ 2025-07-24 21:53       ` David Hildenbrand
  2025-07-25  6:17         ` Lorenzo Stoakes
  2025-07-25 16:22         ` Jeff Xu
  0 siblings, 2 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 21:53 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jeff Xu
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Kees Cook, linux-hardening

> 
>> 4. We could ask applications to switch to non-destructive madvise,
>> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
>> switch the kernel to use non-destructive madvise implicitly for
>> destructive madvise in suitable situations.
> 
> Umm what? I don't understand your point.
> 
>> 5. We could investigate more based on vma->anon_vma
> 
> I have no idea what you mean by this. I am an rmap maintainer and have
> worked extensively with anon_vma, what's the point exactly?

I think the idea would be to add an additional anon_vma check: so if 
you have a MAP_PRIVATE file mapping, you could still allow for 
MADV_DONTNEED if you are sure that there are no anon folios in there.

If there is an anon_vma, the only way to find out is actually looking at 
the page tables.

To be completely precise, one would have to enlighten the zap logic to 
refuse to zap if there is any anon folio there, and bail out.
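
A minimal sketch of what such a check might look like, purely to
illustrate the idea (hypothetical helper name and placement, not part of
this series):

	static bool private_file_discard_ok(struct vm_area_struct *vma)
	{
		/* Shared file mappings are always backed by the file. */
		if (vma->vm_flags & VM_SHARED)
			return true;

		/*
		 * No anon_vma means no CoW'd anonymous folios can currently
		 * be mapped here, so a discard cannot throw away data the
		 * user could not otherwise recover. As noted above, this
		 * races with concurrent faults and is only precise if the
		 * zap path itself refuses anon folios.
		 */
		return !vma->anon_vma;
	}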

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
  2025-07-24 18:39   ` Jeff Xu
  2025-07-24 21:15   ` Kees Cook
@ 2025-07-24 22:12   ` David Hildenbrand
  2025-07-25  7:01     ` Lorenzo Stoakes
  2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 22:12 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
	linux-mm, linux-kernel, Jeff Xu

On 16.07.25 19:38, Lorenzo Stoakes wrote:
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> to be located in mm/madvise.c.
> 
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
> 
> Put a static function in mm/madvise.c instead - can_madvise_modify() -
> that spells out exactly what's happening.  Also explicitly check for an
> anon VMA.
> 
> Also add commentary to explain what's going on.
> 
> Essentially - we disallow discarding of data in mseal()'d mappings in
> instances where the user couldn't otherwise write to that data.
> 
> Shared mappings are always backed, so no discard will actually truly
> discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> mappings are the ones we are interested in.
> 
> We make a change to the logic here to correct a mistake - we must disallow
> discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> were not.
> 
> The justification for this change is to account for the case where:
> 
> 1. A MAP_PRIVATE R/W file-backed mapping is established.
> 2. The mapping is written to, which backs it with anonymous memory.
> 3. The mapping is mprotect()'d read-only.
> 4. The mapping is mseal()'d.

Thinking about this a bit (should have realized this implication 
earlier) ... assuming we have:

1. A MAP_PRIVATE R/O file-backed mapping.
2. The mapping is mseal()'d.

We only really have anon folios in there with things like (a) uprobe (b) 
debugger access (c) similarly weird FOLL_FORCE stuff.

Now, most executables/libraries are mapped that way. If someone would 
rely on MADV_DONTNEED to zap pages in there (to free up memory), that 
would get rejected.

Does something like that rely on MADV_DONTNEED working? Good question.

Checking for anon_vma in addition, as mentioned in the other thread, 
would be a "cheap" check to rule out that there are currently anon vmas 
in there.

Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 18:56     ` David Hildenbrand
@ 2025-07-24 22:18       ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 22:18 UTC (permalink / raw)
  To: Jeff Xu, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Kees Cook, linux-hardening

On 24.07.25 20:56, David Hildenbrand wrote:
>>
>> To summarize all the discussion points so far:
>> 1. It's questionable behavior for madvise to allow destructive
>> behavior for read-only anonymous mappings, regardless of mseal state.
>> 2. We could potentially fix point 1 within madvise itself, without
>> involving mseal, as Linus desires.
> 
> IIUC: disallow madvise(MADV_DONTNEED) without PROT_WRITE.
> 
> I am 99.99999% sure that that would break user case, unfortunately.
> 
>> 3. Android userspace uses destructive madvise to free up RAM, but I
>> need to take a closer look at the patterns and usage to understand why
>> they do that.
> 
> I am shocked that you question why they would use MADV_DONTNEED instead
> of ...
> 
>> 4. We could ask applications to switch to non-destructive madvise,
>> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
>> switch the kernel to use non-destructive madvise implicitly for
>> destructive madvise in suitable situations.
> 
> ... MADV_COLD / MADV_PAGEOUT.
> 
> I am also shocked that you think asking apps to switch would not make us
> break user space.
>
>> 5. We could investigate more based on vma->anon_vma
> 
> Or we do what sealing is supposed to do.

Sorry for the rather hard replies, I was not understanding at all what 
you were getting at really.

> 
> With the hope that this sealing fix here would not break user space.

Is your concern that something (in Chrome?) would be relying on 
MADV_DONTNEED working in case we had a MAP_PRIVATE R/O file mapping?

Again, disallowing that completely (even without mseal()) would break 
user space, I am very sure.

Whether we should allow zapping *anonymous folios* in MAP_PRIVATE R/O 
file mapping is a good question; it is hard to tell if that would break anything.

For zapping *anonymous folios* in MAP_PRIVATE R/O anon mappings, I am 
sure there are use cases around userfaultfd, I'm afraid ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:41       ` David Hildenbrand
@ 2025-07-24 22:29         ` Kees Cook
  2025-07-24 22:47           ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Kees Cook @ 2025-07-24 22:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
> On 24.07.25 23:32, David Hildenbrand wrote:
> > > As an aside, why should discard work in this case even without step 4?
> > > Wouldn't setting "read-only" imply you don't want the memory to change
> > > out from under you? I guess I'm not clear on the semantics: how do memory
> > > protection bits map to madvise actions like this?
> > 
> > They generally don't affect MADV_DONTNEED behavior. The only documented
> > (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
> > 
> 
> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
> permissions)

Shouldn't an MADV action that changes memory contents require the W bit
though? I mean, I assume the ship may have sailed on this, but it feels
mismatched to me.

-- 
Kees Cook



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 22:29         ` Kees Cook
@ 2025-07-24 22:47           ` David Hildenbrand
  2025-07-25  7:41             ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-24 22:47 UTC (permalink / raw)
  To: Kees Cook
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On 25.07.25 00:29, Kees Cook wrote:
> On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
>> On 24.07.25 23:32, David Hildenbrand wrote:
>>>> As an aside, why should discard work in this case even without step 4?
>>>> Wouldn't setting "read-only" imply you don't want the memory to change
>>>> out from under you? I guess I'm not clear on the semantics: how do memory
>>>> protection bits map to madvise actions like this?
>>>
>>> They generally don't affect MADV_DONTNEED behavior. The only documented
>>> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
>>>
>>
>> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
>> permissions)
> 
> Shouldn't an MADV action that changes memory contents require the W bit
> though?

In a MAP_PRIVATE file mapping, to know whether you are actually 
modifying memory ("discarding pages" ...) would require checking the 
actually mapped pages (mixture of anon and !anon folios). Only zapping 
anon folios is the problematic bit, really.

It could be implemented (e.g., fail halfway through while actually 
walking the page tables and zap).

But, yeah ...

> I mean, I assume the ship may have sailed on this, but it feels
> mismatched to me.

... I think that is the case, unfortunately.

I remember that userfaultfd can do some really nasty stuff with 
UFFDIO_COPY and MADV_DONTNEED on memory areas that don't have write 
permissions ... or even read permissions. Not sure if CRIU or other use 
cases depend on that in some weird way.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check
  2025-07-24 18:40   ` Jeff Xu
@ 2025-07-25  5:33     ` Lorenzo Stoakes
  0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  5:33 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Kees Cook, linux-hardening

On Thu, Jul 24, 2025 at 11:40:59AM -0700, Jeff Xu wrote:
> Hi Lorenzo,
>
> On Wed, Jul 16, 2025 at 10:38 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > The check_mm_seal() function is doing something general - checking whether
> > a range contains only VMAs (or rather that it does NOT contain any
> > unmapped regions).
> >
> > So rename this function to range_contains_unmapped().
> >
> > Additionally simplify the logic, we are simply checking whether the last
> > vma->vm_end has either a VMA starting after it or ends before the end
> > parameter.
> >
> > This check is rather dubious, so it is sensible to keep it local to
> > mm/mseal.c as at a later stage it may be removed, and we don't want any
> > other mm code to perform such a check.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > ---
> >  mm/mseal.c | 36 +++++++++++-------------------------
> >  1 file changed, 11 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index adbcc65e9660..61c07b1369cb 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -37,32 +37,22 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >         return ret;
> >  }
> >
> > -/*
> > - * Check for do_mseal:
> > - * 1> start is part of a valid vma.
> > - * 2> end is part of a valid vma.
> > - * 3> No gap (unallocated address) between start and end.
> > - * 4> map is sealable.
> > - */
> > -static int check_mm_seal(unsigned long start, unsigned long end)
> Is it possible to leave the check_mm_seal() function together with its
> header comments? My original reason was to have a contract that
> documents the exact entry check for mseal(). That way, no matter how
> the code is refactored in the future, as long as the contract remains
> true, I won't need to worry about behavior changes for mseal(). This
> could be helpful if you move range_contains_unmapped into vma.c in the
> future.
>
> Note: "4> map is sealable." can be removed,  which is obsolete, we no
> longer use sealable flags.

Sure, I will add in a comment to make this abundantly clear.

>
> Thanks and regards,
> -Jeff
> > +/* Does the [start, end) range contain any unmapped memory? */
> > +static bool range_contains_unmapped(struct mm_struct *mm,
> > +               unsigned long start, unsigned long end)
> >  {
> >         struct vm_area_struct *vma;
> > -       unsigned long nstart = start;
> > +       unsigned long prev_end = start;
> >         VMA_ITERATOR(vmi, current->mm, start);
> >
> > -       /* going through each vma to check. */
> >         for_each_vma_range(vmi, vma, end) {
> > -               if (vma->vm_start > nstart)
> > -                       /* unallocated memory found. */
> > -                       return -ENOMEM;
> > -
> > -               if (vma->vm_end >= end)
> > -                       return 0;
> > +               if (vma->vm_start > prev_end)
> > +                       return true;
> >
> > -               nstart = vma->vm_end;
> > +               prev_end = vma->vm_end;
> >         }
> >
> > -       return -ENOMEM;
> > +       return prev_end < end;
> >  }
> >
> >  /*
> > @@ -184,14 +174,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
> >         if (mmap_write_lock_killable(mm))
> >                 return -EINTR;
> >
> > -       /*
> > -        * First pass, this helps to avoid
> > -        * partial sealing in case of error in input address range,
> > -        * e.g. ENOMEM error.
> > -        */
> > -       ret = check_mm_seal(start, end);
> > -       if (ret)
> > +       if (range_contains_unmapped(mm, start, end)) {
> > +               ret = -ENOMEM;
> >                 goto out;
> > +       }
> >
> >         /*
> >          * Second pass, this should success, unless there are errors
> > --
> > 2.50.1
> >



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:15   ` Kees Cook
  2025-07-24 21:32     ` David Hildenbrand
@ 2025-07-25  5:49     ` Lorenzo Stoakes
  2025-07-25 16:21     ` Jeff Xu
  2 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  5:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Jeff Xu

On Thu, Jul 24, 2025 at 02:15:26PM -0700, Kees Cook wrote:
> On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and would thus violate
> > the semantics of sealed VMAs.
>
> I want to make sure I'm understanding this right:
>
> Was the old behavior to allow discard? (If so, that seems like it wasn't
> doing what Linus asked for[1], but it's not clear to me if that was
> the behavior Chrome wanted.) The test doesn't appear to validate which
> contents end up being visible after the discard, only whether or not
> madvise() succeeds.

Yes the old behaviour allowed discard in this case, because:

	/* check anonymous mapping. */
	if (vma->vm_file || vma->vm_flags & VM_SHARED)
		return false;

In is_ro_anon() would return false (we have vma->vm_file), and in
can_modify_vma_madv() we'd hit:

	if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
		return false;

	/* Allow by default. */
	return true;

The fix is, in effect, to check only vma->vm_flags & VM_SHARED.
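
For illustration, a hypothetical userspace sketch of the four-step
sequence from the commit message (no error handling; the raw syscall is
used and SYS_mseal is assumed to be defined by the installed kernel
headers) - with this change the final madvise() now fails with EPERM:

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = getpagesize();
		int fd = memfd_create("test", 0);
		char *p;

		ftruncate(fd, len);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);		/* 1. */
		memset(p, 'x', len);			/* 2. CoW -> anon memory */
		mprotect(p, len, PROT_READ);		/* 3. */
		syscall(SYS_mseal, p, len, 0);		/* 4. */

		/* Previously succeeded and discarded the data; now EPERM. */
		return madvise(p, len, MADV_DONTNEED);
	}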

>
> As an aside, why should discard work in this case even without step 4?
> Wouldn't setting "read-only" imply you don't want the memory to change
> out from under you? I guess I'm not clear on the semantics: how do memory
> protection bits map to madvise actions like this?

I mean this is uAPI so it's moot; we can't change this.

I think you're assuming read-only is stronger than it actually is in the
general case.

VM_MAYWRITE is the key thing here.

In do_mmap() in mm/mmap.c:

	if (file) {
		struct inode *inode = file_inode(file);
		unsigned long flags_mask;
		int err;

		...

		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			...
			fallthrough;
		case MAP_SHARED_VALIDATE:
			...
			if (!(file->f_mode & FMODE_WRITE))
				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

			...
		}
		...
	}

So we only actually lose VM_MAYWRITE if the _file_ itself doesn't have
write permission.

Otherwise we might at any time mprotect() the mapping to be writable in any
case.
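
A tiny hypothetical illustration of that point (no error handling, any
readable file will do):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = getpagesize();
		int fd = open("/etc/hostname", O_RDONLY);
		char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

		/*
		 * Succeeds: the MAP_PRIVATE VMA keeps VM_MAYWRITE despite the
		 * O_RDONLY fd, so it can be made writable at any time.
		 */
		return mprotect(p, len, PROT_READ | PROT_WRITE);
	}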

mseal() changes things, as it's a stronger requirement. You're explicitly saying
'I don't want this data to be discarded', which is why we should be firmer here.

I disagree this needs to be changed more broadly, but in any case, it'd break
uAPI so it's moot.

And wrt this series, it's further moot.

>
> -Kees
>
> [1] https://lore.kernel.org/lkml/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
>
> --
> Kees Cook



* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:53       ` David Hildenbrand
@ 2025-07-25  6:17         ` Lorenzo Stoakes
  2025-07-25 16:22         ` Jeff Xu
  1 sibling, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  6:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jeff Xu, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Kees Cook,
	linux-hardening

On Thu, Jul 24, 2025 at 11:53:52PM +0200, David Hildenbrand wrote:
> >
> > > 4. We could ask applications to switch to non-destructive madvise,
> > > like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> > > switch the kernel to use non-destructive madvise implicitly for
> > > destructive madvise in suitable situations.
> >
> > Umm what? I don't understand your point.
> >
> > > 5. We could investigate more based on vma->anon_vma
> >
> > I have no idea what you mean by this. I am an rmap maintainer and have
> > worked extensively with anon_vma, what's the point exactly?
>
> I think, the idea would be to add an additional anon_vma check: so if you
> have a MAP_PRIVATE file mapping, you could still allow for MADV_DONTNEED if
> you are sure that there are no anon folios in there.

OK this is a more coherent explanation of what this means, thanks.

In no other case are we checking if there is data there that is different from
post-discard, so this would be inconsistent with other disallowed madvise()
modes.

Equally, to me setting mprotect(PROT_READ) then mseal()'ing is a contract, and
adding a 'but we let you discard if we go check and it's fine' feels like really
inconsistent semantics.

We're dealing with a real edge-case scenario here of a MAP_PRIVATE mapping
(which means you are essentially asking for anon) being intentionally marked
read-only then sealed.

I think it's _better_ to be clearer on this.

>
> If there is an anon_vma, the only way to find out is actually looking at the
> page tables.
>
> To be completely precise, one would have to enlighten the zap logic to
> refuse to zap if there is any anon folio there, and bail out.

Yeah, absolutely not; this would be crazy, especially for such an edge case.

I'm sure you agree :)

>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case
  2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-07-24 18:32 ` [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Jeff Xu
@ 2025-07-25  6:40 ` Lorenzo Stoakes
  6 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  6:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

Since there's debate about the semantics of the MAP_PRIVATE stuff I'll send
a v4 with that taken out.

I very much want the refactorings to land in 6.17.

So we can carry on debating the ins and outs of this and make any
_semantic_ changes for 6.18.

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 22:12   ` David Hildenbrand
@ 2025-07-25  7:01     ` Lorenzo Stoakes
  2025-07-25  7:38       ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  7:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On Fri, Jul 25, 2025 at 12:12:54AM +0200, David Hildenbrand wrote:
> On 16.07.25 19:38, Lorenzo Stoakes wrote:
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought
> > to be located in mm/madvise.c.
> >
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() -
> > that spells out exactly what's happening.  Also explicitly check for an
> > anon VMA.
> >
> > Also add commentary to explain what's going on.
> >
> > Essentially - we disallow discarding of data in mseal()'d mappings in
> > instances where the user couldn't otherwise write to that data.
> >
> > Shared mappings are always backed, so no discard will actually truly
> > discard the data.  Read-only anonymous and MAP_PRIVATE file-backed
> > mappings are the ones we are interested in.
> >
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> >
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
>
> Thinking about this a bit (should have realized this implication earlier)

Well none of us did...

> ... assuming we have:
>
> 1. A MAP_PRIVATE R/O file-backed mapping.
> 2. The mapping is mseal()'d.
>
> We only really have anon folios in there with things like (a) uprobe (b)
> debugger access (c) similarly weird FOLL_FORCE stuff.
>
> Now, most executables/libraries are mapped that way. If someone would rely
> on MADV_DONTNEED to zap pages in there (to free up memory), that would get
> rejected.

Right, yes.

This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
this' in android.

The documentation is really not specific enough, we need to fix that. It's
effectively stating any anon mappings are sealed, which is just not true with
existing semantics.

However I see:

	Memory sealing can automatically be applied by the runtime loader to
	seal .text and .rodata pages and applications can additionally seal
	security critical data at runtime.

So yes, we're going to break MADV_DONTNEED of these mappings.

BUT.

Would you really want to MADV_DONTNEED away uprobes etc.?? That seems... very
strange and broken behaviour no?

Note also that mappings of read-only files have VM_SHARED stripped, so they
become read-only (with ~VM_MAYWRITE).

To be clear, this is the case where the file's mode is read-only, not merely
that the mapping itself is read-only.

So with this change, we'd disallow discard of this.

It'd be pretty odd to mseal() a read-only file-backed mapping and then try to
discard, but maybe somebody would weirdly rely upon this?

It's inconsistent: somebody MAP_SHARED-mapping a read/write file, whether
mapped read-only or r/w, can discard fine even if sealed, but if the file
happens to be read-only they can't.

But we could add a VM_MAYWRITE check also.

OK maybe I"m softening on the anon_vma thing see below.

So we could combine these checks to avoid these issues.


>
> Does something like that rely on MADV_DONTNEED working? Good question.

Kees/Jeff? Can you check if android relies on this?

>
> Checking for anon_vma in addition, ad mentioned in the other thread, would
> be a "cheap" check to rule out that there are currently anon vmas in there.
>
> Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...

But hang on, it's read-only so we shouldn't get racing faults... right?

Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
rule out the _usual_ cases.

We're not changing zapping logic for this though, sorry. That's just a crazy
length to go to.

>
> --
> Cheers,
>
> David / dhildenb
>

In any case, I'm going to send a version of this with the controversial bit
stripped so we (hopefully) land the refactorings for 6.17 and can change
semantics if necessary for 6.18.

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25  7:01     ` Lorenzo Stoakes
@ 2025-07-25  7:38       ` David Hildenbrand
  2025-07-25  8:53         ` Lorenzo Stoakes
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-25  7:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

>>>
>>> The justification for this change is to account for the case where:
>>>
>>> 1. A MAP_PRIVATE R/W file-backed mapping is established.
>>> 2. The mapping is written to, which backs it with anonymous memory.
>>> 3. The mapping is mprotect()'d read-only.
>>> 4. The mapping is mseal()'d.
>>
>> Thinking about this a bit (should have realized this implication earlier)
> 
> Well none of us did...
>
>> ... assuming we have:
>>
>> 1. A MAP_PRIVATE R/O file-backed mapping.
>> 2. The mapping is mseal()'d.
>>
>> We only really have anon folios in there with things like (a) uprobe (b)
>> debugger access (c) similarly weird FOLL_FORCE stuff.
>>
>> Now, most executables/libraries are mapped that way. If someone would rely
>> on MADV_DONTNEED to zap pages in there (to free up memory), that would get
>> rejected.
> 
> Right, yes.
> 
> This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
> this' in android.

It's rather weird usage of MADV_DONTNEED, but maybe, for some R/O 
buffers ...

> 
> The documentation is really not specific enough, we need to fix that. It's
> effectively stating any anon mappings are sealed, which is just not true with
> existing semantics.
> 
> However I see:
> 
> 	Memory sealing can automatically be applied by the runtime loader to
> 	seal .text and .rodata pages and applications can additionally seal
> 	security critical data at runtime.
> 
> So yes, we're going to break MADV_DONTNEED of this mappings.
>
> BUT.
> 
> Would you really want to MADV_DONTNEED away uprobes etc.?? That seems... very
> strange and broken behaviour no?
> 
> Note that, also, mappings of read-only files have VM_SHARED stripped. So they
> become read-only (With ~VM_MAYWRITE).
> 
> To be clear this is where the mode of the file is read-only, not that the
> mapping is read-only alone.
> 
> So with this change, we'd disallow discard of this.
> 
> It'd be pretty odd to mseal() a read-only file-backed mapping and then try to
> discard, but maybe somebody would weirdly rely upon this?
> 
> It's inconsistent, as a person MAP_SHARED mapping a file that is read/write but
> mapped read-only (or r/w of course), can discard fine eve if sealed, but if the
> file happens to be read-only can't.
> 
> But we could add a VM_MAYWRITE check also.
> 
> OK maybe I"m softening on the anon_vma thing see below.
> 
> So we could combine these checks to avoid these issues.
> 
> 
>>
>> Does something like that rely on MADV_DONTNEED working? Good question.
> 
> Kees/Jeff? Can you check if android relies on this?
> 
>>
>> Checking for anon_vma in addition, ad mentioned in the other thread, would
>> be a "cheap" check to rule out that there are currently anon vmas in there.
>>
>> Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...
> 
> But hang on, it's read-only so we shouldn't get racing faults... right?

You mean, ones that populate anon folios.

Well, there is long-term pinning that can break COW and other weird 
stuff like FOLL_FORCE. Most of the latter probably holds the mmap lock 
in write mode. Probably.

> 
> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> rule out the _usual_ cases.

Yeah, something to evaluate.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 22:47           ` David Hildenbrand
@ 2025-07-25  7:41             ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-25  7:41 UTC (permalink / raw)
  To: Kees Cook
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On 25.07.25 00:47, David Hildenbrand wrote:
> On 25.07.25 00:29, Kees Cook wrote:
>> On Thu, Jul 24, 2025 at 11:41:04PM +0200, David Hildenbrand wrote:
>>> On 24.07.25 23:32, David Hildenbrand wrote:
>>>>> As an aside, why should discard work in this case even without step 4?
>>>>> Wouldn't setting "read-only" imply you don't want the memory to change
>>>>> out from under you? I guess I'm not clear on the semantics: how do memory
>>>>> protection bits map to madvise actions like this?
>>>>
>>>> They generally don't affect MADV_DONTNEED behavior. The only documented
>>>> (man page) reason for EPERM in the man page is related to MADV_HWPOISON.
>>>>
>>>
>>> (Exception: MADV_POPULATE_READ/MADV_POPULATE_WRITE requires corresponding
>>> permissions)
>>
>> Shouldn't an MADV action that changes memory contents require the W bit
>> though?
> 

Pondering about this some more, at least MADV_DONTNEED is mostly a cheaper
way of doing mmap(MAP_FIXED): in other words, zap everything but leave the
original mapping unchanged.

So if you allow for mmap(MAP_FIXED) -- ignoring any permission bits, of
course -- there's nothing really wrong with allowing MADV_DONTNEED.
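
As a rough illustration (addr/len being some existing private anonymous
region, and the replacement using the same protection bits), both of these
leave you with zero-fill-on-demand pages over the range, the first just
being much cheaper:

	/* variant 1: zap the range, keep the VMA */
	madvise(addr, len, MADV_DONTNEED);

	/* variant 2: replace the range with a fresh mapping */
	mmap(addr, len, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);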

With mseal(), it got all weird I am afraid, because we have this 
exception list, and apparently, it has holes. :(

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25  7:38       ` David Hildenbrand
@ 2025-07-25  8:53         ` Lorenzo Stoakes
  2025-07-25  9:46           ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25  8:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On Fri, Jul 25, 2025 at 09:38:14AM +0200, David Hildenbrand wrote:
> > >> ... assuming we have:
> > >
> > > 1. A MAP_PRIVATE R/O file-backed mapping.
> > > 2. The mapping is mseal()'d.
> > >
> > > We only really have anon folios in there with things like (a) uprobe (b)
> > > debugger access (c) similarly weird FOLL_FORCE stuff.
> > >
> > > Now, most executables/libraries are mapped that way. If someone would rely
> > > on MADV_DONTNEED to zap pages in there (to free up memory), that would get
> > > rejected.
> >
> > Right, yes.
> >
> > This is odd behaviour to me. But I guess this is what Jeff meant by 'detecting
> > this' in android.
>
> It's rather weird usage of MADV_DONTNEED, but maybe, for some R/O buffers
> ...

Yeah, curious if anybody is actually doing this.

> > >
> > > Checking for anon_vma in addition, ad mentioned in the other thread, would
> > > be a "cheap" check to rule out that there are currently anon vmas in there.
> > >
> > > Well, not 100% reliable, because MADV_DONTNEED can race with page faults ...
> >
> > But hang on, it's read-only so we shouldn't get racing faults... right?
>
> You mean, ones that populate anon folios.

Right, but these are the only ones we care about, right? File-backed
mappings won't change vma->anon_vma.

Changes to that field from NULL use mmap read lock and... page_table_lock :P
fun.

>
> Well, there is long-term pinning that can break COW and other weird stuff
> like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
> mode. Probably.

Well GUP uses read lock.

FOLL_FORCE won't override anything as we have this check in check_vma_flags():

	if (write) {
		if (!vma_anon &&
		    !writable_file_mapping_allowed(vma, gup_flags))
			return -EFAULT;

		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/*
			 * We used to let the write,force case do COW in a
			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
			 * set a breakpoint in a read-only mapping of an
			 * executable, without corrupting the file (yet only
			 * when that file had been opened for writing!).
			 * Anon pages in shared mappings are surprising: now
			 * just reject it.
			 */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}

With:

static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
mappings.

Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
FOLL_WRITE and !VM_WRITE situation.

>
> >
> > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > rule out the _usual_ cases.
>
> Yeah, something to evaluate.

I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
!(vma->vm_flags & VM_MAYWRITE).
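
Something like this, roughly (a sketch only: simplified signature, naming
illustrative, and the real helper in 2/5 also takes the madvise behaviour
and only applies this to destructive operations):

	static bool can_madvise_modify(struct vm_area_struct *vma)
	{
		/* Only sealed VMAs are restricted at all. */
		if (!vma_is_sealed(vma))
			return true;

		/* Shared mappings are file-backed; a discard loses nothing. */
		if (vma->vm_flags & VM_SHARED)
			return true;

		/*
		 * Private mapping: allow the discard only if it provably
		 * cannot throw away CoW'd anonymous data: either no anon
		 * folios can exist yet (no anon_vma), or the mapping could
		 * never have been written to (no VM_MAYWRITE).
		 */
		if (!vma->anon_vma || !(vma->vm_flags & VM_MAYWRITE))
			return true;

		return false;
	}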

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25  8:53         ` Lorenzo Stoakes
@ 2025-07-25  9:46           ` David Hildenbrand
  2025-07-25 10:05             ` Lorenzo Stoakes
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-25  9:46 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

>>
>> Well, there is long-term pinning that can break COW and other weird stuff
>> like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
>> mode. Probably.
> 
> Well GUP uses read lock.

Right, so it can race with MADV_DONTNEED.

> 
> FOLL_FORCE won't override anything as we have this check in check_vma_flags():
> 
> 	if (write) {
> 		if (!vma_anon &&
> 		    !writable_file_mapping_allowed(vma, gup_flags))
> 			return -EFAULT;
> 
> 		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
> 			if (!(gup_flags & FOLL_FORCE))
> 				return -EFAULT;
> 			/*
> 			 * We used to let the write,force case do COW in a
> 			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
> 			 * set a breakpoint in a read-only mapping of an
> 			 * executable, without corrupting the file (yet only
> 			 * when that file had been opened for writing!).
> 			 * Anon pages in shared mappings are surprising: now
> 			 * just reject it.
> 			 */
> 			if (!is_cow_mapping(vm_flags))
> 				return -EFAULT;
> 		}
> 	}
> 
> With:
> 
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
> 

Not sure what you mean. Using FOLL_FORCE you can write into MAP_PRIVATE
R/O mappings. Particularly useful for installing breakpoints into loaded
executables etc.

is_cow_mapping() tells you exactly that: the only place where we can 
have anon folios is when we have a MAP_PRIVATE mapping (!VM_SHARED) that 
can be writable, for example, through mprotect(PROT_WRITE) (VM_MAYWRITE).

A MAP_PRIVATE R/O file mapping matches is_cow_mapping().
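
Spelled out with the flag values for that case (illustrative only):

	/*
	 * MAP_PRIVATE R/O file mapping as set up by do_mmap():
	 *   VM_SHARED   - not set (private)
	 *   VM_MAYWRITE - set (MAP_PRIVATE only requires FMODE_READ)
	 *   VM_WRITE    - not set (mapped PROT_READ)
	 */
	(flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE	/* -> true */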

> So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> mappings.
> 
> Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> FOLL_WRITE and !VM_WRITE situation.
> 
>>
>>>
>>> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
>>> rule out the _usual_ cases.
>>
>> Yeah, something to evaluate.
> 
> I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> !(vma->vm_flags & VM_MAYWRITE).

I think there are possible races, the question is how much you care 
about them.

In case of CoW-unsharing, you're not actually discarding data, because 
the page in the anon folio is to maintain a copy of the pagecache page 
(of course, they can go out of sync, but that's a different discussion).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25  9:46           ` David Hildenbrand
@ 2025-07-25 10:05             ` Lorenzo Stoakes
  2025-07-25 10:10               ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 10:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On Fri, Jul 25, 2025 at 11:46:13AM +0200, David Hildenbrand wrote:
> > >
> > > Well, there is long-term pinning that can break COW and other weird stuff
> > > like FOLL_FORCE. Most of the latter probably holds the mmap lock in write
> > > mode. Probably.
> >
> > Well GUP uses read lock.
>
> Right, so it can race with MADV_DONTNEED.
>
> >
> > FOLL_FORCE won't override anything as we have this check in check_vma_flags():
> >
> > 	if (write) {
> > 		if (!vma_anon &&
> > 		    !writable_file_mapping_allowed(vma, gup_flags))
> > 			return -EFAULT;
> >
> > 		if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
> > 			if (!(gup_flags & FOLL_FORCE))
> > 				return -EFAULT;
> > 			/*
> > 			 * We used to let the write,force case do COW in a
> > 			 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
> > 			 * set a breakpoint in a read-only mapping of an
> > 			 * executable, without corrupting the file (yet only
> > 			 * when that file had been opened for writing!).
> > 			 * Anon pages in shared mappings are surprising: now
> > 			 * just reject it.
> > 			 */
> > 			if (!is_cow_mapping(vm_flags))
> > 				return -EFAULT;
> > 		}
> > 	}
> >
> > With:
> >
> > static inline bool is_cow_mapping(vm_flags_t flags)
> > {
> > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > }
> >
>
> Not sure what you mean. Using FOLL_FORCE you can write into MAP_PRIVATE R/O
> mappings. Particular useful for installing breakpoints into loaded
> executables etc.

Sigh. Really sorry, I'm having a terrible week, my brain just isn't working at
the moment.

Yes it proves the opposite of what I said, I misread it foolishly.

Please disregard.

>
> is_cow_mapping() tells you exactly that: the only place where we can have
> anon folios is when we have a MAP_PRIVATE mapping (!VM_SHARED) that can be
> writable, for example, through mprotect(PROT_WRITE) (VM_MAYWRITE).
>
> A MAP_PRIVATE R/O file mapping matches is_cow_mapping().

Yup.

>
> > So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> > mappings.
> >
> > Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> > FOLL_WRITE and !VM_WRITE situation.
> >
> > >
> > > >
> > > > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > > > rule out the _usual_ cases.
> > >
> > > Yeah, something to evaluate.
> >
> > I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> > !(vma->vm_flags & VM_MAYWRITE).
>
> I think there are possible races, the question is how much you care about
> them.

Yes I was just wrong. Please just disregard.

I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
so what if you did, it's writing a... copy of the underlying file-backed folios
no?

Equally long-term GUP, assuming it breaks CoW for migration, is what, populating
unchanged folios, so what's the issue?


>
> In case of CoW-unsharing, you're not actually discarding data, because the
> page in the anon folio is to maintain a copy of the pagecache page (of
> course, they can go out of sync, but that's a different discussion).

Yes I know.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25 10:05             ` Lorenzo Stoakes
@ 2025-07-25 10:10               ` David Hildenbrand
  2025-07-25 10:17                 ` Lorenzo Stoakes
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-25 10:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

>>> So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
>>> mappings.
>>>
>>> Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
>>> FOLL_WRITE and !VM_WRITE situation.
>>>
>>>>
>>>>>
>>>>> Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
>>>>> rule out the _usual_ cases.
>>>>
>>>> Yeah, something to evaluate.
>>>
>>> I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
>>> !(vma->vm_flags & VM_MAYWRITE).
>>
>> I think there are possible races, the question is how much you care about
>> them.
> 
> Yes I was just wrong. Please just disregard.
> 
> I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
> so what if you did, it's writing a... copy of the underlying file-backed folios
> no?

MADV_POPULATE_WRITE does not apply. It's only racing with FOLL_FORCE, 
like debugger access.

Again, a race one probably shouldn't worry about in the context of mseal.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-25 10:10               ` David Hildenbrand
@ 2025-07-25 10:17                 ` Lorenzo Stoakes
  0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-25 10:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Jeff Xu

On Fri, Jul 25, 2025 at 12:10:07PM +0200, David Hildenbrand wrote:
> > > > So - we explicitly disallow FOLL_FORCE write override for CoW file-backed
> > > > mappings.
> > > >
> > > > Obviously if FOLL_FORCE is not set, then we're ALSO not allowed to get past a
> > > > FOLL_WRITE and !VM_WRITE situation.
> > > >
> > > > >
> > > > > >
> > > > > > Hmm maybe I'll soften on this anon_vma idea then. Maybe it is a 'cheap fix' to
> > > > > > rule out the _usual_ cases.
> > > > >
> > > > > Yeah, something to evaluate.
> > > >
> > > > I'm thinking more and more we're probably actually safe with !vma->anon_vma ||
> > > > !(vma->vm_flags & VM_MAYWRITE).
> > >
> > > I think there are possible races, the question is how much you care about
> > > them.
> >
> > Yes I was just wrong. Please just disregard.
> >
> > I mean racing with MADV_POPULATE_WRITE seems a niche thing to worry about, and
> > so what if you did, it's writing a... copy of the underlying file-backed folios
> > no?
>
> MADV_POPULATE_WRITE does not apply. It's only racing with FOLL_FORCE, like
> debugger access.

Yeah right, that too. OK, so we might have a breakpoint set or some data
changed, then discard; I don't think this is _too_ problematic, right? The
debugger is naturally racy anyway.

I think we can accept these.

>
> Again, a race one probably shouldn't worry about in the context of mseal.

Yeah agreed, so I think this is still sensible.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:15   ` Kees Cook
  2025-07-24 21:32     ` David Hildenbrand
  2025-07-25  5:49     ` Lorenzo Stoakes
@ 2025-07-25 16:21     ` Jeff Xu
  2 siblings, 0 replies; 38+ messages in thread
From: Jeff Xu @ 2025-07-25 16:21 UTC (permalink / raw)
  To: Kees Cook
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Pedro Falcato,
	linux-mm, linux-kernel

On Thu, Jul 24, 2025 at 2:15 PM Kees Cook <kees@kernel.org> wrote:
>
> On Wed, Jul 16, 2025 at 06:38:03PM +0100, Lorenzo Stoakes wrote:
> > We make a change to the logic here to correct a mistake - we must disallow
> > discard of read-only MAP_PRIVATE file-backed mappings, which previously we
> > were not.
> > The justification for this change is to account for the case where:
> >
> > 1. A MAP_PRIVATE R/W file-backed mapping is established.
> > 2. The mapping is written to, which backs it with anonymous memory.
> > 3. The mapping is mprotect()'d read-only.
> > 4. The mapping is mseal()'d.
> >
> > If we were to now allow discard of this data, it would mean mseal() would
> > not prevent the unrecoverable discarding of data and it was thus violate
> > the semantics of sealed VMAs.
>
> I want to make sure I'm understanding this right:
>
> Was the old behavior to allow discard? (If so, that seems like it wasn't
> doing what Linus asked for[1], but it's not clear to me if that was
> the behavior Chrome wanted.)
The Chrome V8 JIT engine only cares about anonymous mappings, not
file-backed ones. I'm not sure about all of Chrome though.

> The test doesn't appear to validate which
> contents end up being visible after the discard, only whether or not
> madvise() succeeds.
>
Agreed, the test can be improved to validate the read-back after discard.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/5] mm/mseal: update madvise() logic
  2025-07-24 21:53       ` David Hildenbrand
  2025-07-25  6:17         ` Lorenzo Stoakes
@ 2025-07-25 16:22         ` Jeff Xu
  1 sibling, 0 replies; 38+ messages in thread
From: Jeff Xu @ 2025-07-25 16:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Kees Cook,
	linux-hardening

On Thu, Jul 24, 2025 at 2:53 PM David Hildenbrand <david@redhat.com> wrote:
>
> >
> >> 4. We could ask applications to switch to non-destructive madvise,
> >> like MADV_COLD or MADV_PAGEOUT. Or, another option is that we could
> >> switch the kernel to use non-destructive madvise implicitly for
> >> destructive madvise in suitable situations.
> >
> > Umm what? I don't understand your point.
> >
> >> 5. We could investigate more based on vma->anon_vma
> >
> > I have no idea what you mean by this. I am an rmap maintainer and have
> > worked extensively with anon_vma, what's the point exactly?
>
> I think, the idea would be to add an additional anon_vma check: so if
> you have a MAP_PRIVATE file mapping, you could still allow for
> MADV_DONTNEED if you are sure that there are no anon folios in there.
>
Yes. That is the theory, thanks for clarifying it for me.
This is exactly what I was trying to say in my previous response:

"We could try having destructive madvise check
vma->anon_vma and reject the call if it's true. I haven't had a chance
to test this theory yet, though."


> If there is an anon_vma, the only way to find out is actually looking at
> the page tables.
>
> To be completely precise, one would have to enlighten the zap logic to
> refuse to zap if there is any anon folio there, and bail out.
>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-07-25 16:22 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-16 17:38 [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Lorenzo Stoakes
2025-07-16 17:38 ` [PATCH v3 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
2025-07-24 18:34   ` Jeff Xu
2025-07-24 18:44     ` Lorenzo Stoakes
2025-07-16 17:38 ` [PATCH v3 2/5] mm/mseal: update madvise() logic Lorenzo Stoakes
2025-07-24 18:39   ` Jeff Xu
2025-07-24 18:56     ` David Hildenbrand
2025-07-24 22:18       ` David Hildenbrand
2025-07-24 19:07     ` Lorenzo Stoakes
2025-07-24 21:53       ` David Hildenbrand
2025-07-25  6:17         ` Lorenzo Stoakes
2025-07-25 16:22         ` Jeff Xu
2025-07-24 21:15   ` Kees Cook
2025-07-24 21:32     ` David Hildenbrand
2025-07-24 21:41       ` David Hildenbrand
2025-07-24 22:29         ` Kees Cook
2025-07-24 22:47           ` David Hildenbrand
2025-07-25  7:41             ` David Hildenbrand
2025-07-25  5:49     ` Lorenzo Stoakes
2025-07-25 16:21     ` Jeff Xu
2025-07-24 22:12   ` David Hildenbrand
2025-07-25  7:01     ` Lorenzo Stoakes
2025-07-25  7:38       ` David Hildenbrand
2025-07-25  8:53         ` Lorenzo Stoakes
2025-07-25  9:46           ` David Hildenbrand
2025-07-25 10:05             ` Lorenzo Stoakes
2025-07-25 10:10               ` David Hildenbrand
2025-07-25 10:17                 ` Lorenzo Stoakes
2025-07-16 17:38 ` [PATCH v3 3/5] mm/mseal: small cleanups Lorenzo Stoakes
2025-07-24 18:40   ` Jeff Xu
2025-07-16 17:38 ` [PATCH v3 4/5] mm/mseal: Simplify and rename VMA gap check Lorenzo Stoakes
2025-07-24 18:40   ` Jeff Xu
2025-07-25  5:33     ` Lorenzo Stoakes
2025-07-16 17:38 ` [PATCH v3 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
2025-07-24 18:41   ` Jeff Xu
2025-07-24 18:32 ` [PATCH v3 0/5] mseal cleanups, fixup MAP_PRIVATE file-backed case Jeff Xu
2025-07-24 19:10   ` Lorenzo Stoakes
2025-07-25  6:40 ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).