* [PATCH 0/5] mseal cleanups
@ 2025-07-14 13:00 Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
` (4 more replies)
0 siblings, 5 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is
treated differently from every other VMA flag, it really doesn't make sense
to do this and complicates logic, so we start by making this consistent
with everything else.
Next we place the madvise logic where it belongs - in mm/madvise.c. It
really makes no sense to abstract this elsewhere. In doing so, we go to
great lengths to explain very clearly the previously very confusing logic
as to what sealed mappings are impacted here.
We abstract out and explain the 'are there are any gaps in this range in
the mm?' check being performed as a prerequisite to mseal being performed,
and finally we simplify the actual mseal logic which is really quite
straightforward.
Lorenzo Stoakes (5):
mm/mseal: always define VM_SEALED
mm/mseal: move madvise() logic to mm/madvise.c
mm/mseal: small cleanups
mm/mseal: separate out and simplify VMA gap check
mm/mseal: rework mseal apply logic
include/linux/mm.h | 6 +-
mm/madvise.c | 62 +++++++++++-
mm/mseal.c | 161 +++++--------------------------
mm/vma.c | 18 ++++
mm/vma.h | 26 +----
tools/testing/vma/vma_internal.h | 6 +-
6 files changed, 116 insertions(+), 163 deletions(-)
--
2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 1/5] mm/mseal: always define VM_SEALED
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
@ 2025-07-14 13:00 ` Lorenzo Stoakes
2025-07-14 14:31 ` David Hildenbrand
` (2 more replies)
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
` (3 subsequent siblings)
4 siblings, 3 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
There is no reason to treat VM_SEALED in a special way, in each other case
in which a VMA flag is unavailable due to configuration, we simply assign
that flag to VM_NONE, make VM_SEALED consistent with all other VMA flags in
this respect.
Additionally, use the next available bit for VM_SEALED, 42, rather than
arbitrarily putting it at 63 and update the declaration to match all other
VMA flags.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/mm.h | 6 ++++--
tools/testing/vma/vma_internal.h | 6 ++++--
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e5459d43267..aba67c3df42b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -414,8 +414,10 @@ extern unsigned int kobjsize(const void *objp);
#endif
#ifdef CONFIG_64BIT
-/* VM is sealed, in vm_flags */
-#define VM_SEALED _BITUL(63)
+#define VM_SEALED_BIT 42
+#define VM_SEALED BIT(VM_SEALED_BIT)
+#else
+#define VM_SEALED VM_NONE
#endif
/* Bits set in the VMA until the stack is in its final location */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 991022e9e0d3..0fe52fd6782b 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -108,8 +108,10 @@ extern unsigned long dac_mmap_min_addr;
#define CAP_IPC_LOCK 14
#ifdef CONFIG_64BIT
-/* VM is sealed, in vm_flags */
-#define VM_SEALED _BITUL(63)
+#define VM_SEALED_BIT 42
+#define VM_SEALED BIT(VM_SEALED_BIT)
+#else
+#define VM_SEALED VM_NONE
#endif
#define FIRST_USER_ADDRESS 0UL
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
@ 2025-07-14 13:00 ` Lorenzo Stoakes
2025-07-14 14:37 ` David Hildenbrand
` (2 more replies)
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
` (2 subsequent siblings)
4 siblings, 3 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
The madvise() logic is inexplicably performed in mm/mseal.c - this ought to
be located in mm/madvise.c.
Additionally can_modify_vma_madv() is inconsistently named and, in
combination with is_ro_anon(), is very confusing logic.
Put a static function in mm/madvise.c instead - can_madvise_modify() - that
spells out exactly what's happening. Also explicitly check for an anon VMA.
Additionally add commentary to explain what's going on.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++-
mm/mseal.c | 49 -----------------------------------------
mm/vma.h | 7 ------
3 files changed, 61 insertions(+), 57 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9de9b7c797c6..75757ba418a8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
#include <linux/string.h>
#include <linux/uio.h>
#include <linux/ksm.h>
@@ -1256,6 +1257,65 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
&guard_remove_walk_ops, NULL);
}
+#ifdef CONFIG_64BIT
+/* Does the madvise operation result in discarding of mapped data? */
+static bool is_discard(int behavior)
+{
+ switch (behavior) {
+ case MADV_FREE:
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
+ case MADV_REMOVE:
+ case MADV_DONTFORK:
+ case MADV_WIPEONFORK:
+ case MADV_GUARD_INSTALL:
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
+ * circumstances - discarding of data from read-only anonymous SEALED mappings.
+ *
+ * This is because users cannot trivally discard data from these VMAs, and may
+ * only do so via an appropriate madvise() call.
+ */
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+ struct vm_area_struct *vma = madv_behavior->vma;
+
+ /* If the operation won't discard, we're good. */
+ if (!is_discard(madv_behavior->behavior))
+ return true;
+
+ /* If the VMA isn't sealed we're also good. */
+ if (can_modify_vma(vma))
+ return true;
+
+ /* File-backed mappings don't lose data on discard. */
+ if (!vma_is_anonymous(vma))
+ return true;
+
+ /*
+ * If the VMA is writable and the architecture permits write access,
+ * then discarding is fine.
+ */
+ if ((vma->vm_flags & VM_WRITE) &&
+ arch_vma_access_permitted(vma, /* write= */ true,
+ /* execute= */ false, /* foreign= */ false))
+ return true;
+
+ return false;
+}
+#else
+static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
+{
+ return true;
+}
+#endif
+
/*
* Apply an madvise behavior to a region of a vma. madvise_update_vma
* will handle splitting a vm area into separate areas, each area with its own
@@ -1269,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
struct madvise_behavior_range *range = &madv_behavior->range;
int error;
- if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
+ if (unlikely(!can_madvise_modify(madv_behavior)))
return -EPERM;
switch (behavior) {
diff --git a/mm/mseal.c b/mm/mseal.c
index c27197ac04e8..1308e88ab184 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -11,7 +11,6 @@
#include <linux/mman.h>
#include <linux/mm.h>
#include <linux/mm_inline.h>
-#include <linux/mmu_context.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include "internal.h"
@@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
vm_flags_set(vma, VM_SEALED);
}
-static bool is_madv_discard(int behavior)
-{
- switch (behavior) {
- case MADV_FREE:
- case MADV_DONTNEED:
- case MADV_DONTNEED_LOCKED:
- case MADV_REMOVE:
- case MADV_DONTFORK:
- case MADV_WIPEONFORK:
- case MADV_GUARD_INSTALL:
- return true;
- }
-
- return false;
-}
-
-static bool is_ro_anon(struct vm_area_struct *vma)
-{
- /* check anonymous mapping. */
- if (vma->vm_file || vma->vm_flags & VM_SHARED)
- return false;
-
- /*
- * check for non-writable:
- * PROT=RO or PKRU is not writeable.
- */
- if (!(vma->vm_flags & VM_WRITE) ||
- !arch_vma_access_permitted(vma, true, false, false))
- return true;
-
- return false;
-}
-
-/*
- * Check if a vma is allowed to be modified by madvise.
- */
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
- if (!is_madv_discard(behavior))
- return true;
-
- if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
- return false;
-
- /* Allow by default. */
- return true;
-}
-
static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
unsigned long end, vm_flags_t newflags)
diff --git a/mm/vma.h b/mm/vma.h
index cf6e3a6371b6..6515045ba342 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -578,8 +578,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
return true;
}
-bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
-
#else
static inline bool can_modify_vma(struct vm_area_struct *vma)
@@ -587,11 +585,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
return true;
}
-static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
-{
- return true;
-}
-
#endif
#if defined(CONFIG_STACK_GROWSUP)
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 3/5] mm/mseal: small cleanups
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
@ 2025-07-14 13:00 ` Lorenzo Stoakes
2025-07-14 14:39 ` David Hildenbrand
` (2 more replies)
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
4 siblings, 3 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
Drop the wholly unnecessary set_vma_sealed() and vma_is_sealed() helpers
which are used only once, and place VMA_ITERATOR() declarations in the
correct place.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mseal.c | 9 +--------
mm/vma.h | 16 ++--------------
2 files changed, 3 insertions(+), 22 deletions(-)
diff --git a/mm/mseal.c b/mm/mseal.c
index 1308e88ab184..adbcc65e9660 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -15,11 +15,6 @@
#include <linux/sched.h>
#include "internal.h"
-static inline void set_vma_sealed(struct vm_area_struct *vma)
-{
- vm_flags_set(vma, VM_SEALED);
-}
-
static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vm_area_struct **prev, unsigned long start,
unsigned long end, vm_flags_t newflags)
@@ -36,7 +31,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto out;
}
- set_vma_sealed(vma);
+ vm_flags_set(vma, VM_SEALED);
out:
*prev = vma;
return ret;
@@ -53,7 +48,6 @@ static int check_mm_seal(unsigned long start, unsigned long end)
{
struct vm_area_struct *vma;
unsigned long nstart = start;
-
VMA_ITERATOR(vmi, current->mm, start);
/* going through each vma to check. */
@@ -78,7 +72,6 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
{
unsigned long nstart;
struct vm_area_struct *vma, *prev;
-
VMA_ITERATOR(vmi, current->mm, start);
vma = vma_iter_load(&vmi);
diff --git a/mm/vma.h b/mm/vma.h
index 6515045ba342..d17f560cf53d 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -560,31 +560,19 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
}
#ifdef CONFIG_64BIT
-
-static inline bool vma_is_sealed(struct vm_area_struct *vma)
-{
- return (vma->vm_flags & VM_SEALED);
-}
-
-/*
- * check if a vma is sealed for modification.
- * return true, if modification is allowed.
- */
+/* Check if a vma is sealed for modification. */
static inline bool can_modify_vma(struct vm_area_struct *vma)
{
- if (unlikely(vma_is_sealed(vma)))
+ if (unlikely(vma->vm_flags & VM_SEALED))
return false;
return true;
}
-
#else
-
static inline bool can_modify_vma(struct vm_area_struct *vma)
{
return true;
}
-
#endif
#if defined(CONFIG_STACK_GROWSUP)
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
` (2 preceding siblings ...)
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
@ 2025-07-14 13:00 ` Lorenzo Stoakes
2025-07-14 14:41 ` David Hildenbrand
` (2 more replies)
2025-07-14 13:00 ` [PATCH 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
4 siblings, 3 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
The check_mm_seal() function is doing something general - checking whether
a range contains only VMAs (or rather that it does NOT contain any unmapped
regions).
Generalise this and put the logic in mm/vma.c - introducing
range_contains_unmapped(). Additionally we can simplify the logic, we are
simply checking whether the last vma->vm_end has either a VMA starting
after it or ends before the end parameter.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mseal.c | 38 +++-----------------------------------
mm/vma.c | 18 ++++++++++++++++++
mm/vma.h | 3 +++
3 files changed, 24 insertions(+), 35 deletions(-)
diff --git a/mm/mseal.c b/mm/mseal.c
index adbcc65e9660..8e4c605af700 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -37,34 +37,6 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
return ret;
}
-/*
- * Check for do_mseal:
- * 1> start is part of a valid vma.
- * 2> end is part of a valid vma.
- * 3> No gap (unallocated address) between start and end.
- * 4> map is sealable.
- */
-static int check_mm_seal(unsigned long start, unsigned long end)
-{
- struct vm_area_struct *vma;
- unsigned long nstart = start;
- VMA_ITERATOR(vmi, current->mm, start);
-
- /* going through each vma to check. */
- for_each_vma_range(vmi, vma, end) {
- if (vma->vm_start > nstart)
- /* unallocated memory found. */
- return -ENOMEM;
-
- if (vma->vm_end >= end)
- return 0;
-
- nstart = vma->vm_end;
- }
-
- return -ENOMEM;
-}
-
/*
* Apply sealing.
*/
@@ -184,14 +156,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
if (mmap_write_lock_killable(mm))
return -EINTR;
- /*
- * First pass, this helps to avoid
- * partial sealing in case of error in input address range,
- * e.g. ENOMEM error.
- */
- ret = check_mm_seal(start, end);
- if (ret)
+ if (range_contains_unmapped(mm, start, end)) {
+ ret = -ENOMEM;
goto out;
+ }
/*
* Second pass, this should success, unless there are errors
diff --git a/mm/vma.c b/mm/vma.c
index b3d880652359..b57545568ae6 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -3203,3 +3203,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
return 0;
}
+
+/* Does the [start, end) range contain any unmapped memory? */
+bool range_contains_unmapped(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct vm_area_struct *vma;
+ unsigned long prev_end = start;
+ VMA_ITERATOR(vmi, current->mm, start);
+
+ for_each_vma_range(vmi, vma, end) {
+ if (vma->vm_start > prev_end)
+ return true;
+
+ prev_end = vma->vm_end;
+ }
+
+ return prev_end < end;
+}
diff --git a/mm/vma.h b/mm/vma.h
index d17f560cf53d..bfe9a04e6018 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -598,4 +598,7 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
#endif
+bool range_contains_unmapped(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+
#endif /* __MM_VMA_H */
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 5/5] mm/mseal: rework mseal apply logic
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
` (3 preceding siblings ...)
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
@ 2025-07-14 13:00 ` Lorenzo Stoakes
2025-07-14 15:26 ` Pedro Falcato
4 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 13:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
The logic can be simplified - firstly by renaming the inconsistently named
apply_mm_seal() to mseal_apply().
We then wrap mseal_fixup() into the main loop as the logic is simple enough
to not require it, equally it isn't a hugely pleasant pattern in mprotect()
etc. so it's not something we want to perpetuate.
We remove some redundant comments, and then avoid the entirely unnecessary
and slightly bizarre invocation of vma_iter_end() on each loop - really
what we want, given we have asserted there are no gaps in the range - is to
handle start, end being offset into a VMAs. This is easily handled with
MIN()/MAX().
There's no need to have an 'out' label block since on vma_modify_flags()
error we abort anyway.
And by refactoring like this we avoid the rather horrid 'pass pointer to
prev around' pattern used in mprotect() et al.
No functional change intended.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mseal.c | 69 +++++++++++++++++-------------------------------------
1 file changed, 22 insertions(+), 47 deletions(-)
diff --git a/mm/mseal.c b/mm/mseal.c
index 8e4c605af700..cf28efbac371 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -15,60 +15,35 @@
#include <linux/sched.h>
#include "internal.h"
-static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct vm_area_struct **prev, unsigned long start,
- unsigned long end, vm_flags_t newflags)
+static int mseal_apply(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
- int ret = 0;
- vm_flags_t oldflags = vma->vm_flags;
-
- if (newflags == oldflags)
- goto out;
-
- vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags);
- if (IS_ERR(vma)) {
- ret = PTR_ERR(vma);
- goto out;
- }
-
- vm_flags_set(vma, VM_SEALED);
-out:
- *prev = vma;
- return ret;
-}
-
-/*
- * Apply sealing.
- */
-static int apply_mm_seal(unsigned long start, unsigned long end)
-{
- unsigned long nstart;
struct vm_area_struct *vma, *prev;
- VMA_ITERATOR(vmi, current->mm, start);
+ VMA_ITERATOR(vmi, mm, start);
+ /* We know there are no gaps so this will be non-NULL. */
vma = vma_iter_load(&vmi);
- /*
- * Note: check_mm_seal should already checked ENOMEM case.
- * so vma should not be null, same for the other ENOMEM cases.
- */
prev = vma_prev(&vmi);
if (start > vma->vm_start)
prev = vma;
- nstart = start;
for_each_vma_range(vmi, vma, end) {
- int error;
- unsigned long tmp;
- vm_flags_t newflags;
-
- newflags = vma->vm_flags | VM_SEALED;
- tmp = vma->vm_end;
- if (tmp > end)
- tmp = end;
- error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
- if (error)
- return error;
- nstart = vma_iter_end(&vmi);
+ unsigned long curr_start, curr_end;
+
+ if (vma->vm_flags & VM_SEALED) {
+ prev = vma;
+ continue;
+ }
+ curr_start = MAX(start, vma->vm_start);
+ curr_end = MIN(vma->vm_end, end);
+
+ vma = vma_modify_flags(&vmi, prev, vma, curr_start, curr_end,
+ vma->vm_flags | VM_SEALED);
+ if (IS_ERR(vma))
+ return PTR_ERR(vma);
+ vm_flags_set(vma, VM_SEALED);
+
+ prev = vma;
}
return 0;
@@ -167,10 +142,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
* reaching the max supported VMAs, however, those cases shall
* be rare.
*/
- ret = apply_mm_seal(start, end);
+ ret = mseal_apply(mm, start, end);
out:
- mmap_write_unlock(current->mm);
+ mmap_write_unlock(mm);
return ret;
}
--
2.50.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 1/5] mm/mseal: always define VM_SEALED
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
@ 2025-07-14 14:31 ` David Hildenbrand
2025-07-14 15:20 ` Pedro Falcato
2025-07-14 15:32 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:31 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
linux-mm, linux-kernel, Jeff Xu
On 14.07.25 15:00, Lorenzo Stoakes wrote:
> There is no reason to treat VM_SEALED in a special way, in each other case
> in which a VMA flag is unavailable due to configuration, we simply assign
> that flag to VM_NONE, make VM_SEALED consistent with all other VMA flags in
> this respect.
>
> Additionally, use the next available bit for VM_SEALED, 42, rather than
> arbitrarily putting it at 63 and update the declaration to match all other
> VMA flags.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
@ 2025-07-14 14:37 ` David Hildenbrand
2025-07-14 14:56 ` Lorenzo Stoakes
2025-07-14 15:18 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:37 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
linux-mm, linux-kernel, Jeff Xu
On 14.07.25 15:00, Lorenzo Stoakes wrote:
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought to
> be located in mm/madvise.c.
>
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() - that
> spells out exactly what's happening. Also explicitly check for an anon VMA.
>
> Additionally add commentary to explain what's going on.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> mm/mseal.c | 49 -----------------------------------------
> mm/vma.h | 7 ------
> 3 files changed, 61 insertions(+), 57 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9de9b7c797c6..75757ba418a8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
> #include <linux/sched.h>
> #include <linux/sched/mm.h>
> #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
> #include <linux/string.h>
> #include <linux/uio.h>
> #include <linux/ksm.h>
> @@ -1256,6 +1257,65 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> &guard_remove_walk_ops, NULL);
> }
>
> +#ifdef CONFIG_64BIT
It's consistent with mm/Makefile etc. but having a simple
config MSEAL
def_bool y if 64BIT
or sth like that would surely clean that up further.
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> + switch (behavior) {
> + case MADV_FREE:
> + case MADV_DONTNEED:
> + case MADV_DONTNEED_LOCKED:
> + case MADV_REMOVE:
> + case MADV_DONTFORK:
> + case MADV_WIPEONFORK:
> + case MADV_GUARD_INSTALL:
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivally discard data from these VMAs, and may
s/trivally/trivially/
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> + struct vm_area_struct *vma = madv_behavior->vma;
> +
> + /* If the operation won't discard, we're good. */
> + if (!is_discard(madv_behavior->behavior))
> + return true;
Conceptually, I would do this first and then handle all the discard
cases / exceptions.
> +
> + /* If the VMA isn't sealed we're also good. */
> + if (can_modify_vma(vma))
> + return true;
> +> + /* File-backed mappings don't lose data on discard. */
> + if (!vma_is_anonymous(vma))
> + return true;
> +
> + /*
> + * If the VMA is writable and the architecture permits write access,
> + * then discarding is fine.
> + */
> + if ((vma->vm_flags & VM_WRITE) &&
> + arch_vma_access_permitted(vma, /* write= */ true,
> + /* execute= */ false, /* foreign= */ false))
> + return true;
> +
> + return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> + return true;
> +}
> +#endif
> +
> /*
LGTM
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/5] mm/mseal: small cleanups
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
@ 2025-07-14 14:39 ` David Hildenbrand
2025-07-14 15:23 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:39 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
linux-mm, linux-kernel, Jeff Xu
On 14.07.25 15:00, Lorenzo Stoakes wrote:
> Drop the wholly unnecessary set_vma_sealed() and vma_is_sealed() helpers
> which are used only once, and place VMA_ITERATOR() declarations in the
> correct place.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
@ 2025-07-14 14:41 ` David Hildenbrand
2025-07-14 15:17 ` Pedro Falcato
2025-07-14 15:35 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:41 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Pedro Falcato,
linux-mm, linux-kernel, Jeff Xu
On 14.07.25 15:00, Lorenzo Stoakes wrote:
> The check_mm_seal() function is doing something general - checking whether
> a range contains only VMAs (or rather that it does NOT contain any unmapped
> regions).
>
> Generalise this and put the logic in mm/vma.c - introducing
> range_contains_unmapped(). Additionally we can simplify the logic, we are
> simply checking whether the last vma->vm_end has either a VMA starting
> after it or ends before the end parameter.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 14:37 ` David Hildenbrand
@ 2025-07-14 14:56 ` Lorenzo Stoakes
2025-07-14 15:03 ` David Hildenbrand
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 14:56 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 04:37:30PM +0200, David Hildenbrand wrote:
> On 14.07.25 15:00, Lorenzo Stoakes wrote:
> > The madvise() logic is inexplicably performed in mm/mseal.c - this ought to
> > be located in mm/madvise.c.
> >
> > Additionally can_modify_vma_madv() is inconsistently named and, in
> > combination with is_ro_anon(), is very confusing logic.
> >
> > Put a static function in mm/madvise.c instead - can_madvise_modify() - that
> > spells out exactly what's happening. Also explicitly check for an anon VMA.
> >
> > Additionally add commentary to explain what's going on.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > mm/mseal.c | 49 -----------------------------------------
> > mm/vma.h | 7 ------
> > 3 files changed, 61 insertions(+), 57 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9de9b7c797c6..75757ba418a8 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,7 @@
> > #include <linux/sched.h>
> > #include <linux/sched/mm.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/mmu_context.h>
> > #include <linux/string.h>
> > #include <linux/uio.h>
> > #include <linux/ksm.h>
> > @@ -1256,6 +1257,65 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> > &guard_remove_walk_ops, NULL);
> > }
> >
> > +#ifdef CONFIG_64BIT
>
> It's consistent with mm/Makefile etc. but having a simple
>
> config MSEAL
> def_bool y if 64BIT
>
> or sth like that would surely clean that up further.
Well, I plan to make this not a thing soon so I'd rather not.
The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
preparatory work and next cycle intend to do more on this.
So I'd rather avoid any config changes on this until I've given this a shot.
>
> > +/* Does the madvise operation result in discarding of mapped data? */
> > +static bool is_discard(int behavior)
> > +{
> > + switch (behavior) {
> > + case MADV_FREE:
> > + case MADV_DONTNEED:
> > + case MADV_DONTNEED_LOCKED:
> > + case MADV_REMOVE:
> > + case MADV_DONTFORK:
> > + case MADV_WIPEONFORK:
> > + case MADV_GUARD_INSTALL:
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +/*
> > + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> > + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> > + *
> > + * This is because users cannot trivally discard data from these VMAs, and may
>
> s/trivally/trivially/
Ack thanks - Andrew can you fixup? Can send a fix-patch otherwise.
>
> > + * only do so via an appropriate madvise() call.
> > + */
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > + struct vm_area_struct *vma = madv_behavior->vma;
> > +
> > + /* If the operation won't discard, we're good. */
> > + if (!is_discard(madv_behavior->behavior))
> > + return true;
>
>
> Conceptually, I would do this first and then handle all the discard cases /
> exceptions.
Hm I'm confused :P we do do this first? I think the idea with this is we can
very cheaply ignore any MADV_ that isn't applicable.
Did you mean to put this comment under line below?
I mean it's not exactly a perf hotspot so don't mind moving them around.
>
> > +
> > + /* If the VMA isn't sealed we're also good. */
> > + if (can_modify_vma(vma))
> > + return true;
> > +> + /* File-backed mappings don't lose data on discard. */
> > + if (!vma_is_anonymous(vma))
> > + return true;
> > +
> > + /*
> > + * If the VMA is writable and the architecture permits write access,
> > + * then discarding is fine.
> > + */
> > + if ((vma->vm_flags & VM_WRITE) &&
> > + arch_vma_access_permitted(vma, /* write= */ true,
> > + /* execute= */ false, /* foreign= */ false))
> > + return true;
> > +
> > + return false;
> > +}
> > +#else
> > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > +{
> > + return true;
> > +}
> > +#endif
> > +
> > /*
>
> LGTM
Cheers!
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 14:56 ` Lorenzo Stoakes
@ 2025-07-14 15:03 ` David Hildenbrand
2025-07-14 15:18 ` Lorenzo Stoakes
2025-07-14 15:31 ` Pedro Falcato
0 siblings, 2 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:03 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
>> or sth like that would surely clean that up further.
>
> Well, I plan to make this not a thing soon so I'd rather not.
>
> The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
> preparatory work and next cycle intend to do more on this.
>
> So I'd rather avoid any config changes on this until I've given this a shot.
Sure, if that is in sight.
>>
>>> +/* Does the madvise operation result in discarding of mapped data? */
>>> +static bool is_discard(int behavior)
>>> +{
>>> + switch (behavior) {
>>> + case MADV_FREE:
>>> + case MADV_DONTNEED:
>>> + case MADV_DONTNEED_LOCKED:
>>> + case MADV_REMOVE:
>>> + case MADV_DONTFORK:
>>> + case MADV_WIPEONFORK:
>>> + case MADV_GUARD_INSTALL:
>>> + return true;
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +/*
>>> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
>>> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
>>> + *
>>> + * This is because users cannot trivally discard data from these VMAs, and may
>>
>> s/trivally/trivially/
>
> Ack thanks - Andrew can you fixup? Can send a fix-patch otherwise.
>
>>
>>> + * only do so via an appropriate madvise() call.
>>> + */
>>> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
>>> +{
>>> + struct vm_area_struct *vma = madv_behavior->vma;
>>> +
>>> + /* If the operation won't discard, we're good. */
>>> + if (!is_discard(madv_behavior->behavior))
>>> + return true;
>>
>>
>> Conceptually, I would do this first and then handle all the discard cases /
>> exceptions.
>
> Hm I'm confused :P we do do this first? I think the idea with this is we can
> very cheaply ignore any MADV_ that isn't applicable.
>
> Did you mean to put this comment under line below?
>
> I mean it's not exactly a perf hotspot so don't mind moving them around.
I was thinking of this (start with sealed, then go into details about
discards):
/* If the VMA isn't sealed, we're all good. */
if (can_modify_vma(vma))
return true;
/* In a sealed VMA, we only care about discard operations. */
if (!is_discard(madv_behavior->behavior))
return true;
/* But discards of file-backed mappings are fine. */
if (!vma_is_anonymous(vma))
return true;
...
But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE
file mapping?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
2025-07-14 14:41 ` David Hildenbrand
@ 2025-07-14 15:17 ` Pedro Falcato
2025-07-14 15:23 ` Lorenzo Stoakes
2025-07-14 15:35 ` Liam R. Howlett
2 siblings, 1 reply; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 02:00:39PM +0100, Lorenzo Stoakes wrote:
> The check_mm_seal() function is doing something general - checking whether
> a range contains only VMAs (or rather that it does NOT contain any unmapped
> regions).
>
> Generalise this and put the logic in mm/vma.c - introducing
> range_contains_unmapped(). Additionally we can simplify the logic, we are
> simply checking whether the last vma->vm_end has either a VMA starting
> after it or ends before the end parameter.
>
I don't like this. Unless you have any other user for this in mind,
we'll proliferate this awful behavior (and add this into core vma code).
I have some patches locally to fully remove this upfront check, and AFAIK
we're somewhat in agreement that we can simply nuke this check (for
various reasons, including that we *still* don't have a man page for the
syscall). I can send them for proper discussion after your series lands.
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:03 ` David Hildenbrand
@ 2025-07-14 15:18 ` Lorenzo Stoakes
2025-07-14 15:24 ` David Hildenbrand
2025-07-14 15:31 ` Pedro Falcato
1 sibling, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> > > or sth like that would surely clean that up further.
> >
> > Well, I plan to make this not a thing soon so I'd rather not.
> >
> > The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
> > preparatory work and next cycle intend to do more on this.
> >
> > So I'd rather avoid any config changes on this until I've given this a shot.
>
> Sure, if that is in sight.
Yes :)
> > > > + * only do so via an appropriate madvise() call.
> > > > + */
> > > > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > > > +{
> > > > + struct vm_area_struct *vma = madv_behavior->vma;
> > > > +
> > > > + /* If the operation won't discard, we're good. */
> > > > + if (!is_discard(madv_behavior->behavior))
> > > > + return true;
> > >
> > >
> > > Conceptually, I would do this first and then handle all the discard cases /
> > > exceptions.
> >
> > Hm I'm confused :P we do do this first? I think the idea with this is we can
> > very cheaply ignore any MADV_ that isn't applicable.
> >
> > Did you mean to put this comment under line below?
> >
> > I mean it's not exactly a perf hotspot so don't mind moving them around.
>
> I was thinking of this (start with sealed, then go into details about
> discards):
>
> /* If the VMA isn't sealed, we're all good. */
> if (can_modify_vma(vma))
> return true;
>
> /* In a sealed VMA, we only care about discard operations. */
> if (!is_discard(madv_behavior->behavior))
> return true;
>
> /* But discards of file-backed mappings are fine. */
> if (!vma_is_anonymous(vma))
> return true;
Right yeah.
>
> ...
>
>
> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> mapping?
I'm duplicating existing logic here (well updating from the vma->vm_file check
and a seemingly pointless !vma->vm_file && vma->vm_flags & VM_SHARED check), but
this is a good point...
For the purposes of the refactoring I guess best to keep the logic ostensibly
the same given the 'no functional change intended', but we do need to fix this
yes.
That change would probably be better as a follow-up with a test change added
too.
But I agree this is an oversight here afaict.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
2025-07-14 14:37 ` David Hildenbrand
@ 2025-07-14 15:18 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 02:00:37PM +0100, Lorenzo Stoakes wrote:
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought to
> be located in mm/madvise.c.
>
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() - that
> spells out exactly what's happening. Also explicitly check for an anon VMA.
>
> Additionally add commentary to explain what's going on.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/5] mm/mseal: always define VM_SEALED
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
2025-07-14 14:31 ` David Hildenbrand
@ 2025-07-14 15:20 ` Pedro Falcato
2025-07-14 15:32 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 02:00:36PM +0100, Lorenzo Stoakes wrote:
> There is no reason to treat VM_SEALED in a special way, in each other case
> in which a VMA flag is unavailable due to configuration, we simply assign
> that flag to VM_NONE, make VM_SEALED consistent with all other VMA flags in
> this respect.
>
> Additionally, use the next available bit for VM_SEALED, 42, rather than
> arbitrarily putting it at 63 and update the declaration to match all other
> VMA flags.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:17 ` Pedro Falcato
@ 2025-07-14 15:23 ` Lorenzo Stoakes
2025-07-14 15:25 ` David Hildenbrand
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:23 UTC (permalink / raw)
To: Pedro Falcato
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 04:17:23PM +0100, Pedro Falcato wrote:
> On Mon, Jul 14, 2025 at 02:00:39PM +0100, Lorenzo Stoakes wrote:
> > The check_mm_seal() function is doing something general - checking whether
> > a range contains only VMAs (or rather that it does NOT contain any unmapped
> > regions).
> >
> > Generalise this and put the logic in mm/vma.c - introducing
> > range_contains_unmapped(). Additionally we can simplify the logic, we are
> > simply checking whether the last vma->vm_end has either a VMA starting
> > after it or ends before the end parameter.
> >
>
> I don't like this. Unless you have any other user for this in mind,
> we'll proliferate this awful behavior (and add this into core vma code).
I'm not sure how putting it in an internal-only mm file perpetuates
anything.
I'm naming the function by what it does, and putting it where it belongs in
the VMA logic, and additionally making the function less horrible.
Let's not please get stuck on the isues with mseal implementation which
will catch-22 us into not being able to refactor.
We can do the refactoring first and it's fine to just yank this if it's not
used.
I'm not having a function like this sat in mm/mseal.c when it has
absolutely nothing to do with mseal specifically though.
>
> I have some patches locally to fully remove this upfront check, and AFAIK
> we're somewhat in agreement that we can simply nuke this check (for
> various reasons, including that we *still* don't have a man page for the
> syscall). I can send them for proper discussion after your series lands.
Yes I agree this check is odd, I don't really see why on earth we're
concerned with whether there are gaps or not, you'd surely want to just
seal whatever VMAs exist in the range?
But this belongs as something separate (a _functional_ change), let's get
the code sane first.
And ack on manpage, that's a horrible oversight that needs addressing!
>
> --
> Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/5] mm/mseal: small cleanups
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
2025-07-14 14:39 ` David Hildenbrand
@ 2025-07-14 15:23 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:23 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 02:00:38PM +0100, Lorenzo Stoakes wrote:
> Drop the wholly unnecessary set_vma_sealed() and vma_is_sealed() helpers
> which are used only once, and place VMA_ITERATOR() declarations in the
> correct place.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mseal.c | 9 +--------
> mm/vma.h | 16 ++--------------
> 2 files changed, 3 insertions(+), 22 deletions(-)
>
> diff --git a/mm/mseal.c b/mm/mseal.c
> index 1308e88ab184..adbcc65e9660 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -15,11 +15,6 @@
> #include <linux/sched.h>
> #include "internal.h"
>
> -static inline void set_vma_sealed(struct vm_area_struct *vma)
> -{
> - vm_flags_set(vma, VM_SEALED);
> -}
> -
> static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vm_area_struct **prev, unsigned long start,
> unsigned long end, vm_flags_t newflags)
> @@ -36,7 +31,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto out;
> }
>
> - set_vma_sealed(vma);
> + vm_flags_set(vma, VM_SEALED);
> out:
> *prev = vma;
> return ret;
> @@ -53,7 +48,6 @@ static int check_mm_seal(unsigned long start, unsigned long end)
> {
> struct vm_area_struct *vma;
> unsigned long nstart = start;
> -
> VMA_ITERATOR(vmi, current->mm, start);
>
> /* going through each vma to check. */
> @@ -78,7 +72,6 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
> {
> unsigned long nstart;
> struct vm_area_struct *vma, *prev;
> -
> VMA_ITERATOR(vmi, current->mm, start);
>
> vma = vma_iter_load(&vmi);
> diff --git a/mm/vma.h b/mm/vma.h
> index 6515045ba342..d17f560cf53d 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -560,31 +560,19 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
> }
>
> #ifdef CONFIG_64BIT
> -
> -static inline bool vma_is_sealed(struct vm_area_struct *vma)
> -{
> - return (vma->vm_flags & VM_SEALED);
> -}
I actually don't hate this helper...
> -
> -/*
> - * check if a vma is sealed for modification.
> - * return true, if modification is allowed.
> - */
> +/* Check if a vma is sealed for modification. */
> static inline bool can_modify_vma(struct vm_area_struct *vma)
As you're in the area, vma_can_modify() maybe?
In any case, LGTM.
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:18 ` Lorenzo Stoakes
@ 2025-07-14 15:24 ` David Hildenbrand
2025-07-14 15:27 ` Lorenzo Stoakes
0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
On 14.07.25 17:18, Lorenzo Stoakes wrote:
> On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
>>>> or sth like that would surely clean that up further.
>>>
>>> Well, I plan to make this not a thing soon so I'd rather not.
>>>
>>> The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
>>> preparatory work and next cycle intend to do more on this.
>>>
>>> So I'd rather avoid any config changes on this until I've given this a shot.
>>
>> Sure, if that is in sight.
>
> Yes :)
>
>>>>> + * only do so via an appropriate madvise() call.
>>>>> + */
>>>>> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
>>>>> +{
>>>>> + struct vm_area_struct *vma = madv_behavior->vma;
>>>>> +
>>>>> + /* If the operation won't discard, we're good. */
>>>>> + if (!is_discard(madv_behavior->behavior))
>>>>> + return true;
>>>>
>>>>
>>>> Conceptually, I would do this first and then handle all the discard cases /
>>>> exceptions.
>>>
>>> Hm I'm confused :P we do do this first? I think the idea with this is we can
>>> very cheaply ignore any MADV_ that isn't applicable.
>>>
>>> Did you mean to put this comment under line below?
>>>
>>> I mean it's not exactly a perf hotspot so don't mind moving them around.
>>
>> I was thinking of this (start with sealed, then go into details about
>> discards):
>>
>> /* If the VMA isn't sealed, we're all good. */
>> if (can_modify_vma(vma))
>> return true;
>>
>> /* In a sealed VMA, we only care about discard operations. */
>> if (!is_discard(madv_behavior->behavior))
>> return true;
>>
>> /* But discards of file-backed mappings are fine. */
>> if (!vma_is_anonymous(vma))
>> return true;
>
> Right yeah.
>
>>
>> ...
>>
>>
>> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
>> mapping?
>
> I'm duplicating existing logic here (well updating from the vma->vm_file check
> and a seemingly pointless !vma->vm_file && vma->vm_flags & VM_SHARED check), but
> this is a good point...
Yeah, not blaming you, just scratching my head :)
>
> For the purposes of the refactoring I guess best to keep the logic ostensibly
> the same given the 'no functional change intended', but we do need to fix this
> yes.
Likely a fix should be pulled in early? Not sure ... but it sure sounds
broken.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:23 ` Lorenzo Stoakes
@ 2025-07-14 15:25 ` David Hildenbrand
2025-07-14 15:32 ` Lorenzo Stoakes
0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:25 UTC (permalink / raw)
To: Lorenzo Stoakes, Pedro Falcato
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
linux-mm, linux-kernel, Jeff Xu
On 14.07.25 17:23, Lorenzo Stoakes wrote:
> On Mon, Jul 14, 2025 at 04:17:23PM +0100, Pedro Falcato wrote:
>> On Mon, Jul 14, 2025 at 02:00:39PM +0100, Lorenzo Stoakes wrote:
>>> The check_mm_seal() function is doing something general - checking whether
>>> a range contains only VMAs (or rather that it does NOT contain any unmapped
>>> regions).
>>>
>>> Generalise this and put the logic in mm/vma.c - introducing
>>> range_contains_unmapped(). Additionally we can simplify the logic, we are
>>> simply checking whether the last vma->vm_end has either a VMA starting
>>> after it or ends before the end parameter.
>>>
>>
>> I don't like this. Unless you have any other user for this in mind,
>> we'll proliferate this awful behavior (and add this into core vma code).
>
> I'm not sure how putting it in an internal-only mm file perpetuates
> anything.
>
> I'm naming the function by what it does, and putting it where it belongs in
> the VMA logic, and additionally making the function less horrible.
>
> Let's not please get stuck on the isues with mseal implementation which
> will catch-22 us into not being able to refactor.
>
> We can do the refactoring first and it's fine to just yank this if it's not
> used.
>
> I'm not having a function like this sat in mm/mseal.c when it has
> absolutely nothing to do with mseal specifically though.
>
>>
>> I have some patches locally to fully remove this upfront check, and AFAIK
>> we're somewhat in agreement that we can simply nuke this check (for
>> various reasons, including that we *still* don't have a man page for the
>> syscall). I can send them for proper discussion after your series lands.
>
> Yes I agree this check is odd, I don't really see why on earth we're
> concerned with whether there are gaps or not, you'd surely want to just
> seal whatever VMAs exist in the range?
Probably because GAPs cannot be sealed. So user space could assume that
in fact, nothing in that area can change after a successful mseal, while
it can ...
Not sure, though ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 5/5] mm/mseal: rework mseal apply logic
2025-07-14 13:00 ` [PATCH 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
@ 2025-07-14 15:26 ` Pedro Falcato
0 siblings, 0 replies; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:26 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 02:00:40PM +0100, Lorenzo Stoakes wrote:
> The logic can be simplified - firstly by renaming the inconsistently named
> apply_mm_seal() to mseal_apply().
>
> We then wrap mseal_fixup() into the main loop as the logic is simple enough
> to not require it, equally it isn't a hugely pleasant pattern in mprotect()
> etc. so it's not something we want to perpetuate.
>
> We remove some redundant comments, and then avoid the entirely unnecessary
> and slightly bizarre invocation of vma_iter_end() on each loop - really
> what we want, given we have asserted there are no gaps in the range - is to
> handle start, end being offset into a VMAs. This is easily handled with
> MIN()/MAX().
>
> There's no need to have an 'out' label block since on vma_modify_flags()
> error we abort anyway.
>
> And by refactoring like this we avoid the rather horrid 'pass pointer to
> prev around' pattern used in mprotect() et al.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Thanks for the cleanup all around :)
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:24 ` David Hildenbrand
@ 2025-07-14 15:27 ` Lorenzo Stoakes
2025-07-14 15:37 ` David Hildenbrand
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:27 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:24:14PM +0200, David Hildenbrand wrote:
> On 14.07.25 17:18, Lorenzo Stoakes wrote:
> > On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> > > > > or sth like that would surely clean that up further.
> > > >
> > > > Well, I plan to make this not a thing soon so I'd rather not.
> > > >
> > > > The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
> > > > preparatory work and next cycle intend to do more on this.
> > > >
> > > > So I'd rather avoid any config changes on this until I've given this a shot.
> > >
> > > Sure, if that is in sight.
> >
> > Yes :)
> >
> > > > > > + * only do so via an appropriate madvise() call.
> > > > > > + */
> > > > > > +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> > > > > > +{
> > > > > > + struct vm_area_struct *vma = madv_behavior->vma;
> > > > > > +
> > > > > > + /* If the operation won't discard, we're good. */
> > > > > > + if (!is_discard(madv_behavior->behavior))
> > > > > > + return true;
> > > > >
> > > > >
> > > > > Conceptually, I would do this first and then handle all the discard cases /
> > > > > exceptions.
> > > >
> > > > Hm I'm confused :P we do do this first? I think the idea with this is we can
> > > > very cheaply ignore any MADV_ that isn't applicable.
> > > >
> > > > Did you mean to put this comment under line below?
> > > >
> > > > I mean it's not exactly a perf hotspot so don't mind moving them around.
> > >
> > > I was thinking of this (start with sealed, then go into details about
> > > discards):
> > >
> > > /* If the VMA isn't sealed, we're all good. */
> > > if (can_modify_vma(vma))
> > > return true;
> > >
> > > /* In a sealed VMA, we only care about discard operations. */
> > > if (!is_discard(madv_behavior->behavior))
> > > return true;
> > >
> > > /* But discards of file-backed mappings are fine. */
> > > if (!vma_is_anonymous(vma))
> > > return true;
> >
> > Right yeah.
> >
> > >
> > > ...
> > >
> > >
> > > But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> > > mapping?
> >
> > I'm duplicating existing logic here (well updating from the vma->vm_file check
> > and a seemingly pointless !vma->vm_file && vma->vm_flags & VM_SHARED check), but
> > this is a good point...
>
> Yeah, not blaming you, just scratching my head :)
>
> >
> > For the purposes of the refactoring I guess best to keep the logic ostensibly
> > the same given the 'no functional change intended', but we do need to fix this
> > yes.
>
>
> Likely a fix should be pulled in early? Not sure ... but it sure sounds
> broken.
Once this lands in mm-new I can send a fix (like litearlly tomorrow if it lands
:P). I just don't think a functional change belongs as part of a refactoring.
I worry that we'll get catch-22 caught on the (numerous) problems with the mseal
implementation and thus not provide a foundation upon which to fix them...
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:03 ` David Hildenbrand
2025-07-14 15:18 ` Lorenzo Stoakes
@ 2025-07-14 15:31 ` Pedro Falcato
2025-07-14 15:37 ` Lorenzo Stoakes
2025-07-14 15:41 ` David Hildenbrand
1 sibling, 2 replies; 38+ messages in thread
From: Pedro Falcato @ 2025-07-14 15:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> [...]
>
> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> mapping?
IIRC this was originally suggested by Linus, on one of the versions introducing
mseal. But the gist is that discarding pages is okay if you could already zero
them manually, using e.g memset. Hence the writeability checks.
--
Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/5] mm/mseal: always define VM_SEALED
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
2025-07-14 14:31 ` David Hildenbrand
2025-07-14 15:20 ` Pedro Falcato
@ 2025-07-14 15:32 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> There is no reason to treat VM_SEALED in a special way, in each other case
> in which a VMA flag is unavailable due to configuration, we simply assign
> that flag to VM_NONE, make VM_SEALED consistent with all other VMA flags in
> this respect.
>
> Additionally, use the next available bit for VM_SEALED, 42, rather than
> arbitrarily putting it at 63 and update the declaration to match all other
> VMA flags.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> include/linux/mm.h | 6 ++++--
> tools/testing/vma/vma_internal.h | 6 ++++--
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2e5459d43267..aba67c3df42b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -414,8 +414,10 @@ extern unsigned int kobjsize(const void *objp);
> #endif
>
> #ifdef CONFIG_64BIT
> -/* VM is sealed, in vm_flags */
> -#define VM_SEALED _BITUL(63)
> +#define VM_SEALED_BIT 42
> +#define VM_SEALED BIT(VM_SEALED_BIT)
> +#else
> +#define VM_SEALED VM_NONE
> #endif
>
> /* Bits set in the VMA until the stack is in its final location */
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 991022e9e0d3..0fe52fd6782b 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -108,8 +108,10 @@ extern unsigned long dac_mmap_min_addr;
> #define CAP_IPC_LOCK 14
>
> #ifdef CONFIG_64BIT
> -/* VM is sealed, in vm_flags */
> -#define VM_SEALED _BITUL(63)
> +#define VM_SEALED_BIT 42
> +#define VM_SEALED BIT(VM_SEALED_BIT)
> +#else
> +#define VM_SEALED VM_NONE
> #endif
>
> #define FIRST_USER_ADDRESS 0UL
> --
> 2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:25 ` David Hildenbrand
@ 2025-07-14 15:32 ` Lorenzo Stoakes
2025-07-14 15:40 ` Liam R. Howlett
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:25:59PM +0200, David Hildenbrand wrote:
> On 14.07.25 17:23, Lorenzo Stoakes wrote:
> > On Mon, Jul 14, 2025 at 04:17:23PM +0100, Pedro Falcato wrote:
> > > On Mon, Jul 14, 2025 at 02:00:39PM +0100, Lorenzo Stoakes wrote:
> > > > The check_mm_seal() function is doing something general - checking whether
> > > > a range contains only VMAs (or rather that it does NOT contain any unmapped
> > > > regions).
> > > >
> > > > Generalise this and put the logic in mm/vma.c - introducing
> > > > range_contains_unmapped(). Additionally we can simplify the logic, we are
> > > > simply checking whether the last vma->vm_end has either a VMA starting
> > > > after it or ends before the end parameter.
> > > >
> > >
> > > I don't like this. Unless you have any other user for this in mind,
> > > we'll proliferate this awful behavior (and add this into core vma code).
> >
> > I'm not sure how putting it in an internal-only mm file perpetuates
> > anything.
> >
> > I'm naming the function by what it does, and putting it where it belongs in
> > the VMA logic, and additionally making the function less horrible.
> >
> > Let's not please get stuck on the isues with mseal implementation which
> > will catch-22 us into not being able to refactor.
> >
> > We can do the refactoring first and it's fine to just yank this if it's not
> > used.
> >
> > I'm not having a function like this sat in mm/mseal.c when it has
> > absolutely nothing to do with mseal specifically though.
> >
> > >
> > > I have some patches locally to fully remove this upfront check, and AFAIK
> > > we're somewhat in agreement that we can simply nuke this check (for
> > > various reasons, including that we *still* don't have a man page for the
> > > syscall). I can send them for proper discussion after your series lands.
> >
> > Yes I agree this check is odd, I don't really see why on earth we're
> > concerned with whether there are gaps or not, you'd surely want to just
> > seal whatever VMAs exist in the range?
>
> Probably because GAPs cannot be sealed. So user space could assume that in
> fact, nothing in that area can change after a successful mseal, while it can
> ...
>
> Not sure, though ...
Yeah maybe a sekuriteh thing where you want to be sure the range is in fact
_all_ sealed.
I'm not sure that really makes much sense in practice honestly though, because
if another thread can fiddle with things, then surely you've already 'lost'.
if you expected to touch a VMA where in fact aa gap exists your software is
_already_ broken because you're going to segfault.
So it just seems overly theoretical to me.
I think we should error out if there's _no_ VMAs at all, but otherwise succeed.
The semantics of 'all VMAs which exist in the range are sealed' would still be
maintained.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
2025-07-14 14:37 ` David Hildenbrand
2025-07-14 15:18 ` Pedro Falcato
@ 2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:33 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> The madvise() logic is inexplicably performed in mm/mseal.c - this ought to
> be located in mm/madvise.c.
>
> Additionally can_modify_vma_madv() is inconsistently named and, in
> combination with is_ro_anon(), is very confusing logic.
>
> Put a static function in mm/madvise.c instead - can_madvise_modify() - that
> spells out exactly what's happening. Also explicitly check for an anon VMA.
>
> Additionally add commentary to explain what's going on.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/madvise.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> mm/mseal.c | 49 -----------------------------------------
> mm/vma.h | 7 ------
> 3 files changed, 61 insertions(+), 57 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9de9b7c797c6..75757ba418a8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,7 @@
> #include <linux/sched.h>
> #include <linux/sched/mm.h>
> #include <linux/mm_inline.h>
> +#include <linux/mmu_context.h>
> #include <linux/string.h>
> #include <linux/uio.h>
> #include <linux/ksm.h>
> @@ -1256,6 +1257,65 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
> &guard_remove_walk_ops, NULL);
> }
>
> +#ifdef CONFIG_64BIT
> +/* Does the madvise operation result in discarding of mapped data? */
> +static bool is_discard(int behavior)
> +{
> + switch (behavior) {
> + case MADV_FREE:
> + case MADV_DONTNEED:
> + case MADV_DONTNEED_LOCKED:
> + case MADV_REMOVE:
> + case MADV_DONTFORK:
> + case MADV_WIPEONFORK:
> + case MADV_GUARD_INSTALL:
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * We are restricted from madvise()'ing mseal()'d VMAs only in very particular
> + * circumstances - discarding of data from read-only anonymous SEALED mappings.
> + *
> + * This is because users cannot trivally discard data from these VMAs, and may
> + * only do so via an appropriate madvise() call.
> + */
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> + struct vm_area_struct *vma = madv_behavior->vma;
> +
> + /* If the operation won't discard, we're good. */
> + if (!is_discard(madv_behavior->behavior))
> + return true;
> +
> + /* If the VMA isn't sealed we're also good. */
> + if (can_modify_vma(vma))
> + return true;
> +
> + /* File-backed mappings don't lose data on discard. */
> + if (!vma_is_anonymous(vma))
> + return true;
> +
> + /*
> + * If the VMA is writable and the architecture permits write access,
> + * then discarding is fine.
> + */
> + if ((vma->vm_flags & VM_WRITE) &&
> + arch_vma_access_permitted(vma, /* write= */ true,
> + /* execute= */ false, /* foreign= */ false))
> + return true;
> +
> + return false;
> +}
> +#else
> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
> +{
> + return true;
> +}
> +#endif
> +
> /*
> * Apply an madvise behavior to a region of a vma. madvise_update_vma
> * will handle splitting a vm area into separate areas, each area with its own
> @@ -1269,7 +1329,7 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
> struct madvise_behavior_range *range = &madv_behavior->range;
> int error;
>
> - if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
> + if (unlikely(!can_madvise_modify(madv_behavior)))
> return -EPERM;
>
> switch (behavior) {
> diff --git a/mm/mseal.c b/mm/mseal.c
> index c27197ac04e8..1308e88ab184 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -11,7 +11,6 @@
> #include <linux/mman.h>
> #include <linux/mm.h>
> #include <linux/mm_inline.h>
> -#include <linux/mmu_context.h>
> #include <linux/syscalls.h>
> #include <linux/sched.h>
> #include "internal.h"
> @@ -21,54 +20,6 @@ static inline void set_vma_sealed(struct vm_area_struct *vma)
> vm_flags_set(vma, VM_SEALED);
> }
>
> -static bool is_madv_discard(int behavior)
> -{
> - switch (behavior) {
> - case MADV_FREE:
> - case MADV_DONTNEED:
> - case MADV_DONTNEED_LOCKED:
> - case MADV_REMOVE:
> - case MADV_DONTFORK:
> - case MADV_WIPEONFORK:
> - case MADV_GUARD_INSTALL:
> - return true;
> - }
> -
> - return false;
> -}
> -
> -static bool is_ro_anon(struct vm_area_struct *vma)
> -{
> - /* check anonymous mapping. */
> - if (vma->vm_file || vma->vm_flags & VM_SHARED)
> - return false;
> -
> - /*
> - * check for non-writable:
> - * PROT=RO or PKRU is not writeable.
> - */
> - if (!(vma->vm_flags & VM_WRITE) ||
> - !arch_vma_access_permitted(vma, true, false, false))
> - return true;
> -
> - return false;
> -}
> -
> -/*
> - * Check if a vma is allowed to be modified by madvise.
> - */
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> - if (!is_madv_discard(behavior))
> - return true;
> -
> - if (unlikely(!can_modify_vma(vma) && is_ro_anon(vma)))
> - return false;
> -
> - /* Allow by default. */
> - return true;
> -}
> -
> static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vm_area_struct **prev, unsigned long start,
> unsigned long end, vm_flags_t newflags)
> diff --git a/mm/vma.h b/mm/vma.h
> index cf6e3a6371b6..6515045ba342 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -578,8 +578,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> return true;
> }
>
> -bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior);
> -
> #else
>
> static inline bool can_modify_vma(struct vm_area_struct *vma)
> @@ -587,11 +585,6 @@ static inline bool can_modify_vma(struct vm_area_struct *vma)
> return true;
> }
>
> -static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
> -{
> - return true;
> -}
> -
> #endif
>
> #if defined(CONFIG_STACK_GROWSUP)
> --
> 2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 3/5] mm/mseal: small cleanups
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
2025-07-14 14:39 ` David Hildenbrand
2025-07-14 15:23 ` Pedro Falcato
@ 2025-07-14 15:33 ` Liam R. Howlett
2 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:33 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> Drop the wholly unnecessary set_vma_sealed() and vma_is_sealed() helpers
> which are used only once, and place VMA_ITERATOR() declarations in the
> correct place.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mseal.c | 9 +--------
> mm/vma.h | 16 ++--------------
> 2 files changed, 3 insertions(+), 22 deletions(-)
>
> diff --git a/mm/mseal.c b/mm/mseal.c
> index 1308e88ab184..adbcc65e9660 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -15,11 +15,6 @@
> #include <linux/sched.h>
> #include "internal.h"
>
> -static inline void set_vma_sealed(struct vm_area_struct *vma)
> -{
> - vm_flags_set(vma, VM_SEALED);
> -}
> -
> static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vm_area_struct **prev, unsigned long start,
> unsigned long end, vm_flags_t newflags)
> @@ -36,7 +31,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto out;
> }
>
> - set_vma_sealed(vma);
> + vm_flags_set(vma, VM_SEALED);
> out:
> *prev = vma;
> return ret;
> @@ -53,7 +48,6 @@ static int check_mm_seal(unsigned long start, unsigned long end)
> {
> struct vm_area_struct *vma;
> unsigned long nstart = start;
> -
> VMA_ITERATOR(vmi, current->mm, start);
>
> /* going through each vma to check. */
> @@ -78,7 +72,6 @@ static int apply_mm_seal(unsigned long start, unsigned long end)
> {
> unsigned long nstart;
> struct vm_area_struct *vma, *prev;
> -
> VMA_ITERATOR(vmi, current->mm, start);
>
> vma = vma_iter_load(&vmi);
> diff --git a/mm/vma.h b/mm/vma.h
> index 6515045ba342..d17f560cf53d 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -560,31 +560,19 @@ struct vm_area_struct *vma_iter_next_rewind(struct vma_iterator *vmi,
> }
>
> #ifdef CONFIG_64BIT
> -
> -static inline bool vma_is_sealed(struct vm_area_struct *vma)
> -{
> - return (vma->vm_flags & VM_SEALED);
> -}
> -
> -/*
> - * check if a vma is sealed for modification.
> - * return true, if modification is allowed.
> - */
> +/* Check if a vma is sealed for modification. */
> static inline bool can_modify_vma(struct vm_area_struct *vma)
> {
> - if (unlikely(vma_is_sealed(vma)))
> + if (unlikely(vma->vm_flags & VM_SEALED))
> return false;
>
> return true;
> }
> -
> #else
> -
> static inline bool can_modify_vma(struct vm_area_struct *vma)
> {
> return true;
> }
> -
> #endif
>
> #if defined(CONFIG_STACK_GROWSUP)
> --
> 2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
2025-07-14 14:41 ` David Hildenbrand
2025-07-14 15:17 ` Pedro Falcato
@ 2025-07-14 15:35 ` Liam R. Howlett
2025-07-14 15:40 ` Lorenzo Stoakes
2 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> The check_mm_seal() function is doing something general - checking whether
> a range contains only VMAs (or rather that it does NOT contain any unmapped
> regions).
>
> Generalise this and put the logic in mm/vma.c - introducing
> range_contains_unmapped(). Additionally we can simplify the logic, we are
> simply checking whether the last vma->vm_end has either a VMA starting
> after it or ends before the end parameter.
>
> No functional change intended.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
I do fear that people will find this function and try and use it
internally, it may make our jobs of avoiding this being expanded more
annoying.
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mseal.c | 38 +++-----------------------------------
> mm/vma.c | 18 ++++++++++++++++++
> mm/vma.h | 3 +++
> 3 files changed, 24 insertions(+), 35 deletions(-)
>
> diff --git a/mm/mseal.c b/mm/mseal.c
> index adbcc65e9660..8e4c605af700 100644
> --- a/mm/mseal.c
> +++ b/mm/mseal.c
> @@ -37,34 +37,6 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return ret;
> }
>
> -/*
> - * Check for do_mseal:
> - * 1> start is part of a valid vma.
> - * 2> end is part of a valid vma.
> - * 3> No gap (unallocated address) between start and end.
> - * 4> map is sealable.
> - */
> -static int check_mm_seal(unsigned long start, unsigned long end)
> -{
> - struct vm_area_struct *vma;
> - unsigned long nstart = start;
> - VMA_ITERATOR(vmi, current->mm, start);
> -
> - /* going through each vma to check. */
> - for_each_vma_range(vmi, vma, end) {
> - if (vma->vm_start > nstart)
> - /* unallocated memory found. */
> - return -ENOMEM;
> -
> - if (vma->vm_end >= end)
> - return 0;
> -
> - nstart = vma->vm_end;
> - }
> -
> - return -ENOMEM;
> -}
> -
> /*
> * Apply sealing.
> */
> @@ -184,14 +156,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
> if (mmap_write_lock_killable(mm))
> return -EINTR;
>
> - /*
> - * First pass, this helps to avoid
> - * partial sealing in case of error in input address range,
> - * e.g. ENOMEM error.
> - */
> - ret = check_mm_seal(start, end);
> - if (ret)
> + if (range_contains_unmapped(mm, start, end)) {
> + ret = -ENOMEM;
> goto out;
> + }
>
> /*
> * Second pass, this should success, unless there are errors
> diff --git a/mm/vma.c b/mm/vma.c
> index b3d880652359..b57545568ae6 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -3203,3 +3203,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
>
> return 0;
> }
> +
> +/* Does the [start, end) range contain any unmapped memory? */
> +bool range_contains_unmapped(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + struct vm_area_struct *vma;
> + unsigned long prev_end = start;
> + VMA_ITERATOR(vmi, current->mm, start);
> +
> + for_each_vma_range(vmi, vma, end) {
> + if (vma->vm_start > prev_end)
> + return true;
> +
> + prev_end = vma->vm_end;
> + }
> +
> + return prev_end < end;
> +}
> diff --git a/mm/vma.h b/mm/vma.h
> index d17f560cf53d..bfe9a04e6018 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -598,4 +598,7 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
> int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
> #endif
>
> +bool range_contains_unmapped(struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> +
> #endif /* __MM_VMA_H */
> --
> 2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:31 ` Pedro Falcato
@ 2025-07-14 15:37 ` Lorenzo Stoakes
2025-07-14 15:41 ` David Hildenbrand
1 sibling, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:37 UTC (permalink / raw)
To: Pedro Falcato
Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 04:31:59PM +0100, Pedro Falcato wrote:
> On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> > [...]
> >
> > But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> > mapping?
>
> IIRC this was originally suggested by Linus, on one of the versions introducing
> mseal. But the gist is that discarding pages is okay if you could already zero
> them manually, using e.g memset. Hence the writeability checks.
Right, and if it's read-only and MAP_PRIVATE it doesn't really matter. In which
case we needn't worry.
I wonder if we need to check for VM_MAYWRITE though...
>
> --
> Pedro
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:27 ` Lorenzo Stoakes
@ 2025-07-14 15:37 ` David Hildenbrand
0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:37 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
On 14.07.25 17:27, Lorenzo Stoakes wrote:
> On Mon, Jul 14, 2025 at 05:24:14PM +0200, David Hildenbrand wrote:
>> On 14.07.25 17:18, Lorenzo Stoakes wrote:
>>> On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
>>>>>> or sth like that would surely clean that up further.
>>>>>
>>>>> Well, I plan to make this not a thing soon so I'd rather not.
>>>>>
>>>>> The intent is to make _all_ VMA flags work on 32-bit kernels. I have done some
>>>>> preparatory work and next cycle intend to do more on this.
>>>>>
>>>>> So I'd rather avoid any config changes on this until I've given this a shot.
>>>>
>>>> Sure, if that is in sight.
>>>
>>> Yes :)
>>>
>>>>>>> + * only do so via an appropriate madvise() call.
>>>>>>> + */
>>>>>>> +static bool can_madvise_modify(struct madvise_behavior *madv_behavior)
>>>>>>> +{
>>>>>>> + struct vm_area_struct *vma = madv_behavior->vma;
>>>>>>> +
>>>>>>> + /* If the operation won't discard, we're good. */
>>>>>>> + if (!is_discard(madv_behavior->behavior))
>>>>>>> + return true;
>>>>>>
>>>>>>
>>>>>> Conceptually, I would do this first and then handle all the discard cases /
>>>>>> exceptions.
>>>>>
>>>>> Hm I'm confused :P we do do this first? I think the idea with this is we can
>>>>> very cheaply ignore any MADV_ that isn't applicable.
>>>>>
>>>>> Did you mean to put this comment under line below?
>>>>>
>>>>> I mean it's not exactly a perf hotspot so don't mind moving them around.
>>>>
>>>> I was thinking of this (start with sealed, then go into details about
>>>> discards):
>>>>
>>>> /* If the VMA isn't sealed, we're all good. */
>>>> if (can_modify_vma(vma))
>>>> return true;
>>>>
>>>> /* In a sealed VMA, we only care about discard operations. */
>>>> if (!is_discard(madv_behavior->behavior))
>>>> return true;
>>>>
>>>> /* But discards of file-backed mappings are fine. */
>>>> if (!vma_is_anonymous(vma))
>>>> return true;
>>>
>>> Right yeah.
>>>
>>>>
>>>> ...
>>>>
>>>>
>>>> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
>>>> mapping?
>>>
>>> I'm duplicating existing logic here (well updating from the vma->vm_file check
>>> and a seemingly pointless !vma->vm_file && vma->vm_flags & VM_SHARED check), but
>>> this is a good point...
>>
>> Yeah, not blaming you, just scratching my head :)
>>
>>>
>>> For the purposes of the refactoring I guess best to keep the logic ostensibly
>>> the same given the 'no functional change intended', but we do need to fix this
>>> yes.
>>
>>
>> Likely a fix should be pulled in early? Not sure ... but it sure sounds
>> broken.
>
> Once this lands in mm-new I can send a fix (like litearlly tomorrow if it lands
> :P). I just don't think a functional change belongs as part of a refactoring.
>
> I worry that we'll get catch-22 caught on the (numerous) problems with the mseal
> implementation and thus not provide a foundation upon which to fix them...
Not sure what's right or wrong at this point. The cover letter only
talked about anonymous memory:
"
Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively
a memset(0) for anonymous memory.
"
We're similarly in the domain of altering memory content that was there
(resetting it back to whatever the file provided).
It does like something that is not desired, but no idea how security
relevant it would be. Probably a corner case we can indeed fixup separately.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:35 ` Liam R. Howlett
@ 2025-07-14 15:40 ` Lorenzo Stoakes
2025-07-14 15:43 ` Liam R. Howlett
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:40 UTC (permalink / raw)
To: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
Jeff Xu
On Mon, Jul 14, 2025 at 11:35:44AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> > The check_mm_seal() function is doing something general - checking whether
> > a range contains only VMAs (or rather that it does NOT contain any unmapped
> > regions).
> >
> > Generalise this and put the logic in mm/vma.c - introducing
> > range_contains_unmapped(). Additionally we can simplify the logic, we are
> > simply checking whether the last vma->vm_end has either a VMA starting
> > after it or ends before the end parameter.
> >
> > No functional change intended.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> I do fear that people will find this function and try and use it
> internally, it may make our jobs of avoiding this being expanded more
> annoying.
Hmm, surely we should have some ability to dissuade within mm :)
Thing is I don't love having this function in mm/mseal.c when it has
nothing to do with mseal()'ing.
If people want to be weird about gaps they can pretty trivially implement
something like this anyway. Probably. Possibly. Maybe?
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Thanks!
>
> > ---
> > mm/mseal.c | 38 +++-----------------------------------
> > mm/vma.c | 18 ++++++++++++++++++
> > mm/vma.h | 3 +++
> > 3 files changed, 24 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/mseal.c b/mm/mseal.c
> > index adbcc65e9660..8e4c605af700 100644
> > --- a/mm/mseal.c
> > +++ b/mm/mseal.c
> > @@ -37,34 +37,6 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return ret;
> > }
> >
> > -/*
> > - * Check for do_mseal:
> > - * 1> start is part of a valid vma.
> > - * 2> end is part of a valid vma.
> > - * 3> No gap (unallocated address) between start and end.
> > - * 4> map is sealable.
> > - */
> > -static int check_mm_seal(unsigned long start, unsigned long end)
> > -{
> > - struct vm_area_struct *vma;
> > - unsigned long nstart = start;
> > - VMA_ITERATOR(vmi, current->mm, start);
> > -
> > - /* going through each vma to check. */
> > - for_each_vma_range(vmi, vma, end) {
> > - if (vma->vm_start > nstart)
> > - /* unallocated memory found. */
> > - return -ENOMEM;
> > -
> > - if (vma->vm_end >= end)
> > - return 0;
> > -
> > - nstart = vma->vm_end;
> > - }
> > -
> > - return -ENOMEM;
> > -}
> > -
> > /*
> > * Apply sealing.
> > */
> > @@ -184,14 +156,10 @@ int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
> > if (mmap_write_lock_killable(mm))
> > return -EINTR;
> >
> > - /*
> > - * First pass, this helps to avoid
> > - * partial sealing in case of error in input address range,
> > - * e.g. ENOMEM error.
> > - */
> > - ret = check_mm_seal(start, end);
> > - if (ret)
> > + if (range_contains_unmapped(mm, start, end)) {
> > + ret = -ENOMEM;
> > goto out;
> > + }
> >
> > /*
> > * Second pass, this should success, unless there are errors
> > diff --git a/mm/vma.c b/mm/vma.c
> > index b3d880652359..b57545568ae6 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -3203,3 +3203,21 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> >
> > return 0;
> > }
> > +
> > +/* Does the [start, end) range contain any unmapped memory? */
> > +bool range_contains_unmapped(struct mm_struct *mm,
> > + unsigned long start, unsigned long end)
> > +{
> > + struct vm_area_struct *vma;
> > + unsigned long prev_end = start;
> > + VMA_ITERATOR(vmi, current->mm, start);
> > +
> > + for_each_vma_range(vmi, vma, end) {
> > + if (vma->vm_start > prev_end)
> > + return true;
> > +
> > + prev_end = vma->vm_end;
> > + }
> > +
> > + return prev_end < end;
> > +}
> > diff --git a/mm/vma.h b/mm/vma.h
> > index d17f560cf53d..bfe9a04e6018 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -598,4 +598,7 @@ int create_init_stack_vma(struct mm_struct *mm, struct vm_area_struct **vmap,
> > int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
> > #endif
> >
> > +bool range_contains_unmapped(struct mm_struct *mm,
> > + unsigned long start, unsigned long end);
> > +
> > #endif /* __MM_VMA_H */
> > --
> > 2.50.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:32 ` Lorenzo Stoakes
@ 2025-07-14 15:40 ` Liam R. Howlett
0 siblings, 0 replies; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:40 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Pedro Falcato, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 11:32]:
> On Mon, Jul 14, 2025 at 05:25:59PM +0200, David Hildenbrand wrote:
> > On 14.07.25 17:23, Lorenzo Stoakes wrote:
> > > On Mon, Jul 14, 2025 at 04:17:23PM +0100, Pedro Falcato wrote:
> > > > On Mon, Jul 14, 2025 at 02:00:39PM +0100, Lorenzo Stoakes wrote:
> > > > > The check_mm_seal() function is doing something general - checking whether
> > > > > a range contains only VMAs (or rather that it does NOT contain any unmapped
> > > > > regions).
> > > > >
> > > > > Generalise this and put the logic in mm/vma.c - introducing
> > > > > range_contains_unmapped(). Additionally we can simplify the logic, we are
> > > > > simply checking whether the last vma->vm_end has either a VMA starting
> > > > > after it or ends before the end parameter.
> > > > >
> > > >
> > > > I don't like this. Unless you have any other user for this in mind,
> > > > we'll proliferate this awful behavior (and add this into core vma code).
> > >
> > > I'm not sure how putting it in an internal-only mm file perpetuates
> > > anything.
> > >
> > > I'm naming the function by what it does, and putting it where it belongs in
> > > the VMA logic, and additionally making the function less horrible.
> > >
> > > Let's not please get stuck on the isues with mseal implementation which
> > > will catch-22 us into not being able to refactor.
> > >
> > > We can do the refactoring first and it's fine to just yank this if it's not
> > > used.
> > >
> > > I'm not having a function like this sat in mm/mseal.c when it has
> > > absolutely nothing to do with mseal specifically though.
> > >
> > > >
> > > > I have some patches locally to fully remove this upfront check, and AFAIK
> > > > we're somewhat in agreement that we can simply nuke this check (for
> > > > various reasons, including that we *still* don't have a man page for the
> > > > syscall). I can send them for proper discussion after your series lands.
> > >
> > > Yes I agree this check is odd, I don't really see why on earth we're
> > > concerned with whether there are gaps or not, you'd surely want to just
> > > seal whatever VMAs exist in the range?
> >
> > Probably because GAPs cannot be sealed. So user space could assume that in
> > fact, nothing in that area can change after a successful mseal, while it can
> > ...
> >
> > Not sure, though ...
>
> Yeah maybe a sekuriteh thing where you want to be sure the range is in fact
> _all_ sealed.
>
> I'm not sure that really makes much sense in practice honestly though, because
> if another thread can fiddle with things, then surely you've already 'lost'.
>
> if you expected to touch a VMA where in fact aa gap exists your software is
> _already_ broken because you're going to segfault.
Since this is applying a seal to a range that already exists, we are in
a race condition anyways.
If a gap exists it might be better to fail and exit as it is insecure,
but really, if someone has created a gap then they could have mapped
whatever they want..
>
> So it just seems overly theoretical to me.
I don't see a theoretical means to justify doing this, so I must be
missing something.
>
> I think we should error out if there's _no_ VMAs at all, but otherwise succeed.
>
> The semantics of 'all VMAs which exist in the range are sealed' would still be
> maintained.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:31 ` Pedro Falcato
2025-07-14 15:37 ` Lorenzo Stoakes
@ 2025-07-14 15:41 ` David Hildenbrand
2025-07-14 15:45 ` Lorenzo Stoakes
1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:41 UTC (permalink / raw)
To: Pedro Falcato
Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On 14.07.25 17:31, Pedro Falcato wrote:
> On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
>> [...]
>>
>> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
>> mapping?
>
> IIRC this was originally suggested by Linus, on one of the versions introducing
> mseal. But the gist is that discarding pages is okay if you could already zero
> them manually, using e.g memset. Hence the writeability checks.
What you can do is
a) mmap(PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, ...)
b) modify content (write, whatever)
c) mprotect(PROT_READ)
d) mseal()
But then still do
madvise(MADV_DONTNEED)
to discard.
There is no writability anymore.
(Just a note that, with hugetlb, it is fairly common to
mmap(MAP_PRIVATE) empty files and only work with anonymous pages.)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:40 ` Lorenzo Stoakes
@ 2025-07-14 15:43 ` Liam R. Howlett
2025-07-14 15:47 ` Lorenzo Stoakes
0 siblings, 1 reply; 38+ messages in thread
From: Liam R. Howlett @ 2025-07-14 15:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Vlastimil Babka, Jann Horn,
Pedro Falcato, linux-mm, linux-kernel, Jeff Xu
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 11:40]:
> On Mon, Jul 14, 2025 at 11:35:44AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250714 09:08]:
> > > The check_mm_seal() function is doing something general - checking whether
> > > a range contains only VMAs (or rather that it does NOT contain any unmapped
> > > regions).
> > >
> > > Generalise this and put the logic in mm/vma.c - introducing
> > > range_contains_unmapped(). Additionally we can simplify the logic, we are
> > > simply checking whether the last vma->vm_end has either a VMA starting
> > > after it or ends before the end parameter.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > I do fear that people will find this function and try and use it
> > internally, it may make our jobs of avoiding this being expanded more
> > annoying.
>
> Hmm, surely we should have some ability to dissuade within mm :)
>
> Thing is I don't love having this function in mm/mseal.c when it has
> nothing to do with mseal()'ing.
>
> If people want to be weird about gaps they can pretty trivially implement
> something like this anyway. Probably. Possibly. Maybe?
Yes, but the existence of a function legitimizes their thought prior to
sending it for review. That is, seeing a function that already does it
makes okay to include the option in the planning.
>
> >
> > Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>
> Thanks!
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:41 ` David Hildenbrand
@ 2025-07-14 15:45 ` Lorenzo Stoakes
2025-07-14 15:52 ` David Hildenbrand
0 siblings, 1 reply; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:41:45PM +0200, David Hildenbrand wrote:
> On 14.07.25 17:31, Pedro Falcato wrote:
> > On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> > > mapping?
> >
> > IIRC this was originally suggested by Linus, on one of the versions introducing
> > mseal. But the gist is that discarding pages is okay if you could already zero
> > them manually, using e.g memset. Hence the writeability checks.
>
> What you can do is
>
> a) mmap(PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, ...)
>
> b) modify content (write, whatever)
>
> c) mprotect(PROT_READ)
>
> d) mseal()
>
> But then still do
>
> madvise(MADV_DONTNEED)
>
> to discard.
>
>
> There is no writability anymore.
Well, you can mprotect() writable it again :)
>
> (Just a note that, with hugetlb, it is fairly common to mmap(MAP_PRIVATE)
> empty files and only work with anonymous pages.)
This just makes me think that we should be checking VM_MAYWRITE anyway which
squares this circle.
Otherwise you can do 'workarounds' with mprotect() generally.
It's really only meaningful for a MAP_PRIVATE file-backed mapping of a read-only
file.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check
2025-07-14 15:43 ` Liam R. Howlett
@ 2025-07-14 15:47 ` Lorenzo Stoakes
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 15:47 UTC (permalink / raw)
To: Liam R. Howlett, Andrew Morton, David Hildenbrand,
Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
Jeff Xu
On Mon, Jul 14, 2025 at 11:43:57AM -0400, Liam R. Howlett wrote:
> Yes, but the existence of a function legitimizes their thought prior to
> sending it for review. That is, seeing a function that already does it
> makes okay to include the option in the planning.
True.
OK, based on what you're saying and what Pedro's saying, next respin/fixpatch
I"ll put this in mseal.c as a static function, even if it makes me cringe a bit
to do it.
And then Pedro can remove it in his series :)
For the record I don't agree with this check being performed... it seems silly
to me.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:45 ` Lorenzo Stoakes
@ 2025-07-14 15:52 ` David Hildenbrand
2025-07-14 16:01 ` Lorenzo Stoakes
0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-07-14 15:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On 14.07.25 17:45, Lorenzo Stoakes wrote:
> On Mon, Jul 14, 2025 at 05:41:45PM +0200, David Hildenbrand wrote:
>> On 14.07.25 17:31, Pedro Falcato wrote:
>>> On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>> But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
>>>> mapping?
>>>
>>> IIRC this was originally suggested by Linus, on one of the versions introducing
>>> mseal. But the gist is that discarding pages is okay if you could already zero
>>> them manually, using e.g memset. Hence the writeability checks.
>>
>> What you can do is
>>
>> a) mmap(PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, ...)
>>
>> b) modify content (write, whatever)
>>
>> c) mprotect(PROT_READ)
>>
>> d) mseal()
>>
>> But then still do
>>
>> madvise(MADV_DONTNEED)
>>
>> to discard.
>>
>>
>> There is no writability anymore.
>
> Well, you can mprotect() writable it again :)
Isn't that what sealing ... prohibits?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c
2025-07-14 15:52 ` David Hildenbrand
@ 2025-07-14 16:01 ` Lorenzo Stoakes
0 siblings, 0 replies; 38+ messages in thread
From: Lorenzo Stoakes @ 2025-07-14 16:01 UTC (permalink / raw)
To: David Hildenbrand
Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Jeff Xu
On Mon, Jul 14, 2025 at 05:52:56PM +0200, David Hildenbrand wrote:
> On 14.07.25 17:45, Lorenzo Stoakes wrote:
> > On Mon, Jul 14, 2025 at 05:41:45PM +0200, David Hildenbrand wrote:
> > > On 14.07.25 17:31, Pedro Falcato wrote:
> > > > On Mon, Jul 14, 2025 at 05:03:03PM +0200, David Hildenbrand wrote:
> > > > > [...]
> > > > >
> > > > > But now I wonder, why is it okay to discard anon pages in a MAP_PRIVATE file
> > > > > mapping?
> > > >
> > > > IIRC this was originally suggested by Linus, on one of the versions introducing
> > > > mseal. But the gist is that discarding pages is okay if you could already zero
> > > > them manually, using e.g memset. Hence the writeability checks.
> > >
> > > What you can do is
> > >
> > > a) mmap(PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, ...)
> > >
> > > b) modify content (write, whatever)
> > >
> > > c) mprotect(PROT_READ)
> > >
> > > d) mseal()
> > >
> > > But then still do
> > >
> > > madvise(MADV_DONTNEED)
> > >
> > > to discard.
> > >
> > >
> > > There is no writability anymore.
> >
> > Well, you can mprotect() writable it again :)
>
> Isn't that what sealing ... prohibits?
Doh! Haha. Yes :) very good point...
OK I'll send the fix once this refactoring lands.
I think the check should simply be replaced with a vma->vm_flags & VM_SHARED
check then?
I wonder if the argument was that the MAP_PRIVATE file mapping is somehow
ephemeral anyway so not so problematic to lose. But I'm sure there are users who
feel differently about that...
Note that a truncate operation on the file will immediately drop the MAP_PRIVATE
anon backing anyway so it's sort of 'droppable' the whole time.
However, since a user is explicitly asking to seal the memory, and would have
expectation of seal semantics, so I agree that we need to fix this.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2025-07-14 16:01 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-14 13:00 [PATCH 0/5] mseal cleanups Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 1/5] mm/mseal: always define VM_SEALED Lorenzo Stoakes
2025-07-14 14:31 ` David Hildenbrand
2025-07-14 15:20 ` Pedro Falcato
2025-07-14 15:32 ` Liam R. Howlett
2025-07-14 13:00 ` [PATCH 2/5] mm/mseal: move madvise() logic to mm/madvise.c Lorenzo Stoakes
2025-07-14 14:37 ` David Hildenbrand
2025-07-14 14:56 ` Lorenzo Stoakes
2025-07-14 15:03 ` David Hildenbrand
2025-07-14 15:18 ` Lorenzo Stoakes
2025-07-14 15:24 ` David Hildenbrand
2025-07-14 15:27 ` Lorenzo Stoakes
2025-07-14 15:37 ` David Hildenbrand
2025-07-14 15:31 ` Pedro Falcato
2025-07-14 15:37 ` Lorenzo Stoakes
2025-07-14 15:41 ` David Hildenbrand
2025-07-14 15:45 ` Lorenzo Stoakes
2025-07-14 15:52 ` David Hildenbrand
2025-07-14 16:01 ` Lorenzo Stoakes
2025-07-14 15:18 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2025-07-14 13:00 ` [PATCH 3/5] mm/mseal: small cleanups Lorenzo Stoakes
2025-07-14 14:39 ` David Hildenbrand
2025-07-14 15:23 ` Pedro Falcato
2025-07-14 15:33 ` Liam R. Howlett
2025-07-14 13:00 ` [PATCH 4/5] mm/mseal: separate out and simplify VMA gap check Lorenzo Stoakes
2025-07-14 14:41 ` David Hildenbrand
2025-07-14 15:17 ` Pedro Falcato
2025-07-14 15:23 ` Lorenzo Stoakes
2025-07-14 15:25 ` David Hildenbrand
2025-07-14 15:32 ` Lorenzo Stoakes
2025-07-14 15:40 ` Liam R. Howlett
2025-07-14 15:35 ` Liam R. Howlett
2025-07-14 15:40 ` Lorenzo Stoakes
2025-07-14 15:43 ` Liam R. Howlett
2025-07-14 15:47 ` Lorenzo Stoakes
2025-07-14 13:00 ` [PATCH 5/5] mm/mseal: rework mseal apply logic Lorenzo Stoakes
2025-07-14 15:26 ` Pedro Falcato
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).