[RFC PATCH 0/5] add process_madvise() flags to modify behaviour

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
@ 2025-05-19 20:52 Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 1/5] mm: madvise: refactor madvise_populate() Lorenzo Stoakes
                   ` (8 more replies)
  0 siblings, 9 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

REVIEWERS NOTES:
================

This is a VERY EARLY version of the idea, it's relatively untested, and I'm
'putting it out there' for feedback. Any serious version of this will add a
bunch of self-tests to assert correct behaviour and I will more carefully
confirm everything's working.

This is based on discussion arising from Usama's series [0], SJ's input on
the thread around process_madvise() behaviour [1] (and a subsequent
response by me [2]) and prior discussion about a new madvise() interface
[3].

[0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
[1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
[2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
[3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/

================

Currently, we are rather restricted in how madvise() operations
proceed. While effort has been put in to expanding what process_madvise()
can do (that is - unrestricted application of advice to the local process
alongside recent improvements on the efficiency of TLB operations over
these batvches), we are still constrained by existing madvise() limitations
and default behaviours.

This series makes use of the currently unused flags field in
process_madvise() to provide more flexiblity.

It introduces four flags:

1. PMADV_SKIP_ERRORS

Currently, when an error arises applying advice in any individual VMA
(keeping in mind that a range specified to madvise() or as part of the
iovec passed to process_madvise()), the operation stops where it is and
returns an error.

This might not be the desired behaviour of the user, who may wish instead
for the operation to be 'best effort'. By setting this flag, that behaviour
is obtained.

Since process_madvise() would trivially, if skipping errors, simply return
the input vector size, we instead return the number of entries in the
vector which completed successfully without error.

The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.

2. PMADV_NO_ERROR_ON_UNMAPPED

Currently madvise() has the peculiar behaviour of, if the range specified
to it contains unmapped range(s), completing the full operation, but
ultimately returning -ENOMEM.

In the case of process_madvise(), this is fatal, as the operation will stop
immediately upon this occurring.

By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
unmapped areas to simply be entirely ignored.

3. PMADV_SET_FORK_EXEC_DEFAULT

It may be desirable for a user to specify that all VMAs mapped in a process
address space default to having an madvise() behaviour established by
default, in such a fashion as that this persists across fork/exec.

Since this is a very powerful option that would make no sense for many
advice modes, we explicitly only permit known-safe flags here (currently
MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

4. PMADV_ENTIRE_ADDRESS_SPACE

It can be annoying, should a user wish to apply madvise() to all VMAs in an
address space, to have to add a singular large entry to the input iovec.

So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
we expect the user to pass NULL and -1 to the vec and vlen parameters
respectively so they explicitly acknowledge that these will be ignored,
e.g.:

	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);

Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may
well be the case that incompatible VMAs will be encountered that ought to
be skipped.

If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise
implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the
entire address space spans at least some gaps.

Lorenzo Stoakes (5):
  mm: madvise: refactor madvise_populate()
  mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
  mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
  mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag

 include/uapi/asm-generic/mman-common.h |   6 +
 mm/madvise.c                           | 206 +++++++++++++++++++------
 2 files changed, 168 insertions(+), 44 deletions(-)

--
2.49.0

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC PATCH 1/5] mm: madvise: refactor madvise_populate()
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
@ 2025-05-19 20:52 ` Lorenzo Stoakes
  2025-05-20 10:30   ` David Hildenbrand
  2025-05-19 20:52 ` [RFC PATCH 2/5] mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag Lorenzo Stoakes
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

Use a for-loop rather than a while with the update of the start argument at
the end of the while-loop.

This is in preparation for a subsequent commit which modifies this
function, we therefore separate the refactoring from the actual change
cleanly by separating the two.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/madvise.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..63cc69daa4c7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -967,32 +967,33 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
 	int locked = 1;
 	long pages;
 
-	while (start < end) {
+	for (; start < end; start += pages * PAGE_SIZE) {
 		/* Populate (prefault) page tables readable/writable. */
 		pages = faultin_page_range(mm, start, end, write, &locked);
 		if (!locked) {
 			mmap_read_lock(mm);
 			locked = 1;
 		}
-		if (pages < 0) {
-			switch (pages) {
-			case -EINTR:
-				return -EINTR;
-			case -EINVAL: /* Incompatible mappings / permissions. */
-				return -EINVAL;
-			case -EHWPOISON:
-				return -EHWPOISON;
-			case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
-				return -EFAULT;
-			default:
-				pr_warn_once("%s: unhandled return value: %ld\n",
-					     __func__, pages);
-				fallthrough;
-			case -ENOMEM: /* No VMA or out of memory. */
-				return -ENOMEM;
-			}
+
+		if (pages >= 0)
+			continue;
+
+		switch (pages) {
+		case -EINTR:
+			return -EINTR;
+		case -EINVAL: /* Incompatible mappings / permissions. */
+			return -EINVAL;
+		case -EHWPOISON:
+			return -EHWPOISON;
+		case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
+			return -EFAULT;
+		default:
+			pr_warn_once("%s: unhandled return value: %ld\n",
+				     __func__, pages);
+			fallthrough;
+		case -ENOMEM: /* No VMA or out of memory. */
+			return -ENOMEM;
 		}
-		start += pages * PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 2/5] mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 1/5] mm: madvise: refactor madvise_populate() Lorenzo Stoakes
@ 2025-05-19 20:52 ` Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 3/5] mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED " Lorenzo Stoakes
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

Currently process_madvise() has an unused flags field. Make use of it to
modify madvise behaviour, starting with a flag to enable the ignoring of
errors in the process of applying advice across ranges.

Currently, should an error occur in a VMA (keeping in mind each individual
range specified to process_madvise() may span multiple VMAs), we abort the
operation.

In the case of madvise() we report the error should one occur, or -ENOMEM
should no error occur but a gap (unmapped memory) is contained in the
specified range.

In the case of process_madvise() we only report an error if absolutely none
of the specified ranges succeeded.

This patch introduces a new process_madvise() flag - PMADV_SKIP_ERRORS -
which changes this behaviour - if an operation fails, whether part of an
individual range or one of the ranges specified in the input vector, we
simply carry on and do a 'best effort' pass over the input VMAs.

Since process_madvise() cannot report errors should any operation succeed,
we cannot sensibly return an error code in this case without it being
unclear as to whether the whole operation failed, or if an individual one
failed (or perhaps an individual VMA within the range).

To provide sensible feedback, we instead report the number of iovec entries
specified by the user which had no skipping occur whether internally (at
least one VMA specified in the individual range) or as a while.

We also intentionally do not consider the (rather odd) case of an error
arising should a specified range contain unmapped memory - any unmapped
regions encountered will not change the reported number of ranges
successfully advised.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  3 +
 mm/madvise.c                           | 92 +++++++++++++++++++++-----
 2 files changed, 79 insertions(+), 16 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef1c27fa3c57..a5e4e2f3e82d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -91,4 +91,7 @@
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
 
+/* process_madvise() flags */
+#define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. */
+
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 63cc69daa4c7..37ef1d6f4190 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -51,6 +51,8 @@ struct madvise_walk_private {
 struct madvise_behavior {
 	int behavior;
 	struct mmu_gather *tlb;
+	int flags;
+	bool saw_error;
 };
 
 /*
@@ -961,8 +963,10 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 }
 
 static long madvise_populate(struct mm_struct *mm, unsigned long start,
-		unsigned long end, int behavior)
+		unsigned long end, struct madvise_behavior *madv_behavior)
 {
+	int behavior = madv_behavior->behavior;
+	bool can_skip = madv_behavior->flags & PMADV_SKIP_ERRORS;
 	const bool write = behavior == MADV_POPULATE_WRITE;
 	int locked = 1;
 	long pages;
@@ -978,6 +982,13 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
 		if (pages >= 0)
 			continue;
 
+		if (can_skip) {
+			/* Simply try the next page. */
+			pages = 1;
+			madv_behavior->saw_error = true;
+			continue;
+		}
+
 		switch (pages) {
 		case -EINTR:
 			return -EINTR;
@@ -1254,12 +1265,11 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
  * will handle splitting a vm area into separate areas, each area with its own
  * behavior.
  */
-static int madvise_vma_behavior(struct vm_area_struct *vma,
+static int __madvise_vma_behavior(struct vm_area_struct *vma,
 				struct vm_area_struct **prev,
 				unsigned long start, unsigned long end,
-				void *behavior_arg)
+				struct madvise_behavior *arg)
 {
-	struct madvise_behavior *arg = behavior_arg;
 	int behavior = arg->behavior;
 	int error;
 	struct anon_vma_name *anon_name;
@@ -1354,19 +1364,38 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 	return error;
 }
 
+static int madvise_vma_behavior(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end,
+				void *behavior_arg)
+{
+	struct madvise_behavior *madv_behavior = behavior_arg;
+	int ret = __madvise_vma_behavior(vma, prev, start, end, madv_behavior);
+	bool can_skip = madv_behavior->flags & PMADV_SKIP_ERRORS;
+
+	/* We must propagate the restart no matter what. */
+	if (can_skip && ret && ret != -ERESTARTNOINTR) {
+		madv_behavior->saw_error = true;
+		return 0;
+	}
+
+	return ret;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
  */
-static int madvise_inject_error(int behavior,
+static int madvise_inject_error(struct madvise_behavior *madv_behavior,
 		unsigned long start, unsigned long end)
 {
+	int behavior = madv_behavior->behavior;
+	bool can_skip = madv_behavior->flags & PMADV_SKIP_ERRORS;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-
 	for (; start < end; start += size) {
 		unsigned long pfn;
 		struct page *page;
@@ -1396,8 +1425,12 @@ static int madvise_inject_error(int behavior,
 				ret = 0;
 		}
 
-		if (ret)
-			return ret;
+		if (ret) {
+			if (can_skip)
+				madv_behavior->saw_error = true;
+			else
+				return ret;
+		}
 	}
 
 	return 0;
@@ -1416,7 +1449,7 @@ static bool is_memory_failure(int behavior)
 
 #else
 
-static int madvise_inject_error(int behavior,
+static int madvise_inject_error(struct madvise_behavior *madv_behavior,
 		unsigned long start, unsigned long end)
 {
 	return 0;
@@ -1721,13 +1754,14 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	int error;
 
 	if (is_memory_failure(behavior))
-		return madvise_inject_error(behavior, start, start + len_in);
+		return madvise_inject_error(madv_behavior, start,
+					    start + len_in);
 	start = untagged_addr_remote(mm, start);
 	end = start + PAGE_ALIGN(len_in);
 
 	blk_start_plug(&plug);
 	if (is_madvise_populate(behavior))
-		error = madvise_populate(mm, start, end, behavior);
+		error = madvise_populate(mm, start, end, madv_behavior);
 	else
 		error = madvise_walk_vmas(mm, start, end, madv_behavior,
 					  madvise_vma_behavior);
@@ -1836,7 +1870,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 /* Perform an madvise operation over a vector of addresses and lengths. */
 static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
-			      int behavior)
+			      int behavior, int flags)
 {
 	ssize_t ret = 0;
 	size_t total_len;
@@ -1844,7 +1878,10 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 	struct madvise_behavior madv_behavior = {
 		.behavior = behavior,
 		.tlb = &tlb,
+		.flags = flags,
 	};
+	bool can_skip = flags & PMADV_SKIP_ERRORS;
+	size_t skipped = 0;
 
 	total_len = iov_iter_count(iter);
 
@@ -1886,18 +1923,41 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			madvise_init_tlb(&madv_behavior, mm);
 			continue;
 		}
-		if (ret < 0)
+		if (ret < 0 && !can_skip)
 			break;
+
+		/* All skip cases will return 0. */
+		if (can_skip && madv_behavior.saw_error) {
+			skipped++;
+			madv_behavior.saw_error = false;
+		}
+
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
 	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
-	ret = (total_len - iov_iter_count(iter)) ? : ret;
+	/*
+	 * Since we ignore errors in this case, simply report the number of
+	 * entries in the iovec which were totally successful.
+	 */
+	if (can_skip)
+		return total_len - skipped;
 
+	ret = (total_len - iov_iter_count(iter)) ? : ret;
 	return ret;
 }
 
+static bool check_process_madvise_flags(unsigned int flags)
+{
+	unsigned int mask = PMADV_SKIP_ERRORS;
+
+	if (flags & ~mask)
+		return false;
+
+	return true;
+}
+
 SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		size_t, vlen, int, behavior, unsigned int, flags)
 {
@@ -1909,7 +1969,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	struct mm_struct *mm;
 	unsigned int f_flags;
 
-	if (flags != 0) {
+	if (!check_process_madvise_flags(flags)) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1950,7 +2010,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 		goto release_mm;
 	}
 
-	ret = vector_madvise(mm, &iter, behavior);
+	ret = vector_madvise(mm, &iter, behavior, flags);
 
 release_mm:
 	mmput(mm);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 3/5] mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 1/5] mm: madvise: refactor madvise_populate() Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 2/5] mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag Lorenzo Stoakes
@ 2025-05-19 20:52 ` Lorenzo Stoakes
  2025-05-19 20:52 ` [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT " Lorenzo Stoakes
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

madvise() has the peculiar behaviour of, should unmapped memory be
encountered in a specified range, carrying on regardless, but returning an
-ENOMEM error at the end of the operation.

This is problematic as one cannot tell if this error arose from there being
unmapped memory or some other cause. It's also rather odd behaviour.

However, we must maintain madvise() behaviour as-is for uAPI reasons.

When it comes to process_madvise(), things are a little different (and a
little more strange). Should ANY entry in the supplied iovec encounter no
errors whatsoever, then all errors encountered are swallowed and we instead
report the number of iovec ranges which succeeded.

However, if no entries succeeded, we simply report the error.

This is problematic in the case of gaps - as the -ENOMEM return value is
treated like an error and causes the entire operation to halt.

A user may very well not desire this behaviour, so we provide the
PMADV_NO_ERROR_ON_UNMAPPED option to cause this to not be treated as an
error at all.

Note that due to the way it is implemented, PMADV_SKIP_ERRORS implies
PMADV_NO_ERROR_ON_UNMAPPED.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  3 ++-
 mm/madvise.c                           | 15 ++++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index a5e4e2f3e82d..58c8a3fadf99 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -92,6 +92,7 @@
 				 PKEY_DISABLE_WRITE)
 
 /* process_madvise() flags */
-#define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. */
+#define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. Implies no error on unmapped. */
+#define PMADV_NO_ERROR_ON_UNMAPPED (1U << 1) /* Never report an error on unmapped ranges. */
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 37ef1d6f4190..fd94ef27f909 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1530,7 +1530,7 @@ static bool process_madvise_remote_valid(int behavior)
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, void *arg,
+		      unsigned long end, bool err_on_unmapped, void *arg,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
 				   unsigned long end, void *arg))
@@ -1584,7 +1584,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 			vma = find_vma(mm, start);
 	}
 
-	return unmapped_error;
+	return err_on_unmapped ? unmapped_error : 0;
 }
 
 #ifdef CONFIG_ANON_VMA_NAME
@@ -1632,8 +1632,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 	if (end == start)
 		return 0;
 
-	return madvise_walk_vmas(mm, start, end, anon_name,
-				 madvise_vma_anon_name);
+	return madvise_walk_vmas(mm, start, end, /* err_on_unmapped= */true,
+				 anon_name, madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
@@ -1752,6 +1752,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	struct blk_plug plug;
 	unsigned long end;
 	int error;
+	bool err_on_unmapped = !(madv_behavior->flags & PMADV_NO_ERROR_ON_UNMAPPED);
 
 	if (is_memory_failure(behavior))
 		return madvise_inject_error(madv_behavior, start,
@@ -1763,8 +1764,8 @@ static int madvise_do_behavior(struct mm_struct *mm,
 	if (is_madvise_populate(behavior))
 		error = madvise_populate(mm, start, end, madv_behavior);
 	else
-		error = madvise_walk_vmas(mm, start, end, madv_behavior,
-					  madvise_vma_behavior);
+		error = madvise_walk_vmas(mm, start, end, err_on_unmapped,
+					  madv_behavior, madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	return error;
 }
@@ -1950,7 +1951,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 static bool check_process_madvise_flags(unsigned int flags)
 {
-	unsigned int mask = PMADV_SKIP_ERRORS;
+	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED;
 
 	if (flags & ~mask)
 		return false;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-05-19 20:52 ` [RFC PATCH 3/5] mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED " Lorenzo Stoakes
@ 2025-05-19 20:52 ` Lorenzo Stoakes
  2025-05-20  8:38   ` Pedro Falcato
  2025-05-20 22:26   ` Johannes Weiner
  2025-05-19 20:52 ` [RFC PATCH 5/5] mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE " Lorenzo Stoakes
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

It's useful in certain cases to be able to default-enable an madvise() flag
for all newly mapped VMAs, and for that to survive fork/exec.

The natural place to specify something like this is in an madvise()
invocation, and thus providing this functionality as a flag to
process_madvise() makes sense.

We intentionally limit this only to flags that we know should function
correctly without issue, and to be conservative about this, so we initially
limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.

We implement this functionality by using the mm_struct->def_flags field.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/madvise.c                           | 43 +++++++++++++++++++++++++-
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 58c8a3fadf99..6998ea0ecc6d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -94,5 +94,6 @@
 /* process_madvise() flags */
 #define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. Implies no error on unmapped. */
 #define PMADV_NO_ERROR_ON_UNMAPPED (1U << 1) /* Never report an error on unmapped ranges. */
+#define PMADV_SET_FORK_EXEC_DEFAULT (1U << 2) /* Set the behavior as a default that survives fork/exec. */
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index fd94ef27f909..9ea36800de3c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1869,6 +1869,40 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	return do_madvise(current->mm, start, len_in, behavior);
 }
 
+/*
+ * Are we permitted to set an madvise() behavior by default across the virtual
+ * address space, surviving fork/exec?
+ */
+static bool can_set_default_behavior(int behavior)
+{
+	switch (behavior) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		return true;
+#endif
+	default:
+		return false;
+	}
+}
+
+static void set_default_behavior(struct mm_struct *mm, int behavior)
+{
+	switch (behavior) {
+	case MADV_HUGEPAGE:
+		mm->def_flags &= ~VM_NOHUGEPAGE;
+		mm->def_flags |= VM_HUGEPAGE;
+		break;
+	case MADV_NOHUGEPAGE:
+		mm->def_flags &= ~VM_HUGEPAGE;
+		mm->def_flags |= VM_NOHUGEPAGE;
+		break;
+	default:
+		WARN_ON(1);
+		break;
+	}
+}
+
 /* Perform an madvise operation over a vector of addresses and lengths. */
 static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 			      int behavior, int flags)
@@ -1882,8 +1916,12 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		.flags = flags,
 	};
 	bool can_skip = flags & PMADV_SKIP_ERRORS;
+	bool set_default = flags & PMADV_SET_FORK_EXEC_DEFAULT;
 	size_t skipped = 0;
 
+	if (set_default && !can_set_default_behavior(behavior))
+		return -EINVAL;
+
 	total_len = iov_iter_count(iter);
 
 	ret = madvise_lock(mm, behavior);
@@ -1931,6 +1969,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		if (can_skip && madv_behavior.saw_error) {
 			skipped++;
 			madv_behavior.saw_error = false;
+		} else if (set_default) {
+			set_default_behavior(mm, behavior);
 		}
 
 		iov_iter_advance(iter, iter_iov_len(iter));
@@ -1951,7 +1991,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 static bool check_process_madvise_flags(unsigned int flags)
 {
-	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED;
+	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED |
+		PMADV_SET_FORK_EXEC_DEFAULT;
 
 	if (flags & ~mask)
 		return false;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH 5/5] mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-05-19 20:52 ` [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT " Lorenzo Stoakes
@ 2025-05-19 20:52 ` Lorenzo Stoakes
  2025-05-19 21:53 ` [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Jann Horn
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-19 20:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

For convenience, add the ability to specify the entire address space to
apply the madvise() flag to.

This is best used with PMADV_SKIP_ERRORS (this implies
PMADV_NO_ERROR_ON_UNMAPPED) to skip over any VMAs to which the flags do not
apply.

When this flag is specified, the input vec and vlen parameters must be set
to NULL and -1 respectively, as the user is requesting to perform the
action on the entire address space, e.g.:

process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
	PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);

This can be used in conjunction with PMADV_SET_FORK_EXEC_DEFAULT for the
ability to both apply an madvise() behaviour to all VMAs in the process
address space but also to default set the relevant VMA flags for any new
mappings, e.g.:

process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
	PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS |
	PMADV_SET_FORK_EXEC_DEFAULT);

Which can be useful for ensuring that the flag in question is consistently
applied everywhere.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/madvise.c                           | 23 +++++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6998ea0ecc6d..3d523db2f100 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -95,5 +95,6 @@
 #define PMADV_SKIP_ERRORS (1U << 0) /* Skip VMAs on errors, but carry on. Implies no error on unmapped. */
 #define PMADV_NO_ERROR_ON_UNMAPPED (1U << 1) /* Never report an error on unmapped ranges. */
 #define PMADV_SET_FORK_EXEC_DEFAULT (1U << 2) /* Set the behavior as a default that survives fork/exec. */
+#define PMADV_ENTIRE_ADDRESS_SPACE (1U << 3) /* Ignore input iovec and apply to entire address space. */
 
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 9ea36800de3c..0fb8cd7fdc7a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1992,7 +1992,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 static bool check_process_madvise_flags(unsigned int flags)
 {
 	unsigned int mask = PMADV_SKIP_ERRORS | PMADV_NO_ERROR_ON_UNMAPPED |
-		PMADV_SET_FORK_EXEC_DEFAULT;
+		PMADV_SET_FORK_EXEC_DEFAULT | PMADV_ENTIRE_ADDRESS_SPACE;
 
 	if (flags & ~mask)
 		return false;
@@ -2010,15 +2010,30 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
 	struct task_struct *task;
 	struct mm_struct *mm;
 	unsigned int f_flags;
+	bool entire_address_space = flags & PMADV_ENTIRE_ADDRESS_SPACE;
 
 	if (!check_process_madvise_flags(flags)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
-	ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
-	if (ret < 0)
-		goto out;
+	if (entire_address_space) {
+		/* The user must specify NULL, -1 vec, vlen parameters. */
+		if (vec != NULL || vlen != (size_t)-1)
+			return -EINVAL;
+
+		/*
+		 * Ignore the input and simply add a single entry spanning the
+		 * whole address space.
+		 */
+		iovstack[0].iov_base = 0;
+		iovstack[0].iov_len = TASK_SIZE_MAX;
+		iov_iter_init(&iter, ITER_DEST, iov, 1, 1);
+	} else {
+		ret = import_iovec(ITER_DEST, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+		if (ret < 0)
+			goto out;
+	}
 
 	task = pidfd_get_task(pidfd, &f_flags);
 	if (IS_ERR(task)) {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-05-19 20:52 ` [RFC PATCH 5/5] mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE " Lorenzo Stoakes
@ 2025-05-19 21:53 ` Jann Horn
  2025-05-20  5:35   ` Lorenzo Stoakes
  2025-05-20 15:28 ` David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 57+ messages in thread
From: Jann Horn @ 2025-05-19 21:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Arnd Bergmann, Christian Brauner, linux-mm,
	linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> 3. PMADV_SET_FORK_EXEC_DEFAULT
>
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.

Settings that persist across exec are generally a bit dodgy and you
have to ask questions like "what happens on setuid execution, could
this somehow influence the behavior of a setuid binary in a
security-relevant way", and I'm not sure whether that is the case with
hugepage flags but I guess it could maybe influence the alignment with
which mappings are created or something like that? And if you add more
flags to this list later, you'll again have to think about annoying
setuid considerations each time.

For comparison, personality flags are explicitly supposed to persist
across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
cleared only if the execution is privileged. (Annoyingly, the
PER_CLEAR_ON_SETID handling is currently implemented separately for
each type of privileged execution we can have
[setuid/setgid/fscaps/selinux transition/apparmor transition/smack
transition], but I guess you could probably gate it on
bprm->secureexec instead...).

It would be nice if you could either make this a property of the
mm_struct that does not persist across exec, or if that would break
your intended usecase, alternatively wipe it on privileged execution.

> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE,
flag, 0, 0, 0) that also allows opt-in on top of opt-out.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-19 21:53 ` [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Jann Horn
@ 2025-05-20  5:35   ` Lorenzo Stoakes
  2025-05-20 16:04     ` Jann Horn
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20  5:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Arnd Bergmann, Christian Brauner, linux-mm,
	linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> On Mon, May 19, 2025 at 10:53 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
>
> Settings that persist across exec are generally a bit dodgy and you
> have to ask questions like "what happens on setuid execution, could
> this somehow influence the behavior of a setuid binary in a
> security-relevant way", and I'm not sure whether that is the case with
> hugepage flags but I guess it could maybe influence the alignment with
> which mappings are created or something like that? And if you add more
> flags to this list later, you'll again have to think about annoying
> setuid considerations each time.

I do absolutely agree they're dodgy, which is why this is significantly
restricted.

The intent is that we'd _only ever_ add anything to this list after such
careful consideration had been made.

And obviously then you can assert whatever checks make sense for each
permitted madvise() flag.

Obviously this change is motivated by Usama's series, looking at an
alternative approach (as well as - at the same time - expanding what we can
do with [process_]madvise() - an ongoing bugbear for some time).

So this provides a less 'tacked on' (and thus bit rotting) version of this
change, to enforce that behaviour is consistent, aligned with madvise()
generally, and also importantly, maintained by mm :)

>
> For comparison, personality flags are explicitly supposed to persist
> across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> cleared only if the execution is privileged. (Annoyingly, the
> PER_CLEAR_ON_SETID handling is currently implemented separately for
> each type of privileged execution we can have
> [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> transition], but I guess you could probably gate it on
> bprm->secureexec instead...).
>
> It would be nice if you could either make this a property of the
> mm_struct that does not persist across exec, or if that would break
> your intended usecase, alternatively wipe it on privileged execution.

The use case specifically requires persistence, unfortunately (we are still
determining whether this makes sense however - it is by no means a 'done
deal' that we're accepting this as a thing).

I suppose wiping on privileged execution could be achieved by storing a
mask of these permitted flags and clearing that mask in mm->def_flags at
this point?

Mightn't a better solution be to just require ns_capable(CAP_SYS_ADMIN)?

Usama - would wiping these flags on privileged execution, or requiring a
privileged process to be able to do this break your use case?

>
> > Since this is a very powerful option that would make no sense for many
> > advice modes, we explicitly only permit known-safe flags here (currently
> > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> Ah, sort of like a more generic version of prctl(PR_SET_THP_DISABLE,
> flag, 0, 0, 0) that also allows opt-in on top of opt-out.

Right.

Thanks for taking a look at this! Your input on the security side is
absolutely invaluable :)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-19 20:52 ` [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT " Lorenzo Stoakes
@ 2025-05-20  8:38   ` Pedro Falcato
  2025-05-20 10:21     ` Lorenzo Stoakes
  2025-05-20 16:11     ` Jann Horn
  2025-05-20 22:26   ` Johannes Weiner
  1 sibling, 2 replies; 57+ messages in thread
From: Pedro Falcato @ 2025-05-20  8:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> It's useful in certain cases to be able to default-enable an madvise() flag
> for all newly mapped VMAs, and for that to survive fork/exec.
> 
> The natural place to specify something like this is in an madvise()
> invocation, and thus providing this functionality as a flag to
> process_madvise() makes sense.
> 
> We intentionally limit this only to flags that we know should function
> correctly without issue, and to be conservative about this, so we initially
> limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> 
> We implement this functionality by using the mm_struct->def_flags field.

This seems super specific. How about this:

- PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
- PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
- PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
  (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
  FUTURE nor INHERIT_FORK.

and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.

Thoughts?

-- 
Pedro

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20  8:38   ` Pedro Falcato
@ 2025-05-20 10:21     ` Lorenzo Stoakes
  2025-05-20 11:41       ` Pedro Falcato
  2025-05-20 16:11     ` Jann Horn
  1 sibling, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 10:21 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > It's useful in certain cases to be able to default-enable an madvise() flag
> > for all newly mapped VMAs, and for that to survive fork/exec.
> >
> > The natural place to specify something like this is in an madvise()
> > invocation, and thus providing this functionality as a flag to
> > process_madvise() makes sense.
> >
> > We intentionally limit this only to flags that we know should function
> > correctly without issue, and to be conservative about this, so we initially
> > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> >
> > We implement this functionality by using the mm_struct->def_flags field.
>
> This seems super specific. How about this:
>
> - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
>   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
>   FUTURE nor INHERIT_FORK.

I don't know how we could implement separate current process, fork, exec, fork/exec.
mm->def_flags is propagated this way automatically.

And again on the security stuff, I think the correct answer is to require sys
admin capability to be able to use this option _at all_. This simplifies
everything.

To have this kind of thing we'd have to add a whole new mechanism, literally
just for this, and I'd really rather not generate brand new mm_struct flags for
every possible mode (in fact that would probably makes the whole thing
intractible), or add a new field there for this.

The idea is that we get the advantages of an improved madvise interface, while
also providing the interface Usama wants without having to add some hideous
prctl() whose logic is disconnected from the rest of madvise(), while being, in
effect, a 'default madvise() for new mappings'.

So while specific to the case, nothing prevents us in future adding more
functionality if we want.

We could also potentially:

- add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
- add PMADV_INHERIT_FORK
- add PMADV_INHERIT_EXEC

And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
now.

THen we could have the security semantics you specify (require cap sys admin on
PMADV_INHERIT_EXEC) but have that propagate to the only supported case.

What do you think?

>
> and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.

I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
they have different semantics. I'd rather keep these flags descriptive. Though
I'm open to alternative naming of course...

Also keep in mind these flags are for mlockall(), whose name already tells you
you're locking everything :)

>
> Thoughts?
>
> --
> Pedro

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/5] mm: madvise: refactor madvise_populate()
  2025-05-19 20:52 ` [RFC PATCH 1/5] mm: madvise: refactor madvise_populate() Lorenzo Stoakes
@ 2025-05-20 10:30   ` David Hildenbrand
  2025-05-20 10:36     ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 10:30 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, linux-mm, linux-arch, linux-kernel,
	SeongJae Park, Usama Arif

On 19.05.25 22:52, Lorenzo Stoakes wrote:
> Use a for-loop rather than a while with the update of the start argument at
> the end of the while-loop.
> 
> This is in preparation for a subsequent commit which modifies this
> function, we therefore separate the refactoring from the actual change
> cleanly by separating the two.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>   mm/madvise.c | 39 ++++++++++++++++++++-------------------
>   1 file changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 8433ac9b27e0..63cc69daa4c7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -967,32 +967,33 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
>   	int locked = 1;
>   	long pages;
>   
> -	while (start < end) {
> +	for (; start < end; start += pages * PAGE_SIZE) {
>   		/* Populate (prefault) page tables readable/writable. */
>   		pages = faultin_page_range(mm, start, end, write, &locked);
>   		if (!locked) {
>   			mmap_read_lock(mm);
>   			locked = 1;
>   		}
> -		if (pages < 0) {
> -			switch (pages) {
> -			case -EINTR:
> -				return -EINTR;
> -			case -EINVAL: /* Incompatible mappings / permissions. */
> -				return -EINVAL;
> -			case -EHWPOISON:
> -				return -EHWPOISON;
> -			case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> -				return -EFAULT;
> -			default:
> -				pr_warn_once("%s: unhandled return value: %ld\n",
> -					     __func__, pages);
> -				fallthrough;
> -			case -ENOMEM: /* No VMA or out of memory. */
> -				return -ENOMEM;
> -			}
> +
> +		if (pages >= 0)
> +			continue;
> +
> +		switch (pages) {
> +		case -EINTR:
> +			return -EINTR;
> +		case -EINVAL: /* Incompatible mappings / permissions. */
> +			return -EINVAL;
> +		case -EHWPOISON:
> +			return -EHWPOISON;
> +		case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> +			return -EFAULT;
> +		default:
> +			pr_warn_once("%s: unhandled return value: %ld\n",
> +				     __func__, pages);
> +			fallthrough;
> +		case -ENOMEM: /* No VMA or out of memory. */
> +			return -ENOMEM;

Can we limit it to what the patch description says? "Use a for-loop 
rather than a while", or will that be a problem for the follow-up patch?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/5] mm: madvise: refactor madvise_populate()
  2025-05-20 10:30   ` David Hildenbrand
@ 2025-05-20 10:36     ` Lorenzo Stoakes
  2025-05-20 10:42       ` David Hildenbrand
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 10:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 12:30:24PM +0200, David Hildenbrand wrote:
> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> > Use a for-loop rather than a while with the update of the start argument at
> > the end of the while-loop.
> >
> > This is in preparation for a subsequent commit which modifies this
> > function, we therefore separate the refactoring from the actual change
> > cleanly by separating the two.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >   mm/madvise.c | 39 ++++++++++++++++++++-------------------
> >   1 file changed, 20 insertions(+), 19 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 8433ac9b27e0..63cc69daa4c7 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -967,32 +967,33 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
> >   	int locked = 1;
> >   	long pages;
> > -	while (start < end) {
> > +	for (; start < end; start += pages * PAGE_SIZE) {
> >   		/* Populate (prefault) page tables readable/writable. */
> >   		pages = faultin_page_range(mm, start, end, write, &locked);
> >   		if (!locked) {
> >   			mmap_read_lock(mm);
> >   			locked = 1;
> >   		}
> > -		if (pages < 0) {
> > -			switch (pages) {
> > -			case -EINTR:
> > -				return -EINTR;
> > -			case -EINVAL: /* Incompatible mappings / permissions. */
> > -				return -EINVAL;
> > -			case -EHWPOISON:
> > -				return -EHWPOISON;
> > -			case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> > -				return -EFAULT;
> > -			default:
> > -				pr_warn_once("%s: unhandled return value: %ld\n",
> > -					     __func__, pages);
> > -				fallthrough;
> > -			case -ENOMEM: /* No VMA or out of memory. */
> > -				return -ENOMEM;
> > -			}
> > +
> > +		if (pages >= 0)
> > +			continue;
> > +
> > +		switch (pages) {
> > +		case -EINTR:
> > +			return -EINTR;
> > +		case -EINVAL: /* Incompatible mappings / permissions. */
> > +			return -EINVAL;
> > +		case -EHWPOISON:
> > +			return -EHWPOISON;
> > +		case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> > +			return -EFAULT;
> > +		default:
> > +			pr_warn_once("%s: unhandled return value: %ld\n",
> > +				     __func__, pages);
> > +			fallthrough;
> > +		case -ENOMEM: /* No VMA or out of memory. */
> > +			return -ENOMEM;
>
> Can we limit it to what the patch description says? "Use a for-loop rather
> than a while", or will that be a problem for the follow-up patch?

Well, kind of the point is that we can remove a level of indentation also, which
then makes life easier in subsequent patch.

Happy to change description or break into two (but that seems a bit over the top
maybe? :>)

Idea is that we clearly separate out the refactoring bit from the actual change
to the logic so it's not a pain to bisect/review.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/5] mm: madvise: refactor madvise_populate()
  2025-05-20 10:36     ` Lorenzo Stoakes
@ 2025-05-20 10:42       ` David Hildenbrand
  2025-05-22 12:32         ` Mike Rapoport
  0 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 10:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On 20.05.25 12:36, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 12:30:24PM +0200, David Hildenbrand wrote:
>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
>>> Use a for-loop rather than a while with the update of the start argument at
>>> the end of the while-loop.
>>>
>>> This is in preparation for a subsequent commit which modifies this
>>> function, we therefore separate the refactoring from the actual change
>>> cleanly by separating the two.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>    mm/madvise.c | 39 ++++++++++++++++++++-------------------
>>>    1 file changed, 20 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>> index 8433ac9b27e0..63cc69daa4c7 100644
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -967,32 +967,33 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
>>>    	int locked = 1;
>>>    	long pages;
>>> -	while (start < end) {
>>> +	for (; start < end; start += pages * PAGE_SIZE) {
>>>    		/* Populate (prefault) page tables readable/writable. */
>>>    		pages = faultin_page_range(mm, start, end, write, &locked);
>>>    		if (!locked) {
>>>    			mmap_read_lock(mm);
>>>    			locked = 1;
>>>    		}
>>> -		if (pages < 0) {
>>> -			switch (pages) {
>>> -			case -EINTR:
>>> -				return -EINTR;
>>> -			case -EINVAL: /* Incompatible mappings / permissions. */
>>> -				return -EINVAL;
>>> -			case -EHWPOISON:
>>> -				return -EHWPOISON;
>>> -			case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
>>> -				return -EFAULT;
>>> -			default:
>>> -				pr_warn_once("%s: unhandled return value: %ld\n",
>>> -					     __func__, pages);
>>> -				fallthrough;
>>> -			case -ENOMEM: /* No VMA or out of memory. */
>>> -				return -ENOMEM;
>>> -			}
>>> +
>>> +		if (pages >= 0)
>>> +			continue;
>>> +
>>> +		switch (pages) {
>>> +		case -EINTR:
>>> +			return -EINTR;
>>> +		case -EINVAL: /* Incompatible mappings / permissions. */
>>> +			return -EINVAL;
>>> +		case -EHWPOISON:
>>> +			return -EHWPOISON;
>>> +		case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
>>> +			return -EFAULT;
>>> +		default:
>>> +			pr_warn_once("%s: unhandled return value: %ld\n",
>>> +				     __func__, pages);
>>> +			fallthrough;
>>> +		case -ENOMEM: /* No VMA or out of memory. */
>>> +			return -ENOMEM;
>>
>> Can we limit it to what the patch description says? "Use a for-loop rather
>> than a while", or will that be a problem for the follow-up patch?
> 
> Well, kind of the point is that we can remove a level of indentation also, which
> then makes life easier in subsequent patch.
> 
> Happy to change description or break into two (but that seems a bit over the top
> maybe? :>)

Probably just mention it, otherwise it looks a bit like unrelated churn :)

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20 10:21     ` Lorenzo Stoakes
@ 2025-05-20 11:41       ` Pedro Falcato
  2025-05-20 13:39         ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: Pedro Falcato @ 2025-05-20 11:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 11:21:33AM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> > On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > > It's useful in certain cases to be able to default-enable an madvise() flag
> > > for all newly mapped VMAs, and for that to survive fork/exec.
> > >
> > > The natural place to specify something like this is in an madvise()
> > > invocation, and thus providing this functionality as a flag to
> > > process_madvise() makes sense.
> > >
> > > We intentionally limit this only to flags that we know should function
> > > correctly without issue, and to be conservative about this, so we initially
> > > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> > >
> > > We implement this functionality by using the mm_struct->def_flags field.
> >
> > This seems super specific. How about this:
> >
> > - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> > - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
> >   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
> >   FUTURE nor INHERIT_FORK.
> 
> I don't know how we could implement separate current process, fork, exec, fork/exec.
> mm->def_flags is propagated this way automatically.
> 
> And again on the security stuff, I think the correct answer is to require sys
> admin capability to be able to use this option _at all_. This simplifies
> everything.
> 
> To have this kind of thing we'd have to add a whole new mechanism, literally
> just for this, and I'd really rather not generate brand new mm_struct flags for
> every possible mode (in fact that would probably makes the whole thing
> intractible), or add a new field there for this.
> 
> The idea is that we get the advantages of an improved madvise interface, while
> also providing the interface Usama wants without having to add some hideous
> prctl() whose logic is disconnected from the rest of madvise(), while being, in
> effect, a 'default madvise() for new mappings'.
> 
> So while specific to the case, nothing prevents us in future adding more
> functionality if we want.
> 
> We could also potentially:
> 
> - add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
> - add PMADV_INHERIT_FORK
> - add PMADV_INHERIT_EXEC
> 
> And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
> now.
>
> THen we could have the security semantics you specify (require cap sys admin on
> PMADV_INHERIT_EXEC) but have that propagate to the only supported case.
> 
> What do you think?
>

If you don't want to add new fields, this option seems fine.
And then if any other usecase pops up, we're ready.
 
> >
> > and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.
> 
> I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
> they have different semantics. I'd rather keep these flags descriptive. Though
> I'm open to alternative naming of course...

Semantics are similar I think? And I do think getting shorter names is a good
idea, however I won't insist too hard on this.

-- 
Pedro

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20 11:41       ` Pedro Falcato
@ 2025-05-20 13:39         ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 13:39 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 12:41:00PM +0100, Pedro Falcato wrote:
> On Tue, May 20, 2025 at 11:21:33AM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 09:38:50AM +0100, Pedro Falcato wrote:
> > > On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > > > It's useful in certain cases to be able to default-enable an madvise() flag
> > > > for all newly mapped VMAs, and for that to survive fork/exec.
> > > >
> > > > The natural place to specify something like this is in an madvise()
> > > > invocation, and thus providing this functionality as a flag to
> > > > process_madvise() makes sense.
> > > >
> > > > We intentionally limit this only to flags that we know should function
> > > > correctly without issue, and to be conservative about this, so we initially
> > > > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > > > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
> > > >
> > > > We implement this functionality by using the mm_struct->def_flags field.
> > >
> > > This seems super specific. How about this:
> > >
> > > - PMADV_FUTURE (mirrors MCL_FUTURE). This only applies the flag to future VMAs in the current process.
> > > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
> > > - PMADV_INHERIT_EXEC. This makes it so the flag is propagated through the execve boundary
> > >   (and this is where we'd filter for 'safe' flags, at least through the secureexec boundary). Does not imply
> > >   FUTURE nor INHERIT_FORK.
> >
> > I don't know how we could implement separate current process, fork, exec, fork/exec.
> > mm->def_flags is propagated this way automatically.
> >
> > And again on the security stuff, I think the correct answer is to require sys
> > admin capability to be able to use this option _at all_. This simplifies
> > everything.
> >
> > To have this kind of thing we'd have to add a whole new mechanism, literally
> > just for this, and I'd really rather not generate brand new mm_struct flags for
> > every possible mode (in fact that would probably makes the whole thing
> > intractible), or add a new field there for this.
> >
> > The idea is that we get the advantages of an improved madvise interface, while
> > also providing the interface Usama wants without having to add some hideous
> > prctl() whose logic is disconnected from the rest of madvise(), while being, in
> > effect, a 'default madvise() for new mappings'.
> >
> > So while specific to the case, nothing prevents us in future adding more
> > functionality if we want.
> >
> > We could also potentially:
> >
> > - add PMADV_SET_DEFAULT (I'm iffy about PMADV_FUTURE... but whichever we go with)
> > - add PMADV_INHERIT_FORK
> > - add PMADV_INHERIT_EXEC
> >
> > And only support PMADV_SET_DEFAULT | PMADV_INHERIT_FORK | PMADV_INHERIT_EXEC for
> > now.
> >
> > THen we could have the security semantics you specify (require cap sys admin on
> > PMADV_INHERIT_EXEC) but have that propagate to the only supported case.
> >
> > What do you think?
> >
>
> If you don't want to add new fields, this option seems fine.
> And then if any other usecase pops up, we're ready.

Yeah sounds fair, will do on respin!

>
> > >
> > > and, while we're at it, rename PMADV_ENTIRE_ADDRESS_SPACE to PMADV_CURRENT, to align it with MCL_CURRENT.
> >
> > I'm not sure making the mlock()/madvise() stuff analagous is a good idea, as
> > they have different semantics. I'd rather keep these flags descriptive. Though
> > I'm open to alternative naming of course...
>
> Semantics are similar I think? And I do think getting shorter names is a good
> idea, however I won't insist too hard on this.

Yeah perhaps with _ALL_ thrown in to make this clear... :) warming to it... ;)

>
> --
> Pedro

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-05-19 21:53 ` [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Jann Horn
@ 2025-05-20 15:28 ` David Hildenbrand
  2025-05-20 17:47   ` Lorenzo Stoakes
  2025-05-20 18:25 ` Shakeel Butt
  2025-05-22 12:12 ` Mike Rapoport
  8 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 15:28 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, linux-mm, linux-arch, linux-kernel,
	SeongJae Park, Usama Arif

On 19.05.25 22:52, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 

In general, sounds like an interesting approach.

> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.

I would focus only on adding flags that we absolutely need to make the 
use case we have in mind work. We can always add other flags as we see 
fit for them (IOW, when really required ;) ).

Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required, 
right?

> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.

Again, is this really required? Couldn't we glue that to 
PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect 
anybody to have something mapped into *the entire address space*.

Well, okay, maybe on 32bit, but still ... :)

> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.

This is very specific for MADV_HUGEPAGE only, so I wonder how we could 
either avoid that flag or just make it clear that it shall stick around ...

Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :(

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20  5:35   ` Lorenzo Stoakes
@ 2025-05-20 16:04     ` Jann Horn
  2025-05-20 16:14       ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: Jann Horn @ 2025-05-20 16:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Arnd Bergmann, Christian Brauner, linux-mm,
	linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> > For comparison, personality flags are explicitly supposed to persist
> > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> > cleared only if the execution is privileged. (Annoyingly, the
> > PER_CLEAR_ON_SETID handling is currently implemented separately for
> > each type of privileged execution we can have
> > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> > transition], but I guess you could probably gate it on
> > bprm->secureexec instead...).
> >
> > It would be nice if you could either make this a property of the
> > mm_struct that does not persist across exec, or if that would break
> > your intended usecase, alternatively wipe it on privileged execution.
>
> The use case specifically requires persistence, unfortunately (we are still
> determining whether this makes sense however - it is by no means a 'done
> deal' that we're accepting this as a thing).
>
> I suppose wiping on privileged execution could be achieved by storing a
> mask of these permitted flags and clearing that mask in mm->def_flags at
> this point?

Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through
mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is
inheritable? Hmmmm... I guess turning hugepages _off_ should be
fine...

Yeah I guess I'd do this by adding another bitmask
VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying
that bitmask on setuid execution.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20  8:38   ` Pedro Falcato
  2025-05-20 10:21     ` Lorenzo Stoakes
@ 2025-05-20 16:11     ` Jann Horn
  2025-05-20 16:19       ` Lorenzo Stoakes
  1 sibling, 1 reply; 57+ messages in thread
From: Jann Horn @ 2025-05-20 16:11 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Arnd Bergmann,
	Christian Brauner, linux-mm, linux-arch, linux-kernel,
	SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)

Do we think there will be flags settable through this API that we
explicitly _don't_ want to inherit on fork()? My understanding is that
sort of the default for fork() is to inherit everything, and things
that don't get inherited are weird special cases (like mlock() state).
(While the default for execve() is to inherit nothing about the MM.)

(I guess you could make a case that in a fork+exec sequence, the child
might not want to create hugepage between fork and exec... but this is
probably not the right place to control that?)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 16:04     ` Jann Horn
@ 2025-05-20 16:14       ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 16:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Arnd Bergmann, Christian Brauner, linux-mm,
	linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 06:04:47PM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 7:36 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Mon, May 19, 2025 at 11:53:43PM +0200, Jann Horn wrote:
> > > For comparison, personality flags are explicitly supposed to persist
> > > across execve, but they can be dangerous (stuff like READ_IMPLIES_EXEC
> > > and ADDR_NO_RANDOMIZE), so we have PER_CLEAR_ON_SETID which gets
> > > cleared only if the execution is privileged. (Annoyingly, the
> > > PER_CLEAR_ON_SETID handling is currently implemented separately for
> > > each type of privileged execution we can have
> > > [setuid/setgid/fscaps/selinux transition/apparmor transition/smack
> > > transition], but I guess you could probably gate it on
> > > bprm->secureexec instead...).
> > >
> > > It would be nice if you could either make this a property of the
> > > mm_struct that does not persist across exec, or if that would break
> > > your intended usecase, alternatively wipe it on privileged execution.
> >
> > The use case specifically requires persistence, unfortunately (we are still
> > determining whether this makes sense however - it is by no means a 'done
> > deal' that we're accepting this as a thing).
> >
> > I suppose wiping on privileged execution could be achieved by storing a
> > mask of these permitted flags and clearing that mask in mm->def_flags at
> > this point?
>
> Oh, I see, we're already inheriting VM_NOHUGEPAGE on execve through
> mm->def_flags, with the bitmask VM_INIT_DEF_MASK controlling what is
> inheritable? Hmmmm... I guess turning hugepages _off_ should be
> fine...
>
> Yeah I guess I'd do this by adding another bitmask
> VM_INIT_DEF_MASK_SECUREEXEC or something like that, and then applying
> that bitmask on setuid execution.

I guess we could do it this way, as it would only otherwise limit a non-sys
admin user, and we should try to keep things as flexible as possible.

Let me do this for v2 and see how it works.

As it seems there's some general traction here I can also write some
tests...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20 16:11     ` Jann Horn
@ 2025-05-20 16:19       ` Lorenzo Stoakes
  2025-05-20 16:35         ` David Hildenbrand
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 16:19 UTC (permalink / raw)
  To: Jann Horn
  Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Arnd Bergmann, Christian Brauner, linux-mm,
	linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 06:11:49PM +0200, Jann Horn wrote:
> On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
> > - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
>
> Do we think there will be flags settable through this API that we
> explicitly _don't_ want to inherit on fork()? My understanding is that
> sort of the default for fork() is to inherit everything, and things
> that don't get inherited are weird special cases (like mlock() state).
> (While the default for execve() is to inherit nothing about the MM.)

Yeah this is true. It is the exception rather than the rule...

>
> (I guess you could make a case that in a fork+exec sequence, the child
> might not want to create hugepage between fork and exec... but this is
> probably not the right place to control that?)

From my point of view it simply reads better :)

But perhaps we can drop the fork bit and leave that implied.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20 16:19       ` Lorenzo Stoakes
@ 2025-05-20 16:35         ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 16:35 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jann Horn
  Cc: Pedro Falcato, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On 20.05.25 18:19, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 06:11:49PM +0200, Jann Horn wrote:
>> On Tue, May 20, 2025 at 10:38 AM Pedro Falcato <pfalcato@suse.de> wrote:
>>> - PMADV_INHERIT_FORK. This makes it so the flag is propagated to child processes (does not imply PMADV_FUTURE)
>>
>> Do we think there will be flags settable through this API that we
>> explicitly _don't_ want to inherit on fork()? My understanding is that
>> sort of the default for fork() is to inherit everything, and things
>> that don't get inherited are weird special cases (like mlock() state).
>> (While the default for execve() is to inherit nothing about the MM.)
> 
> Yeah this is true. It is the exception rather than the rule...
> 
>>
>> (I guess you could make a case that in a fork+exec sequence, the child
>> might not want to create hugepage between fork and exec... but this is
>> probably not the right place to control that?)
> 
>  From my point of view it simply reads better :)
> 
> But perhaps we can drop the fork bit and leave that implied.

Yes, we should. Exec is where the "fun" begins.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 15:28 ` David Hildenbrand
@ 2025-05-20 17:47   ` Lorenzo Stoakes
  2025-05-20 18:24     ` Usama Arif
  2025-05-20 18:25     ` Lorenzo Stoakes
  0 siblings, 2 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 17:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> > REVIEWERS NOTES:
> > ================
> >
> > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > 'putting it out there' for feedback. Any serious version of this will add a
> > bunch of self-tests to assert correct behaviour and I will more carefully
> > confirm everything's working.
> >
> > This is based on discussion arising from Usama's series [0], SJ's input on
> > the thread around process_madvise() behaviour [1] (and a subsequent
> > response by me [2]) and prior discussion about a new madvise() interface
> > [3].
> >
> > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >
> > ================
> >
> > Currently, we are rather restricted in how madvise() operations
> > proceed. While effort has been put in to expanding what process_madvise()
> > can do (that is - unrestricted application of advice to the local process
> > alongside recent improvements on the efficiency of TLB operations over
> > these batvches), we are still constrained by existing madvise() limitations
> > and default behaviours.
> >
> > This series makes use of the currently unused flags field in
> > process_madvise() to provide more flexiblity.
> >
>
> In general, sounds like an interesting approach.

Thanks!

If you agree this is workable, then I'll go ahead and put some more effort
into writing tests etc. on the next respin.

>
> > It introduces four flags:
> >
> > 1. PMADV_SKIP_ERRORS
> >
> > Currently, when an error arises applying advice in any individual VMA
> > (keeping in mind that a range specified to madvise() or as part of the
> > iovec passed to process_madvise()), the operation stops where it is and
> > returns an error.
> >
> > This might not be the desired behaviour of the user, who may wish instead
> > for the operation to be 'best effort'. By setting this flag, that behaviour
> > is obtained.
> >
> > Since process_madvise() would trivially, if skipping errors, simply return
> > the input vector size, we instead return the number of entries in the
> > vector which completed successfully without error.
>
> I would focus only on adding flags that we absolutely need to make the use
> case we have in mind work. We can always add other flags as we see fit for
> them (IOW, when really required ;) ).
>
> Regarding MADV_HUGEPAGE / MADV_NOHUGEPAGE, this would not be required,
> right?

Sure, we can restrict this to only supported flags to be conservative.

The idea was though that somebody might want to simply do a 'best effort'
change.

However at the same time it's possibly a wee bit dangerous...

>
> >
> > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> >
> > 2. PMADV_NO_ERROR_ON_UNMAPPED
> >
> > Currently madvise() has the peculiar behaviour of, if the range specified
> > to it contains unmapped range(s), completing the full operation, but
> > ultimately returning -ENOMEM.
> >
> > In the case of process_madvise(), this is fatal, as the operation will stop
> > immediately upon this occurring.
> >
> > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > unmapped areas to simply be entirely ignored.
>
> Again, is this really required? Couldn't we glue that to
> PMADV_ENTIRE_ADDRESS_SPACE for our use case? After all, I don't expect
> anybody to have something mapped into *the entire address space*.

Well, I think it's an ongoing issue that unmapped entries cause the whole thing
to break, I do think it makes sense to make this _generally_ available, actually.

Obviously we should probably make PMADV_ENTIRE_MAPPING imply
PMADV_NO_ERROR_ON_UNMAPPED in the same way that PMADV_SKIP_ERRORS implies
PMADV_NO_ERROR_ON_UNMAPPED.

And yes I don't think any sane person would map the entirety of the 64-bit
address space :P

>
> Well, okay, maybe on 32bit, but still ... :)

32 what? :P I deny its existence... (ugh ok I guess I have to ack it, but
even in that case it's not very likely either :)

>
> >
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
>
> This is very specific for MADV_HUGEPAGE only, so I wonder how we could
> either avoid that flag or just make it clear that it shall stick around ...
>
> Having that sad, PMADV_SET_FORK_EXEC_DEFAULT is rather a suboptimal name :(

Yeah it's horrid, see Pedro's suggestions, e.g. PMADV_SET_DEFAULT |
PMADV_INHERIT_EXEC.

I even wonder about PMADV_, but that's probably fine. PRMADV sucks, PADV
sort of loses the mm bit, PMADV is probably best we can do!

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 17:47   ` Lorenzo Stoakes
@ 2025-05-20 18:24     ` Usama Arif
  2025-05-20 19:21       ` Lorenzo Stoakes
  2025-05-20 18:25     ` Lorenzo Stoakes
  1 sibling, 1 reply; 57+ messages in thread
From: Usama Arif @ 2025-05-20 18:24 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Johannes Weiner, Shakeel Butt,
	Zi Yan

On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
>>> REVIEWERS NOTES:
>>> ================
>>>
>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
>>> 'putting it out there' for feedback. Any serious version of this will add a
>>> bunch of self-tests to assert correct behaviour and I will more carefully
>>> confirm everything's working.
>>>
>>> This is based on discussion arising from Usama's series [0], SJ's input on
>>> the thread around process_madvise() behaviour [1] (and a subsequent
>>> response by me [2]) and prior discussion about a new madvise() interface
>>> [3].
>>>
>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
>>>
>>> ================
>>>
>>> Currently, we are rather restricted in how madvise() operations
>>> proceed. While effort has been put in to expanding what process_madvise()
>>> can do (that is - unrestricted application of advice to the local process
>>> alongside recent improvements on the efficiency of TLB operations over
>>> these batvches), we are still constrained by existing madvise() limitations
>>> and default behaviours.
>>>
>>> This series makes use of the currently unused flags field in
>>> process_madvise() to provide more flexiblity.
>>>
>>
>> In general, sounds like an interesting approach.
> 
> Thanks!
> 
> If you agree this is workable, then I'll go ahead and put some more effort
> into writing tests etc. on the next respin.
> 

So the prctl and process_madvise patches both are trying to accomplish a
similar end goal.

Would it make sense to discuss what would be the best way forward before we
continue developing the solutions? If we are not at that stage and a clear
picture has not formed yet, happy to continue refining the solutions.
But just thought I would check. 

I feel like changing process_madvise which was designed to work on an array
of iovec structures to have flags to skip errors and ignore the iovec
makes it function similar to a prctl call is not the right approach.
IMHO, prctl is a more direct solution to this.

I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
But if in the end, we want to have a THP auto way which is truly transparent,
would it not be better to just have this as prctl and not change the madvise
structure? Maybe in a few years we wont need any of this, and it will be lost
in prctl and madvise wouldn't have changed for this?

Again, this is just to have a discussion (and not an aggressive argument :)),
and would love to get feedback from everyone in the community.
If its too early to have this discussion, its completely fine and we can
still keep developing the RFCs :)

Thanks!
Usama

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2025-05-20 15:28 ` David Hildenbrand
@ 2025-05-20 18:25 ` Shakeel Butt
  2025-05-20 18:45   ` Lorenzo Stoakes
  2025-05-22 12:12 ` Mike Rapoport
  8 siblings, 1 reply; 57+ messages in thread
From: Shakeel Butt @ 2025-05-20 18:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 
> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.
> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.

Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
PMADV_SKIP_ERRORS is not enough? I don't see a need for
PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?

> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.
> 
> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).

Other flags seems general enough but this one is just weird. This is
exactly the scenario for prctl() like interface. You are trying to make
process_madvise() like prctl() and I can see process_madvise() would be
included in all the hate that prctl() receives. 

Let me ask in a different way. Eventually we want to be in a state where
hugepages works out of the box for all workloads. In that state what
would the need for this flag unless you have use-cases other than
hugepages. To me, there is a general consensus that prctl is a hacky
interface, so having some intermediate solution through prctl until
hugepages are good out of the box seems more reasonable.

> 
> 4. PMADV_ENTIRE_ADDRESS_SPACE
> 
> It can be annoying, should a user wish to apply madvise() to all VMAs in an
> address space, to have to add a singular large entry to the input iovec.
> 
> So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> we expect the user to pass NULL and -1 to the vec and vlen parameters
> respectively so they explicitly acknowledge that these will be ignored,
> e.g.:
> 
> 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> 

I still don't see a need for this flag. Why not the following?

process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 17:47   ` Lorenzo Stoakes
  2025-05-20 18:24     ` Usama Arif
@ 2025-05-20 18:25     ` Lorenzo Stoakes
  2025-05-20 18:39       ` David Hildenbrand
  1 sibling, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 18:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
[snip]
> > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > >
> > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > address space default to having an madvise() behaviour established by
> > > default, in such a fashion as that this persists across fork/exec.
> >
> > This is very specific for MADV_HUGEPAGE only, so I wonder how we could
> > either avoid that flag or just make it clear that it shall stick around ...
> >

Sorry missed this bit.

I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE)
defaulting to doing this, I think that's far less clear than a user explicitly
asking for it, plus most users using process_madvise() wouldn't expect it.

So restricting this to these flags only and rejecting all others should cover
this off fine I think.

[snip]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 18:25     ` Lorenzo Stoakes
@ 2025-05-20 18:39       ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2025-05-20 18:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On 20.05.25 20:25, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 06:47:48PM +0100, Lorenzo Stoakes wrote:
>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> [snip]
>>>> 3. PMADV_SET_FORK_EXEC_DEFAULT
>>>>
>>>> It may be desirable for a user to specify that all VMAs mapped in a process
>>>> address space default to having an madvise() behaviour established by
>>>> default, in such a fashion as that this persists across fork/exec.
>>>
>>> This is very specific for MADV_HUGEPAGE only, so I wonder how we could
>>> either avoid that flag or just make it clear that it shall stick around ...
>>>
> 
> Sorry missed this bit.
> 
> I don't really like the idea of only for MADV_HUGEPAGE (and MADV_NOHUGEPAGE)
> defaulting to doing this, I think that's far less clear than a user explicitly
> asking for it, plus most users using process_madvise() wouldn't expect it.

The thing is that PMADV_ENTIRE_ADDRESS_SPACE already adds something 
unexpected: suddenly also *new* mappings will be affected.

But maybe that'd be covered by the PMADV_SET_DEFAULT idea.

Maybe

PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SET_DEFAULT | PMADV_SET_EXEC_DEFAULT

... hm, but "PMADV_ENTIRE_ADDRESS_SPACE" can then simply be "pass in the 
entire address space as a range" ... and ignore errors.

I'd probably have to see a revised proposal after the current discussions.


Anyhow, limiting the flags to a bare minimum for now usually makes 
sense, as long as it is expandable.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 18:25 ` Shakeel Butt
@ 2025-05-20 18:45   ` Lorenzo Stoakes
  2025-05-20 19:49     ` Shakeel Butt
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 18:45 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > REVIEWERS NOTES:
> > ================
> >
> > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > 'putting it out there' for feedback. Any serious version of this will add a
> > bunch of self-tests to assert correct behaviour and I will more carefully
> > confirm everything's working.
> >
> > This is based on discussion arising from Usama's series [0], SJ's input on
> > the thread around process_madvise() behaviour [1] (and a subsequent
> > response by me [2]) and prior discussion about a new madvise() interface
> > [3].
> >
> > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >
> > ================
> >
> > Currently, we are rather restricted in how madvise() operations
> > proceed. While effort has been put in to expanding what process_madvise()
> > can do (that is - unrestricted application of advice to the local process
> > alongside recent improvements on the efficiency of TLB operations over
> > these batvches), we are still constrained by existing madvise() limitations
> > and default behaviours.
> >
> > This series makes use of the currently unused flags field in
> > process_madvise() to provide more flexiblity.
> >
> > It introduces four flags:
> >
> > 1. PMADV_SKIP_ERRORS
> >
> > Currently, when an error arises applying advice in any individual VMA
> > (keeping in mind that a range specified to madvise() or as part of the
> > iovec passed to process_madvise()), the operation stops where it is and
> > returns an error.
> >
> > This might not be the desired behaviour of the user, who may wish instead
> > for the operation to be 'best effort'. By setting this flag, that behaviour
> > is obtained.
> >
> > Since process_madvise() would trivially, if skipping errors, simply return
> > the input vector size, we instead return the number of entries in the
> > vector which completed successfully without error.
> >
> > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> >
> > 2. PMADV_NO_ERROR_ON_UNMAPPED
> >
> > Currently madvise() has the peculiar behaviour of, if the range specified
> > to it contains unmapped range(s), completing the full operation, but
> > ultimately returning -ENOMEM.
> >
> > In the case of process_madvise(), this is fatal, as the operation will stop
> > immediately upon this occurring.
> >
> > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > unmapped areas to simply be entirely ignored.
>
> Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> PMADV_SKIP_ERRORS is not enough? I don't see a need for
> PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?

I thought I already explained this above:

	"In the case of process_madvise(), this is fatal, as the operation
	 will stop immediately upon this occurring."

This is somewhat bizarre behaviour. You specify multiple vector entries
spanning different ranges, but one spans some unmapped space and now the
whole process_madvise() operation grinds to a halt, except the vector entry
containing ranges including unmapped space is completed.

This is strange behaviour, and it makes sense to me to optionally disable
this.

If you were looping around doing an madvise(), this would _not_ occur, you
could filter out the -ENOMEM's. It's a really weird peculiarity in
process_madvise().

Moreover, you might not want an error reported, that possibly duplicates
_real_ -ENOMEM errors, when you simply encounter unmapped addresses.

Finally, if you perform an operation across the entire address space as
proposed you may wish to stop on actual error but not on the (inevitable at
least in 64-bit space) gaps you'll encounter.

>
> >
> > 3. PMADV_SET_FORK_EXEC_DEFAULT
> >
> > It may be desirable for a user to specify that all VMAs mapped in a process
> > address space default to having an madvise() behaviour established by
> > default, in such a fashion as that this persists across fork/exec.
> >
> > Since this is a very powerful option that would make no sense for many
> > advice modes, we explicitly only permit known-safe flags here (currently
> > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> Other flags seems general enough but this one is just weird. This is
> exactly the scenario for prctl() like interface. You are trying to make
> process_madvise() like prctl() and I can see process_madvise() would be
> included in all the hate that prctl() receives.

I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
sure what a 'prctl() like interface' means here, and you're not explaining
what you mean by that.

Presumably you mean you find this odd as you feel it sits outside the realm
of madvise() behaviour.

But I'd suggest it does not - the idea is to align _everything_ with
madvise(). Rather than adding an entirely arbitrary function in prctl(), we
are going the other way - keeping everything relating to madvise()-like
modification of memory _in mm_ and _in madvise()_, rather than bitrotting
away in kernel/sys.c.

So we get alignment in the fact that we're saying 'we establish a _default_
madvise flag for a process'.

We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
you guys at meta want while also opening the door to doing so in future if
it makes sense to.

This couldn't be more different than putting some arbitrary code relating
to mm in the 'netherrealm' of prctl().


>
> Let me ask in a different way. Eventually we want to be in a state where
> hugepages works out of the box for all workloads. In that state what
> would the need for this flag unless you have use-cases other than
> hugepages. To me, there is a general consensus that prctl is a hacky
> interface, so having some intermediate solution through prctl until
> hugepages are good out of the box seems more reasonable.

No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
have to be supported. In a future where things are automatic, these may be
no-ops in 'auto' mode.

But the requirement to have these flags will still exist, the requirement
to do madvise() operations will still exist, we're simply expanding this
functionality.

The problem arises the other way around when we shovel mm stuff in
kernel/sys.c.

>
> >
> > 4. PMADV_ENTIRE_ADDRESS_SPACE
> >
> > It can be annoying, should a user wish to apply madvise() to all VMAs in an
> > address space, to have to add a singular large entry to the input iovec.
> >
> > So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> > we expect the user to pass NULL and -1 to the vec and vlen parameters
> > respectively so they explicitly acknowledge that these will be ignored,
> > e.g.:
> >
> > 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> > 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> >
>
> I still don't see a need for this flag. Why not the following?
>
> process_madvise(PIDFD_SELF, NULL, -1, advise, PMADV_SKIP_ERRORS)?

I don't think iovec=NULL, vlen=-1 means 'everything' in any other interface
anywhere else in the kernel? Why would we reasonably expect anybody to
know/assume that here?

It's surely far better to very explicitly say as much? The idea with NULL,
-1 is that you're making sure the user knows any provided iovec, vlen will
be ignored, and better to just disallow the user providing those at all
(similar to mmap requiring -1 for fd for anon mappings).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 18:24     ` Usama Arif
@ 2025-05-20 19:21       ` Lorenzo Stoakes
  2025-05-20 19:42         ` Usama Arif
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 19:21 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park,
	Johannes Weiner, Shakeel Butt, Zi Yan

On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> >> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> >>> REVIEWERS NOTES:
> >>> ================
> >>>
> >>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> >>> 'putting it out there' for feedback. Any serious version of this will add a
> >>> bunch of self-tests to assert correct behaviour and I will more carefully
> >>> confirm everything's working.
> >>>
> >>> This is based on discussion arising from Usama's series [0], SJ's input on
> >>> the thread around process_madvise() behaviour [1] (and a subsequent
> >>> response by me [2]) and prior discussion about a new madvise() interface
> >>> [3].
> >>>
> >>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> >>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> >>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> >>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >>>
> >>> ================
> >>>
> >>> Currently, we are rather restricted in how madvise() operations
> >>> proceed. While effort has been put in to expanding what process_madvise()
> >>> can do (that is - unrestricted application of advice to the local process
> >>> alongside recent improvements on the efficiency of TLB operations over
> >>> these batvches), we are still constrained by existing madvise() limitations
> >>> and default behaviours.
> >>>
> >>> This series makes use of the currently unused flags field in
> >>> process_madvise() to provide more flexiblity.
> >>>
> >>
> >> In general, sounds like an interesting approach.
> >
> > Thanks!
> >
> > If you agree this is workable, then I'll go ahead and put some more effort
> > into writing tests etc. on the next respin.
> >
>
> So the prctl and process_madvise patches both are trying to accomplish a
> similar end goal.
>
> Would it make sense to discuss what would be the best way forward before we
> continue developing the solutions? If we are not at that stage and a clear
> picture has not formed yet, happy to continue refining the solutions.
> But just thought I would check.

As stated in the thread (I went out of my way to link everything above),
and stated with you cc'd in every discussion (I think!), this idea arose as
a concept that came out of my brainstorming whether we could better align
this concept with madvise().

This arose because of discussions had in-thread where we agreed it made
most sense to have this feature in effect perform a 'default madvise()'.

At this point, we are essentially _duplicating what madvise does_ in the
prctl() with your approach, only now all of the code that does that is
bitrotting in kernel/sys.c.

This approach was an attempt to avoid that.

It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
was we could compare and contrast with the prctl() RFC.

Obviously this is gaining some traction now as a concept and your respin
was a little rushed out so needs rework.

So, apologies if it seems this is an attempt to supercede your series or
such, it truly wasn't intended that way, rather I have been working 12 hour
days trying to find the best way possible to _make what you want happen_
while also doing what's best for mm and madvise (among many other tasks of
course :)

So the idea is we can:

a. solve long-standing problems with madvise() that prevent it being used
   for certain things (e.g. the error on gaps which breaks
   process_madvise() and hides real errors)

b. Also provide this functionality in a way that the absolute most minimal
   delta from existing logic...

c. ...While keeping all this logic in mm and avoiding bitrot in
   kernel/sys.c.

So obviously my view is that this approach is superior to the prctl() one.

However the intent was that you would probably take a little longer in
spinning up an RFC, and then we could compare the two approaches, if people
didn't reject my 'crazy idea' :)

Obviously you respan your (kinda ;) RFC way quicker than expected, and then
there were a ton of bugs, which added workload and made it perhaps a little
more pressing as to deciding which should be pursued.

I would suggest holding off on your series until we see whether this one
works on this basis. But obviously this is simply my point of view.

To be clear however, this was not a planned series of events...

I mean equally, we are seeing several other series from Yafang, Nico and
Dev in relation to [m]THP and even a obliquely-related series from Barry,
so it seems we are in contention across many planes here :)

>
> I feel like changing process_madvise which was designed to work on an array
> of iovec structures to have flags to skip errors and ignore the iovec
> makes it function similar to a prctl call is not the right approach.
> IMHO, prctl is a more direct solution to this.

I don't know what you mean by 'function similar to a prctl call', or what
you mean by it being a more direct solution.

The problem with prctl() is there is no pattern, it's a dumping ground for
arbitrary stuff and a prime place for bitrot. It also means we give up
maintainership of mm-specific code.

It encourages misalignment with other interfaces and APIs.

What is being discussed here is an effort to _better align what you want
with an existing interface_ - if we treat this as a 'default madvise()'
plus having additional madvise functionality (apply to all, ignoring errors
e.g.) - and put all this code _alongside existing madvise code_ - this
makes this vastly more maintainable, futureproof and robust.

And keep in mind the proposed flags are _flags_ not default behaviours,
ones which we will carefully restrict to known flags, starting with the
flags you guys need.

>
> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
> But if in the end, we want to have a THP auto way which is truly transparent,
> would it not be better to just have this as prctl and not change the madvise
> structure? Maybe in a few years we wont need any of this, and it will be lost
> in prctl and madvise wouldn't have changed for this?

This really sounds a lot like your colleague Shakeel's objection, so I
don't want to duplicate my response too much, but as I said to him, at this
stage we would set THP mode to 'auto', but still have to support
MADV_[NO]HUGEPAGE.

We may then wish to set these as no-ops in auto mode right? But they'd
still have to exist for uAPI reasons.

So would it be better to do this in mm/madvise.c alongside literally all
other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
would it be better to throw it in kernel/sys.c and hope people remember to
update it?

I think it's pretty clear what the answer to that is.

>
> Again, this is just to have a discussion (and not an aggressive argument :)),
> and would love to get feedback from everyone in the community.
> If its too early to have this discussion, its completely fine and we can
> still keep developing the RFCs :)

I try hard never to be aggressive, I am firm when I feel it is appropriate
to be so (it's important to push back when necessarily I feel), but I
always try to maintain civility as well as I can.

Of course I am imperfect, so apologies if it comes across any other way, I
really do try very hard to maintain a high standard of professionalism
on-list.

Again as I said, I really did not intend to supercede or interfere with
your series, this was just unfortunate timing and throwing out ideas to
reach the best solution.

My objection to prctl() is due to bit-rot, mm code in inappropriate places,
the fact it by nature lends itself to violating conventions and practices
that exist in other mm APIs, not some subjective sense of evil.

It is historically a place where 'things that people don't really care
about but don't quite want to NACK' go also, and to me suggests that the
problem hasn't necessarily been thought through to see how it might be
implemented by extending existing APIs or finding ways to achieve the same
thing that better align with existing intefaces.

To be clear - this concept is _all_ ultimately a product of your series and
your ideas leading the discussion within the community to this point where
a potential alternative has arisen - without which this would not have
occurred.

So the idea here is to simply explore what the best solution is that aligns
with what best serves the community.

>
> Thanks!
> Usama

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 19:21       ` Lorenzo Stoakes
@ 2025-05-20 19:42         ` Usama Arif
  2025-05-20 20:15           ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: Usama Arif @ 2025-05-20 19:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park,
	Johannes Weiner, Shakeel Butt, Zi Yan



On 20/05/2025 20:21, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
>>
>>
>> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
>>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
>>>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
>>>>> REVIEWERS NOTES:
>>>>> ================
>>>>>
>>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
>>>>> 'putting it out there' for feedback. Any serious version of this will add a
>>>>> bunch of self-tests to assert correct behaviour and I will more carefully
>>>>> confirm everything's working.
>>>>>
>>>>> This is based on discussion arising from Usama's series [0], SJ's input on
>>>>> the thread around process_madvise() behaviour [1] (and a subsequent
>>>>> response by me [2]) and prior discussion about a new madvise() interface
>>>>> [3].
>>>>>
>>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
>>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
>>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
>>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
>>>>>
>>>>> ================
>>>>>
>>>>> Currently, we are rather restricted in how madvise() operations
>>>>> proceed. While effort has been put in to expanding what process_madvise()
>>>>> can do (that is - unrestricted application of advice to the local process
>>>>> alongside recent improvements on the efficiency of TLB operations over
>>>>> these batvches), we are still constrained by existing madvise() limitations
>>>>> and default behaviours.
>>>>>
>>>>> This series makes use of the currently unused flags field in
>>>>> process_madvise() to provide more flexiblity.
>>>>>
>>>>
>>>> In general, sounds like an interesting approach.
>>>
>>> Thanks!
>>>
>>> If you agree this is workable, then I'll go ahead and put some more effort
>>> into writing tests etc. on the next respin.
>>>
>>
>> So the prctl and process_madvise patches both are trying to accomplish a
>> similar end goal.
>>
>> Would it make sense to discuss what would be the best way forward before we
>> continue developing the solutions? If we are not at that stage and a clear
>> picture has not formed yet, happy to continue refining the solutions.
>> But just thought I would check.
> 
> As stated in the thread (I went out of my way to link everything above),
> and stated with you cc'd in every discussion (I think!), this idea arose as
> a concept that came out of my brainstorming whether we could better align
> this concept with madvise().
> 
> This arose because of discussions had in-thread where we agreed it made
> most sense to have this feature in effect perform a 'default madvise()'.
> 
> At this point, we are essentially _duplicating what madvise does_ in the
> prctl() with your approach, only now all of the code that does that is
> bitrotting in kernel/sys.c.
> 
> This approach was an attempt to avoid that.
> 
> It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
> was we could compare and contrast with the prctl() RFC.
> 
> Obviously this is gaining some traction now as a concept and your respin
> was a little rushed out so needs rework.
> 
> So, apologies if it seems this is an attempt to supercede your series or
> such, it truly wasn't intended that way, rather I have been working 12 hour
> days trying to find the best way possible to _make what you want happen_
> while also doing what's best for mm and madvise (among many other tasks of
> course :)

Please don't burn out spending 12 hour days on this! Your feedback is very
amazing! (Thanks again!) It will take time to come up with a solution right
for the kernel!

> 
> So the idea is we can:
> 
> a. solve long-standing problems with madvise() that prevent it being used
>    for certain things (e.g. the error on gaps which breaks
>    process_madvise() and hides real errors)
> 
> b. Also provide this functionality in a way that the absolute most minimal
>    delta from existing logic...
> 
> c. ...While keeping all this logic in mm and avoiding bitrot in
>    kernel/sys.c.
> 
> So obviously my view is that this approach is superior to the prctl() one.
> 
> However the intent was that you would probably take a little longer in
> spinning up an RFC, and then we could compare the two approaches, if people
> didn't reject my 'crazy idea' :)
> 
> Obviously you respan your (kinda ;) RFC way quicker than expected, and then
> there were a ton of bugs, which added workload and made it perhaps a little
> more pressing as to deciding which should be pursued.
> 

I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs
in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple
of days to see how things develop and send a prctl RFC (will remember the tag this
time) and we can have a better comparison of what this would look like.

> I would suggest holding off on your series until we see whether this one
> works on this basis. But obviously this is simply my point of view.
> 
> To be clear however, this was not a planned series of events...
> 
> I mean equally, we are seeing several other series from Yafang, Nico and
> Dev in relation to [m]THP and even a obliquely-related series from Barry,
> so it seems we are in contention across many planes here :)
> 
>>
>> I feel like changing process_madvise which was designed to work on an array
>> of iovec structures to have flags to skip errors and ignore the iovec
>> makes it function similar to a prctl call is not the right approach.
>> IMHO, prctl is a more direct solution to this.
> 
> I don't know what you mean by 'function similar to a prctl call', or what
> you mean by it being a more direct solution.

So I thought I was going a bit crazy, but I am glad someone else raised this as well.

If we look at the man page for prctl it says it manipulates various aspects
of the behavior of the calling thread or process.

If we look at the man page for prcocess_madvise, it was for providing advice for the
address ranges described by iovec and n

What I mean (and I assume Shakeel meant as well) is this is changing what
process_madvise was designed for into changing  the behaviour of a process (i.e. prctl).
This is what I mean by 'function similar to a prctl call'.


> 
> The problem with prctl() is there is no pattern, it's a dumping ground for
> arbitrary stuff and a prime place for bitrot. It also means we give up
> maintainership of mm-specific code.
> 
> It encourages misalignment with other interfaces and APIs.
> 
> What is being discussed here is an effort to _better align what you want
> with an existing interface_ - if we treat this as a 'default madvise()'
> plus having additional madvise functionality (apply to all, ignoring errors
> e.g.) - and put all this code _alongside existing madvise code_ - this
> makes this vastly more maintainable, futureproof and robust.
> 
> And keep in mind the proposed flags are _flags_ not default behaviours,
> ones which we will carefully restrict to known flags, starting with the
> flags you guys need.
> 
>>
>> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
>> But if in the end, we want to have a THP auto way which is truly transparent,
>> would it not be better to just have this as prctl and not change the madvise
>> structure? Maybe in a few years we wont need any of this, and it will be lost
>> in prctl and madvise wouldn't have changed for this?
> 
> This really sounds a lot like your colleague Shakeel's objection, so I
> don't want to duplicate my response too much, but as I said to him, at this
> stage we would set THP mode to 'auto', but still have to support
> MADV_[NO]HUGEPAGE.
> 
> We may then wish to set these as no-ops in auto mode right? But they'd
> still have to exist for uAPI reasons.
> 
> So would it be better to do this in mm/madvise.c alongside literally all
> other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
> would it be better to throw it in kernel/sys.c and hope people remember to
> update it?
> 
> I think it's pretty clear what the answer to that is.
> 
>>
>> Again, this is just to have a discussion (and not an aggressive argument :)),
>> and would love to get feedback from everyone in the community.
>> If its too early to have this discussion, its completely fine and we can
>> still keep developing the RFCs :)
> 
> I try hard never to be aggressive, I am firm when I feel it is appropriate
> to be so (it's important to push back when necessarily I feel), but I
> always try to maintain civility as well as I can.
> 
> Of course I am imperfect, so apologies if it comes across any other way, I
> really do try very hard to maintain a high standard of professionalism
> on-list.
> 
> Again as I said, I really did not intend to supercede or interfere with
> your series, this was just unfortunate timing and throwing out ideas to
> reach the best solution.
> 
> My objection to prctl() is due to bit-rot, mm code in inappropriate places,
> the fact it by nature lends itself to violating conventions and practices
> that exist in other mm APIs, not some subjective sense of evil.
> 
> It is historically a place where 'things that people don't really care
> about but don't quite want to NACK' go also, and to me suggests that the
> problem hasn't necessarily been thought through to see how it might be
> implemented by extending existing APIs or finding ways to achieve the same
> thing that better align with existing intefaces.
> 
> To be clear - this concept is _all_ ultimately a product of your series and
> your ideas leading the discussion within the community to this point where
> a potential alternative has arisen - without which this would not have
> occurred.
> 
> So the idea here is to simply explore what the best solution is that aligns
> with what best serves the community.
> 
>>
>> Thanks!
>> Usama
> 
> Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 18:45   ` Lorenzo Stoakes
@ 2025-05-20 19:49     ` Shakeel Butt
  2025-05-20 20:39       ` Lorenzo Stoakes
  2025-05-21  2:16       ` Liam R. Howlett
  0 siblings, 2 replies; 57+ messages in thread
From: Shakeel Butt @ 2025-05-20 19:49 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > REVIEWERS NOTES:
> > > ================
> > >
> > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > 'putting it out there' for feedback. Any serious version of this will add a
> > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > confirm everything's working.
> > >
> > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > response by me [2]) and prior discussion about a new madvise() interface
> > > [3].
> > >
> > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > >
> > > ================
> > >
> > > Currently, we are rather restricted in how madvise() operations
> > > proceed. While effort has been put in to expanding what process_madvise()
> > > can do (that is - unrestricted application of advice to the local process
> > > alongside recent improvements on the efficiency of TLB operations over
> > > these batvches), we are still constrained by existing madvise() limitations
> > > and default behaviours.
> > >
> > > This series makes use of the currently unused flags field in
> > > process_madvise() to provide more flexiblity.
> > >
> > > It introduces four flags:
> > >
> > > 1. PMADV_SKIP_ERRORS
> > >
> > > Currently, when an error arises applying advice in any individual VMA
> > > (keeping in mind that a range specified to madvise() or as part of the
> > > iovec passed to process_madvise()), the operation stops where it is and
> > > returns an error.
> > >
> > > This might not be the desired behaviour of the user, who may wish instead
> > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > is obtained.
> > >
> > > Since process_madvise() would trivially, if skipping errors, simply return
> > > the input vector size, we instead return the number of entries in the
> > > vector which completed successfully without error.
> > >
> > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > >
> > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > >
> > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > to it contains unmapped range(s), completing the full operation, but
> > > ultimately returning -ENOMEM.
> > >
> > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > immediately upon this occurring.
> > >
> > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > unmapped areas to simply be entirely ignored.
> >
> > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> 
> I thought I already explained this above:
> 
> 	"In the case of process_madvise(), this is fatal, as the operation
> 	 will stop immediately upon this occurring."
> 
> This is somewhat bizarre behaviour. You specify multiple vector entries
> spanning different ranges, but one spans some unmapped space and now the
> whole process_madvise() operation grinds to a halt, except the vector entry
> containing ranges including unmapped space is completed.
> 
> This is strange behaviour, and it makes sense to me to optionally disable
> this.
> 
> If you were looping around doing an madvise(), this would _not_ occur, you
> could filter out the -ENOMEM's. It's a really weird peculiarity in
> process_madvise().
> 
> Moreover, you might not want an error reported, that possibly duplicates
> _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> 
> Finally, if you perform an operation across the entire address space as
> proposed you may wish to stop on actual error but not on the (inevitable at
> least in 64-bit space) gaps you'll encounter.

So, we *may* wish to stop on actual error, do you have a more concrete
example? We should not add an API on a case which may be needed. We can
always add stuff later when the actual concrete use-cases come up.

> 
> >
> > >
> > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > >
> > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > address space default to having an madvise() behaviour established by
> > > default, in such a fashion as that this persists across fork/exec.
> > >
> > > Since this is a very powerful option that would make no sense for many
> > > advice modes, we explicitly only permit known-safe flags here (currently
> > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> >
> > Other flags seems general enough but this one is just weird. This is
> > exactly the scenario for prctl() like interface. You are trying to make
> > process_madvise() like prctl() and I can see process_madvise() would be
> > included in all the hate that prctl() receives.
> 
> I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> sure what a 'prctl() like interface' means here, and you're not explaining
> what you mean by that.

I meant it applies a property at the task or process level and has
examples where those properties are inherited to children.

> 
> Presumably you mean you find this odd as you feel it sits outside the realm
> of madvise() behaviour.

The persistence across exec seems weird.

> 
> But I'd suggest it does not - the idea is to align _everything_ with
> madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> are going the other way - keeping everything relating to madvise()-like
> modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> away in kernel/sys.c.

The above para seems like you are talking about code which can be moved
to mm.

> 
> So we get alignment in the fact that we're saying 'we establish a _default_
> madvise flag for a process'.

I think this is an important point. So, we want to introduce a way to
set a process level property which can be inherited through fork/exec.
With that in mind, is process_madvise() (or even madvise()) really a
good interface for it? There is no need for address ranges for such
use-case.

> 
> We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> you guys at meta want while also opening the door to doing so in future if
> it makes sense to.

Please drop the "you guys at meta". We should be aiming for what is good
for all (or most) linux users. Whatever is done here will be
incorporated in systemd which will be used very widely.

> 
> This couldn't be more different than putting some arbitrary code relating
> to mm in the 'netherrealm' of prctl().
> 
> 
> >
> > Let me ask in a different way. Eventually we want to be in a state where
> > hugepages works out of the box for all workloads. In that state what
> > would the need for this flag unless you have use-cases other than
> > hugepages. To me, there is a general consensus that prctl is a hacky
> > interface, so having some intermediate solution through prctl until
> > hugepages are good out of the box seems more reasonable.
> 
> No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> have to be supported. In a future where things are automatic, these may be
> no-ops in 'auto' mode.
> 
> But the requirement to have these flags will still exist, the requirement
> to do madvise() operations will still exist, we're simply expanding this
> functionality.
> 
> The problem arises the other way around when we shovel mm stuff in
> kernel/sys.c.

I think you mixing the location of the code and the interface which will
remain relevant long term. I don't see process_madvise (or madvise) good
interface for this specific functionality (i.e. process level property
that gets inherited through fork/exec). Now we can add a new syscall for
this but to me, particularly for hugepage, this functionality is needed
temporarily until hugepages are good out of the box. However if there is
a use-case other than hugepages then yes, why not a new syscall.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 19:42         ` Usama Arif
@ 2025-05-20 20:15           ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 20:15 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Andrew Morton, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park,
	Johannes Weiner, Shakeel Butt, Zi Yan

On Tue, May 20, 2025 at 08:42:50PM +0100, Usama Arif wrote:
>
>
> On 20/05/2025 20:21, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 07:24:04PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 20/05/2025 18:47, Lorenzo Stoakes wrote:
> >>> On Tue, May 20, 2025 at 05:28:35PM +0200, David Hildenbrand wrote:
> >>>> On 19.05.25 22:52, Lorenzo Stoakes wrote:
> >>>>> REVIEWERS NOTES:
> >>>>> ================
> >>>>>
> >>>>> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> >>>>> 'putting it out there' for feedback. Any serious version of this will add a
> >>>>> bunch of self-tests to assert correct behaviour and I will more carefully
> >>>>> confirm everything's working.
> >>>>>
> >>>>> This is based on discussion arising from Usama's series [0], SJ's input on
> >>>>> the thread around process_madvise() behaviour [1] (and a subsequent
> >>>>> response by me [2]) and prior discussion about a new madvise() interface
> >>>>> [3].
> >>>>>
> >>>>> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> >>>>> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> >>>>> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> >>>>> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> >>>>>
> >>>>> ================
> >>>>>
> >>>>> Currently, we are rather restricted in how madvise() operations
> >>>>> proceed. While effort has been put in to expanding what process_madvise()
> >>>>> can do (that is - unrestricted application of advice to the local process
> >>>>> alongside recent improvements on the efficiency of TLB operations over
> >>>>> these batvches), we are still constrained by existing madvise() limitations
> >>>>> and default behaviours.
> >>>>>
> >>>>> This series makes use of the currently unused flags field in
> >>>>> process_madvise() to provide more flexiblity.
> >>>>>
> >>>>
> >>>> In general, sounds like an interesting approach.
> >>>
> >>> Thanks!
> >>>
> >>> If you agree this is workable, then I'll go ahead and put some more effort
> >>> into writing tests etc. on the next respin.
> >>>
> >>
> >> So the prctl and process_madvise patches both are trying to accomplish a
> >> similar end goal.
> >>
> >> Would it make sense to discuss what would be the best way forward before we
> >> continue developing the solutions? If we are not at that stage and a clear
> >> picture has not formed yet, happy to continue refining the solutions.
> >> But just thought I would check.
> >
> > As stated in the thread (I went out of my way to link everything above),
> > and stated with you cc'd in every discussion (I think!), this idea arose as
> > a concept that came out of my brainstorming whether we could better align
> > this concept with madvise().
> >
> > This arose because of discussions had in-thread where we agreed it made
> > most sense to have this feature in effect perform a 'default madvise()'.
> >
> > At this point, we are essentially _duplicating what madvise does_ in the
> > prctl() with your approach, only now all of the code that does that is
> > bitrotting in kernel/sys.c.
> >
> > This approach was an attempt to avoid that.
> >
> > It started as a 'crazy idea' I was throwing out there, as an RFC. The idea
> > was we could compare and contrast with the prctl() RFC.
> >
> > Obviously this is gaining some traction now as a concept and your respin
> > was a little rushed out so needs rework.
> >
> > So, apologies if it seems this is an attempt to supercede your series or
> > such, it truly wasn't intended that way, rather I have been working 12 hour
> > days trying to find the best way possible to _make what you want happen_
> > while also doing what's best for mm and madvise (among many other tasks of
> > course :)
>
> Please don't burn out spending 12 hour days on this! Your feedback is very
> amazing! (Thanks again!) It will take time to come up with a solution right
> for the kernel!

Thanks I do try, the point is that everything is in good faith here... I'm just
trying to do what's right for the kernel.

>
> >
> > So the idea is we can:
> >
> > a. solve long-standing problems with madvise() that prevent it being used
> >    for certain things (e.g. the error on gaps which breaks
> >    process_madvise() and hides real errors)
> >
> > b. Also provide this functionality in a way that the absolute most minimal
> >    delta from existing logic...
> >
> > c. ...While keeping all this logic in mm and avoiding bitrot in
> >    kernel/sys.c.
> >
> > So obviously my view is that this approach is superior to the prctl() one.
> >
> > However the intent was that you would probably take a little longer in
> > spinning up an RFC, and then we could compare the two approaches, if people
> > didn't reject my 'crazy idea' :)
> >
> > Obviously you respan your (kinda ;) RFC way quicker than expected, and then
> > there were a ton of bugs, which added workload and made it perhaps a little
> > more pressing as to deciding which should be pursued.
> >
>
> I feel like a ton (i.e. a thousand) sounds like a lot, there are a couple of bugs
> in a series I meant to send as an RFC (mistakes happen :)). I will wait a couple
> of days to see how things develop and send a prctl RFC (will remember the tag this
> time) and we can have a better comparison of what this would look like.

Figure of speech ;) I didn't mean to disparage it, rather it felt rushed. Sorry
maybe should have more carefully phrased that. Text is a bad medium etc.

>
> > I would suggest holding off on your series until we see whether this one
> > works on this basis. But obviously this is simply my point of view.
> >
> > To be clear however, this was not a planned series of events...
> >
> > I mean equally, we are seeing several other series from Yafang, Nico and
> > Dev in relation to [m]THP and even a obliquely-related series from Barry,
> > so it seems we are in contention across many planes here :)
> >
> >>
> >> I feel like changing process_madvise which was designed to work on an array
> >> of iovec structures to have flags to skip errors and ignore the iovec
> >> makes it function similar to a prctl call is not the right approach.
> >> IMHO, prctl is a more direct solution to this.
> >
> > I don't know what you mean by 'function similar to a prctl call', or what
> > you mean by it being a more direct solution.
>
> So I thought I was going a bit crazy, but I am glad someone else raised this as well.
>
> If we look at the man page for prctl it says it manipulates various aspects
> of the behavior of the calling thread or process.
>
> If we look at the man page for prcocess_madvise, it was for providing advice for the
> address ranges described by iovec and n

And your proposal is to do precisely the same only also setting the default.

>
> What I mean (and I assume Shakeel meant as well) is this is changing what
> process_madvise was designed for into changing  the behaviour of a process (i.e. prctl).
> This is what I mean by 'function similar to a prctl call'.

Yeah I'm aware of the definition in the man page, it's really nebulous. I mean
even 'process' is unclear, do you mean task, do you mean thread leader? Do you
mean the shared virtual address space?

prctl() already does VMA-specific stuff. The overlap is enormous, and in
practice this is not a distinction that is in any real sense maintained.

prctl() is, de facto, a 'place where random stuff is put'.

And unavoidably you'll end up with mm-specific code (which now _must be
exported_) in kernel/sys.c.

At any rate, clearly there is still debate to be had here, so let's work on
both these series and see where things end up.

Obviously I'm open to feedback from ohers, but clearly I have a fairly
strong view on this :)

>
>
> >
> > The problem with prctl() is there is no pattern, it's a dumping ground for
> > arbitrary stuff and a prime place for bitrot. It also means we give up
> > maintainership of mm-specific code.
> >
> > It encourages misalignment with other interfaces and APIs.
> >
> > What is being discussed here is an effort to _better align what you want
> > with an existing interface_ - if we treat this as a 'default madvise()'
> > plus having additional madvise functionality (apply to all, ignoring errors
> > e.g.) - and put all this code _alongside existing madvise code_ - this
> > makes this vastly more maintainable, futureproof and robust.
> >
> > And keep in mind the proposed flags are _flags_ not default behaviours,
> > ones which we will carefully restrict to known flags, starting with the
> > flags you guys need.
> >
> >>
> >> I know that Lonenzo doesn't like prctl and wants to unify this in process_madvise.
> >> But if in the end, we want to have a THP auto way which is truly transparent,
> >> would it not be better to just have this as prctl and not change the madvise
> >> structure? Maybe in a few years we wont need any of this, and it will be lost
> >> in prctl and madvise wouldn't have changed for this?
> >
> > This really sounds a lot like your colleague Shakeel's objection, so I
> > don't want to duplicate my response too much, but as I said to him, at this
> > stage we would set THP mode to 'auto', but still have to support
> > MADV_[NO]HUGEPAGE.
> >
> > We may then wish to set these as no-ops in auto mode right? But they'd
> > still have to exist for uAPI reasons.
> >
> > So would it be better to do this in mm/madvise.c alongside literally all
> > other madvise() code and the existing handling for MADV_[NO]HUGEPAGE, or
> > would it be better to throw it in kernel/sys.c and hope people remember to
> > update it?
> >
> > I think it's pretty clear what the answer to that is.
> >
> >>
> >> Again, this is just to have a discussion (and not an aggressive argument :)),
> >> and would love to get feedback from everyone in the community.
> >> If its too early to have this discussion, its completely fine and we can
> >> still keep developing the RFCs :)
> >
> > I try hard never to be aggressive, I am firm when I feel it is appropriate
> > to be so (it's important to push back when necessarily I feel), but I
> > always try to maintain civility as well as I can.
> >
> > Of course I am imperfect, so apologies if it comes across any other way, I
> > really do try very hard to maintain a high standard of professionalism
> > on-list.
> >
> > Again as I said, I really did not intend to supercede or interfere with
> > your series, this was just unfortunate timing and throwing out ideas to
> > reach the best solution.
> >
> > My objection to prctl() is due to bit-rot, mm code in inappropriate places,
> > the fact it by nature lends itself to violating conventions and practices
> > that exist in other mm APIs, not some subjective sense of evil.
> >
> > It is historically a place where 'things that people don't really care
> > about but don't quite want to NACK' go also, and to me suggests that the
> > problem hasn't necessarily been thought through to see how it might be
> > implemented by extending existing APIs or finding ways to achieve the same
> > thing that better align with existing intefaces.
> >
> > To be clear - this concept is _all_ ultimately a product of your series and
> > your ideas leading the discussion within the community to this point where
> > a potential alternative has arisen - without which this would not have
> > occurred.
> >
> > So the idea here is to simply explore what the best solution is that aligns
> > with what best serves the community.
> >
> >>
> >> Thanks!
> >> Usama
> >
> > Thanks, Lorenzo
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 19:49     ` Shakeel Butt
@ 2025-05-20 20:39       ` Lorenzo Stoakes
  2025-05-20 22:02         ` Shakeel Butt
  2025-05-21  2:16       ` Liam R. Howlett
  1 sibling, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-20 20:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================
> > > >
> > > > Currently, we are rather restricted in how madvise() operations
> > > > proceed. While effort has been put in to expanding what process_madvise()
> > > > can do (that is - unrestricted application of advice to the local process
> > > > alongside recent improvements on the efficiency of TLB operations over
> > > > these batvches), we are still constrained by existing madvise() limitations
> > > > and default behaviours.
> > > >
> > > > This series makes use of the currently unused flags field in
> > > > process_madvise() to provide more flexiblity.
> > > >
> > > > It introduces four flags:
> > > >
> > > > 1. PMADV_SKIP_ERRORS
> > > >
> > > > Currently, when an error arises applying advice in any individual VMA
> > > > (keeping in mind that a range specified to madvise() or as part of the
> > > > iovec passed to process_madvise()), the operation stops where it is and
> > > > returns an error.
> > > >
> > > > This might not be the desired behaviour of the user, who may wish instead
> > > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > > is obtained.
> > > >
> > > > Since process_madvise() would trivially, if skipping errors, simply return
> > > > the input vector size, we instead return the number of entries in the
> > > > vector which completed successfully without error.
> > > >
> > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > > >
> > > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > > >
> > > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > > to it contains unmapped range(s), completing the full operation, but
> > > > ultimately returning -ENOMEM.
> > > >
> > > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > > immediately upon this occurring.
> > > >
> > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > > unmapped areas to simply be entirely ignored.
> > >
> > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> >
> > I thought I already explained this above:
> >
> > 	"In the case of process_madvise(), this is fatal, as the operation
> > 	 will stop immediately upon this occurring."
> >
> > This is somewhat bizarre behaviour. You specify multiple vector entries
> > spanning different ranges, but one spans some unmapped space and now the
> > whole process_madvise() operation grinds to a halt, except the vector entry
> > containing ranges including unmapped space is completed.
> >
> > This is strange behaviour, and it makes sense to me to optionally disable
> > this.
> >
> > If you were looping around doing an madvise(), this would _not_ occur, you
> > could filter out the -ENOMEM's. It's a really weird peculiarity in
> > process_madvise().
> >
> > Moreover, you might not want an error reported, that possibly duplicates
> > _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> >
> > Finally, if you perform an operation across the entire address space as
> > proposed you may wish to stop on actual error but not on the (inevitable at
> > least in 64-bit space) gaps you'll encounter.
>
> So, we *may* wish to stop on actual error, do you have a more concrete
> example? We should not add an API on a case which may be needed. We can
> always add stuff later when the actual concrete use-cases come up.

I feel like I just gave a concrete example?

It's useful to not have to be absolutely sure that the range specified
includes no unmapped ranges.

It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically
applying the operation to the entire address space.

But I think it might make this a lot more concrete when I write tests - as
these will 'concretise' the interface and provide examples.

We can also choose to 'hide' this from users and add it back in as you say.

Perhaps worth just waiting for a respin with tests to see this in action.

>
> >
> > >
> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> >
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
>
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.

But de-facto several prctl() interfaces do not do this, and mm interfaces
like mlockall() for instance do do this?

_In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot.

>
> >
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
>
> The persistence across exec seems weird.

OK I'm not quite sure how to quantify 'weird'?

As I argue below, the idea here is we're doing 'madvise by default'. So you
can either have prctl() invoke madvise() for some stuff, and then establish
some 'madvise by default' logic, or we can do it the other way, by doing
_as much as possible_ madvise() stuff in madvise, and add the
default-across-exec there as a highly controlled, very clear flag.

I continue to believe the latter is cleaner, more maintainable, and less
subject to bitrot.

And I would argue invoking madvise() from prctl() is similarly odd (though
pretty much everything that happens in prctl() is, by de-facto definition,
sort of odd :)

>
> >
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
>
> The above para seems like you are talking about code which can be moved
> to mm.
>
> >
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
>
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
>
> >
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
>
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.

...!

I've spent several hours doing review of Usama's series and proposing this
idea precisely to serve the community. I would ask you to please respect
that.

The point I'm making here re: you guys, is that we can both serve the
community and solve your problem - because aiming at both is _the only way
this change will get merged_.

I am doing my absolute best to try to reach this end.

Re: systemd, I'm not sure what you mean - has there been an indication that
they will using this? I'm not sure they make use of every prctl() interface
do they?

>
> >
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> >
> >
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.
> >
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> >
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> >
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
>
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec). Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box. However if there is
> a use-case other than hugepages then yes, why not a new syscall.
>

I disagree on several levels. Firstly of course 'process' is a vague term
here, you mean address space, rather mm_struct.

And really you're saying 'it's ok to associate mm_strut and madvise
specific stuff with prctl() but not ok to associate mm_struct stuff with
madvise code' which not quite compelling to me.

Overall you can view this approach as - 'we are making an madvise() flag a
default, optionally'.

And this justifies us therefore doing this via an madvise() API call.

It also further allows us to expand the capabilities of madvise() for free
- we can address long-standing issues around what is possible with these
system calls while also providing this interface. The idea is it's win/win.

Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the
same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level
feature! What's that doing in prctl() :) - where we have a bunch of mm code
living over there, and we have to export mm-internal stuff to kernel/sys.c
and it's a mess.

As to the temporary nature of this stuff, you seem to have disregarded what
I've said here rather in relation to the persistence of the
MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must
remain, and thus I don't find the argument compelling re: prctl() being
better suited. It seems more likely things will bitrot there?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 20:39       ` Lorenzo Stoakes
@ 2025-05-20 22:02         ` Shakeel Butt
  2025-05-21  4:21           ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: Shakeel Butt @ 2025-05-20 22:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 09:39:37PM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 12:49:24PM -0700, Shakeel Butt wrote:
> > On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > > REVIEWERS NOTES:
> > > > > ================
> > > > >
> > > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > > confirm everything's working.
> > > > >
> > > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > > [3].
> > > > >
> > > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > > >
> > > > > ================
> > > > >
> > > > > Currently, we are rather restricted in how madvise() operations
> > > > > proceed. While effort has been put in to expanding what process_madvise()
> > > > > can do (that is - unrestricted application of advice to the local process
> > > > > alongside recent improvements on the efficiency of TLB operations over
> > > > > these batvches), we are still constrained by existing madvise() limitations
> > > > > and default behaviours.
> > > > >
> > > > > This series makes use of the currently unused flags field in
> > > > > process_madvise() to provide more flexiblity.
> > > > >
> > > > > It introduces four flags:
> > > > >
> > > > > 1. PMADV_SKIP_ERRORS
> > > > >
> > > > > Currently, when an error arises applying advice in any individual VMA
> > > > > (keeping in mind that a range specified to madvise() or as part of the
> > > > > iovec passed to process_madvise()), the operation stops where it is and
> > > > > returns an error.
> > > > >
> > > > > This might not be the desired behaviour of the user, who may wish instead
> > > > > for the operation to be 'best effort'. By setting this flag, that behaviour
> > > > > is obtained.
> > > > >
> > > > > Since process_madvise() would trivially, if skipping errors, simply return
> > > > > the input vector size, we instead return the number of entries in the
> > > > > vector which completed successfully without error.
> > > > >
> > > > > The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> > > > >
> > > > > 2. PMADV_NO_ERROR_ON_UNMAPPED
> > > > >
> > > > > Currently madvise() has the peculiar behaviour of, if the range specified
> > > > > to it contains unmapped range(s), completing the full operation, but
> > > > > ultimately returning -ENOMEM.
> > > > >
> > > > > In the case of process_madvise(), this is fatal, as the operation will stop
> > > > > immediately upon this occurring.
> > > > >
> > > > > By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> > > > > unmapped areas to simply be entirely ignored.
> > > >
> > > > Why do we need PMADV_NO_ERROR_ON_UNMAPPED explicitly and why
> > > > PMADV_SKIP_ERRORS is not enough? I don't see a need for
> > > > PMADV_NO_ERROR_ON_UNMAPPED. Do you envision a use-case where
> > > > PMADV_NO_ERROR_ON_UNMAPPED makes more sense than PMADV_SKIP_ERRORS?
> > >
> > > I thought I already explained this above:
> > >
> > > 	"In the case of process_madvise(), this is fatal, as the operation
> > > 	 will stop immediately upon this occurring."
> > >
> > > This is somewhat bizarre behaviour. You specify multiple vector entries
> > > spanning different ranges, but one spans some unmapped space and now the
> > > whole process_madvise() operation grinds to a halt, except the vector entry
> > > containing ranges including unmapped space is completed.
> > >
> > > This is strange behaviour, and it makes sense to me to optionally disable
> > > this.
> > >
> > > If you were looping around doing an madvise(), this would _not_ occur, you
> > > could filter out the -ENOMEM's. It's a really weird peculiarity in
> > > process_madvise().
> > >
> > > Moreover, you might not want an error reported, that possibly duplicates
> > > _real_ -ENOMEM errors, when you simply encounter unmapped addresses.
> > >
> > > Finally, if you perform an operation across the entire address space as
> > > proposed you may wish to stop on actual error but not on the (inevitable at
> > > least in 64-bit space) gaps you'll encounter.
> >
> > So, we *may* wish to stop on actual error, do you have a more concrete
> > example? We should not add an API on a case which may be needed. We can
> > always add stuff later when the actual concrete use-cases come up.
> 
> I feel like I just gave a concrete example?
> 
> It's useful to not have to be absolutely sure that the range specified
> includes no unmapped ranges.
> 
> It's required for the MADV_[NO]HUGEPAGE use case proposed, specifically
> applying the operation to the entire address space.
> 
> But I think it might make this a lot more concrete when I write tests - as
> these will 'concretise' the interface and provide examples.
> 
> We can also choose to 'hide' this from users and add it back in as you say.
> 
> Perhaps worth just waiting for a respin with tests to see this in action.

Yes, that makes sense.

> 
> >
> > >
> > > >
> > > > >
> > > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > > >
> > > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > > address space default to having an madvise() behaviour established by
> > > > > default, in such a fashion as that this persists across fork/exec.
> > > > >
> > > > > Since this is a very powerful option that would make no sense for many
> > > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > > >
> > > > Other flags seems general enough but this one is just weird. This is
> > > > exactly the scenario for prctl() like interface. You are trying to make
> > > > process_madvise() like prctl() and I can see process_madvise() would be
> > > > included in all the hate that prctl() receives.
> > >
> > > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > > sure what a 'prctl() like interface' means here, and you're not explaining
> > > what you mean by that.
> >
> > I meant it applies a property at the task or process level and has
> > examples where those properties are inherited to children.
> 
> But de-facto several prctl() interfaces do not do this, and mm interfaces
> like mlockall() for instance do do this?
> 
> _In practice_ prctl() is a random hodge-podge of stuff, subject to bitrot.
> 

I don't disagree.

> >
> > >
> > > Presumably you mean you find this odd as you feel it sits outside the realm
> > > of madvise() behaviour.
> >
> > The persistence across exec seems weird.
> 
> OK I'm not quite sure how to quantify 'weird'?
> 
> As I argue below, the idea here is we're doing 'madvise by default'. So you
> can either have prctl() invoke madvise() for some stuff, and then establish
> some 'madvise by default' logic, or we can do it the other way, by doing
> _as much as possible_ madvise() stuff in madvise, and add the
> default-across-exec there as a highly controlled, very clear flag.
> 
> I continue to believe the latter is cleaner, more maintainable, and less
> subject to bitrot.
> 
> And I would argue invoking madvise() from prctl() is similarly odd (though
> pretty much everything that happens in prctl() is, by de-facto definition,
> sort of odd :)
> 

Yes, prctl() is weird as well but see my point at the end.

> >
> > >
> > > But I'd suggest it does not - the idea is to align _everything_ with
> > > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > > are going the other way - keeping everything relating to madvise()-like
> > > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > > away in kernel/sys.c.
> >
> > The above para seems like you are talking about code which can be moved
> > to mm.
> >
> > >
> > > So we get alignment in the fact that we're saying 'we establish a _default_
> > > madvise flag for a process'.
> >
> > I think this is an important point. So, we want to introduce a way to
> > set a process level property which can be inherited through fork/exec.
> > With that in mind, is process_madvise() (or even madvise()) really a
> > good interface for it? There is no need for address ranges for such
> > use-case.
> >
> > >
> > > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > > you guys at meta want while also opening the door to doing so in future if
> > > it makes sense to.
> >
> > Please drop the "you guys at meta". We should be aiming for what is good
> > for all (or most) linux users. Whatever is done here will be
> > incorporated in systemd which will be used very widely.
> 
> ...!
> 
> I've spent several hours doing review of Usama's series and proposing this
> idea precisely to serve the community. I would ask you to please respect
> that.
> 
> The point I'm making here re: you guys, is that we can both serve the
> community and solve your problem - because aiming at both is _the only way
> this change will get merged_.
> 
> I am doing my absolute best to try to reach this end.

Thanks for doing that. My point was having a robust long term solution
is preferrable over any specific company's preferred solution. So, it is
ok to push back if you feel that we are pushing a solution which will
benefit only a single or small set of users.

> 
> Re: systemd, I'm not sure what you mean - has there been an indication that
> they will using this? I'm not sure they make use of every prctl() interface
> do they?

We plan to add this in systemd which already has similar support for
ksm.

> 
> >
> > >
> > > This couldn't be more different than putting some arbitrary code relating
> > > to mm in the 'netherrealm' of prctl().
> > >
> > >
> > > >
> > > > Let me ask in a different way. Eventually we want to be in a state where
> > > > hugepages works out of the box for all workloads. In that state what
> > > > would the need for this flag unless you have use-cases other than
> > > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > > interface, so having some intermediate solution through prctl until
> > > > hugepages are good out of the box seems more reasonable.
> > >
> > > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > > have to be supported. In a future where things are automatic, these may be
> > > no-ops in 'auto' mode.
> > >
> > > But the requirement to have these flags will still exist, the requirement
> > > to do madvise() operations will still exist, we're simply expanding this
> > > functionality.
> > >
> > > The problem arises the other way around when we shovel mm stuff in
> > > kernel/sys.c.
> >
> > I think you mixing the location of the code and the interface which will
> > remain relevant long term. I don't see process_madvise (or madvise) good
> > interface for this specific functionality (i.e. process level property
> > that gets inherited through fork/exec). Now we can add a new syscall for
> > this but to me, particularly for hugepage, this functionality is needed
> > temporarily until hugepages are good out of the box. However if there is
> > a use-case other than hugepages then yes, why not a new syscall.
> >
> 
> I disagree on several levels. Firstly of course 'process' is a vague term
> here, you mean address space, rather mm_struct.
> 
> And really you're saying 'it's ok to associate mm_strut and madvise
> specific stuff with prctl() but not ok to associate mm_struct stuff with
> madvise code' which not quite compelling to me.
> 
> Overall you can view this approach as - 'we are making an madvise() flag a
> default, optionally'.
> 
> And this justifies us therefore doing this via an madvise() API call.
> 
> It also further allows us to expand the capabilities of madvise() for free
> - we can address long-standing issues around what is possible with these
> system calls while also providing this interface. The idea is it's win/win.
> 
> Otherwise we're simply doing a bunch of madvise() stuff from prctl() in the
> same way that PR_SET_VMA_ANON_NAME for instance (hey - that's a VMA-level
> feature! What's that doing in prctl() :) - where we have a bunch of mm code
> living over there, and we have to export mm-internal stuff to kernel/sys.c
> and it's a mess.
> 
> As to the temporary nature of this stuff, you seem to have disregarded what
> I've said here rather in relation to the persistence of the
> MADV_[NO]HUGEPAGE flags which are required as uAPI. They therefore must
> remain, and thus I don't find the argument compelling re: prctl() being
> better suited. It seems more likely things will bitrot there?

I think we are talking past each other (and I blame myself for that). Let
me try again. (Please keep aside prctl or process_madvise). We need a
way to change the policy of a process (job) and at the moment we are
aligned that the job loader (like systemd) will set that policy and load
the target (fork/exec), so the policy persist across fork/exec. (If
someone has a better way to set the policy without race, please let us
know).

My argument is that process_madvise() is not a good interface to set
that policy because of its address range like interface. So, if not
process_madvise() then what? Should we add a new syscall? (BTW we had
very similar discussion on process_madvise(DONTNEED) on a remote process
vs a new syscall i.e. process_mrelease()).

Adding a new syscall requires that it should be generally useful and
hopefully have more use-cases. Now going back to the current specific
use-case where we want to override the hugepage related policy of a job,
do we expect to use this override forever? I believe this is temporary
because the only reason we need this is because hugepages are not yet
ready for prime time (many applications do not handle them well). In
future when hugepages will be awesome, do we still need this "override
the hugepage policy" syscall?

Now if we can show that this specific functionality is useful more than
hugepages then I think new syscall seems like the best way forward.
However if we think this functionality is only needed temporarily then
shoving it in prctl() seems reasonable to me. If we really don't want
prctl() based solution, I would recommend to discuss the new syscall
approach and see if we can comeup with a more general solution.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-19 20:52 ` [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT " Lorenzo Stoakes
  2025-05-20  8:38   ` Pedro Falcato
@ 2025-05-20 22:26   ` Johannes Weiner
  2025-05-29 14:46     ` Lorenzo Stoakes
  1 sibling, 1 reply; 57+ messages in thread
From: Johannes Weiner @ 2025-05-20 22:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> We intentionally limit this only to flags that we know should function
> correctly without issue, and to be conservative about this, so we initially
> limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.

Hm, but do we actually expect more to show up? Looking at the current
list of advisories, we have the conventional ones:

       MADV_NORMAL
              No special treatment.  This is the default.

       MADV_RANDOM
              Expect page references in random order.  (Hence, read ahead may be less useful than normally.)

       MADV_SEQUENTIAL
              Expect page references in sequential order.  (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

       MADV_WILLNEED
              Expect access in the near future.  (Hence, it might be a good idea to read some pages ahead.)

       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)

where only RANDOM and SEQUENTIAL are actual policies. But since those
apply to file mappings only, they don't seem to make much sense for a
process-wide setting.

For Linux-specific advisories we have

	MADV_REMOVE		- action
	MADV_DONTFORK		- policy, but not sure how this could work as a process-wide thing
	MADV_HWPOISON		- action
	MADV_MERGEABLE		- policy, but we have a prctl() for process-wide settings
	MADV_SOFTOFFLINE	- action
	MADV_HUGEPAGE		- policy, but we have a prctl() for process-wide disabling
	MADV_COLLAPSE		- action
	MADV_DONTDUMP		- policy, but there is /proc/<pid>/coredump_filter for process-wide settings
	MADV_FREE		- action
	MADV_WIPEONFORK		- policy, but similar to DONTFORK, not sure how this would work process-wide
	MADV_COLD		- action
	MADV_PAGEOUT		- action
	MADV_POPULATE_READ	- action
	MADV_POPULATE_WRITE	- action
	MADV_GUARD_INSTALL	- action

So the vast majority of advisories are either one-off actions, or they
are policies that naturally only make sense for very specific ranges.

KSM and THP really seem like the notable exception[1], rather than a
rule. And we already *have* prctl() ops to modify per-process policies
for these. So I'm reluctant to agree we should drill open and expand
madvise() to cover them - especially with the goal or the expectation
that THP per-process policies shouldn't matter that much down the line.

I will admit, I don't hate prctl() as much as you do. It *is* a fairly
broad interface - setting per-process policies. So it's bound to pick
odds and ends of various subsystems that don't quite fit elsewhere.

Is it the right choice everywhere? Of course not. And can its
broadness be abused to avoid real interface design? Absolutely.

I don't think that makes it inherently bad, though. As long as we make
an honest effort to find the best home for new knobs.

But IMO, in this case it is. The inheritance-along-the-process-tree
behavior that we want here is an established pattern in prctl().

And *because* that propagation is a security-sensitive pattern
(although I don't think the THP policy specifically *is* a security
issue), to me it makes more sense to keep this behavior confined to
prctl() at least, and not add it to madvise too.

[1] Maybe we should have added sys_andrea() to cover those, as well as
    the SECCOMP prctl()s ;)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 19:49     ` Shakeel Butt
  2025-05-20 20:39       ` Lorenzo Stoakes
@ 2025-05-21  2:16       ` Liam R. Howlett
  1 sibling, 0 replies; 57+ messages in thread
From: Liam R. Howlett @ 2025-05-21  2:16 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lorenzo Stoakes, Andrew Morton, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

* Shakeel Butt <shakeel.butt@linux.dev> [250520 15:49]:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================

...

> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> > 
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
> 
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.

Okay, I don't think we should have any illusions of how clear cut either
of these things are. prctl has PR_SET_VMA, which isn't to to do with
the process.

madvise() can be called with a range covering the entire address space
and may or may not fail on gaps (depending on the call).

process_madvise() was to madvise on a process range - including all of
it, so I don't see either as the wrong place.  Or, really, the best
place.

> 
> > 
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
> 
> The persistence across exec seems weird.
> 
> > 
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
> 
> The above para seems like you are talking about code which can be moved
> to mm.
> 
> > 
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
> 
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
> 
> > 
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
> 
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.

Agreed, the aim is to provide the most flexible interface for the
majority of the users.  In my view, there isn't a clear way forward yet:
I don't want to make process_madvise() a dumping ground and I don't want
to make prctl() a bigger mess.

But right now, it seems there is more than one user arguing for this
particular implementation, but in fact there two employees at the same
company and that isn't clear in this conversation to the casual reader.
And the opinions are being cross-referenced as evidence of how others
view it.  I have no idea if you two know each other, or even if you know
that you are both at meta - but I think if I chimed in from some other
email address saying how crazy prctl() is over this other way, then
you'd probably feel compelled to say something about where I work (I
hope you would tbh)?

You could add a (meta) to your email address or signature so that it is
implicit that there may be other influences on opinion.  I'm not saying
there is an influence on opinion or that this opinion is wrong, but it
is not fully transparent and complicates the conversation.

I don't think it is unreasonable for people to point out that everyone
arguing for one solution is at the same company when it isn't obvious,
do you?

> 
> > 
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> > 
> > 
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.

(oh my, I'm going to sound like a jerk here.. sorry)

This is a terrible argument.  This intermediate solution may outlive us
all.. it may even shorten some of our lives due to stress dealing with
it.  Will the removal of the prctl() call coincide with the year of the
Linux desktop?  Or maybe the removal of the mmap lock... wait.

The hacky interface of prctl() is one of my main gripes with that
proposal, it's not a reason to go with it.

> > 
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> > 
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> > 
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
> 
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec).

Process is literally in the name and we are applying an madvise flag
across the entire range.

The inherited through fork/exec is a much stronger reason that I see for
not doing it in process_madvise().

> Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box.

I'm not holding my breath.

> However if there is
> a use-case other than hugepages then yes, why not a new syscall.
> 

We also have Barry looking at prctl() for avoiding LRU updates on exit
[1], which I guess is process related.. or page related.. really it is
mm flags related in his implementation, sort of like this feature.  But
it will probably only be used by android, so...

[1]. https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/

Thanks,
Liam


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-20 22:02         ` Shakeel Butt
@ 2025-05-21  4:21           ` Lorenzo Stoakes
  2025-05-21 16:28             ` Shakeel Butt
  2025-05-21 17:32             ` Johannes Weiner
  0 siblings, 2 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21  4:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:

[snip for clarity]

> I think we are talking past each other (and I blame myself for that). Let
> me try again. (Please keep aside prctl or process_madvise). We need a
> way to change the policy of a process (job) and at the moment we are
> aligned that the job loader (like systemd) will set that policy and load
> the target (fork/exec), so the policy persist across fork/exec. (If
> someone has a better way to set the policy without race, please let us
> know).

Ack, totally agree the kernel currently lacks a cohesive story for this
'adjust POLICY of a process and descendents', we have cgroups, but they're
more general than a process, we have namespaces, but that's for restricting
resources...

I think we all want the same thing here, ultimately.

>
> My argument is that process_madvise() is not a good interface to set
> that policy because of its address range like interface. So, if not
> process_madvise() then what? Should we add a new syscall? (BTW we had
> very similar discussion on process_madvise(DONTNEED) on a remote process
> vs a new syscall i.e. process_mrelease()).

Sure, and generally both proposed interfaces are at least _awkward_, for me
prctl() is a no-go unless we have no other choice, I won't go over my
objections to it yet again (and Liam has also raised his of course).

>
> Adding a new syscall requires that it should be generally useful and
> hopefully have more use-cases. Now going back to the current specific
> use-case where we want to override the hugepage related policy of a job,
> do we expect to use this override forever? I believe this is temporary
> because the only reason we need this is because hugepages are not yet
> ready for prime time (many applications do not handle them well). In
> future when hugepages will be awesome, do we still need this "override
> the hugepage policy" syscall?

As argued previously, I am not so sure it'll be temporary, given the
proposed future 'auto' mode will be a _mode_ and we will need to support
VM_[NO]HUGEPAGE scenarios forever (deep, deep sigh).

Also if you add it into systemd it definitely won't be right? There's no
'throwaway' here, and scouring through prctl() (what is actually documented
:), I am not sure anything ever is, frankly.

So the idea is to try to make this as generic as possible and to have it
sit with code it makes sense to sit with.

>
> Now if we can show that this specific functionality is useful more than
> hugepages then I think new syscall seems like the best way forward.
> However if we think this functionality is only needed temporarily then
> shoving it in prctl() seems reasonable to me. If we really don't want
> prctl() based solution, I would recommend to discuss the new syscall
> approach and see if we can comeup with a more general solution.
>

So, something Liam mentioned off-list was the beautifully named
'mmadvise()'. Idea being that we have a system call _explicitly for_
mm-wide modifications.

With Barry's series doing a prctl() for something similar, and a whole host
of mm->flags existing for modifying behaviour, it would seem a natural fit.

I could do a respin that does something like this instead.

What's a pity to me re: going away from process_madvise() is losing the
opportunity to be able to modify the, frankly broken, gaps handling and
also being able to do 'best effort' madvise ranges.

But I suppose those can always be separate series... :)

I guess let me work that up so we can see how that looks?

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21  4:21           ` Lorenzo Stoakes
@ 2025-05-21 16:28             ` Shakeel Butt
  2025-05-21 16:49               ` Lorenzo Stoakes
  2025-05-21 16:57               ` Usama Arif
  2025-05-21 17:32             ` Johannes Weiner
  1 sibling, 2 replies; 57+ messages in thread
From: Shakeel Butt @ 2025-05-21 16:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> 
[...]
> 
> So, something Liam mentioned off-list was the beautifully named
> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> mm-wide modifications.
> 
> With Barry's series doing a prctl() for something similar, and a whole host
> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> I could do a respin that does something like this instead.
> 

Please let's first get consensus on this before starting the work.

Usama, David, Johannes and others, WDYT?


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 16:28             ` Shakeel Butt
@ 2025-05-21 16:49               ` Lorenzo Stoakes
  2025-05-21 17:39                 ` Shakeel Butt
  2025-05-21 16:57               ` Usama Arif
  1 sibling, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 16:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 09:28:43AM -0700, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> >
> [...]
> >
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> >
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > I could do a respin that does something like this instead.
> >
>
> Please let's first get consensus on this before starting the work.

With respect Shakeel, I'll work on whatever I want, whenever I want.

These are RFC's, that is, Requests for Comment, intended to explore new
ideas and to show how they might look in practice so we can make the right
choice.

I feel like I've said this more than once now...!

>
> Usama, David, Johannes and others, WDYT?
>

But obviously it would be good to get some input from others, more so from
the actual maintainer (David) and reviewers though obviously everybody's
opinion matters.

Overall my view is prctl() is really something we should avoid here unless
we have absolutely no other choice. I've gone into detail as to why already
so won't belabour the point.

Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 16:28             ` Shakeel Butt
  2025-05-21 16:49               ` Lorenzo Stoakes
@ 2025-05-21 16:57               ` Usama Arif
  2025-05-21 17:39                 ` Lorenzo Stoakes
  1 sibling, 1 reply; 57+ messages in thread
From: Usama Arif @ 2025-05-21 16:57 UTC (permalink / raw)
  To: Shakeel Butt, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park



On 21/05/2025 17:28, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
>>
> [...]
>>
>> So, something Liam mentioned off-list was the beautifully named
>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>> mm-wide modifications.
>>
>> With Barry's series doing a prctl() for something similar, and a whole host
>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
>>
>> I could do a respin that does something like this instead.
>>
> 
> Please let's first get consensus on this before starting the work.
> 
> Usama, David, Johannes and others, WDYT?
> 

I would like that. Introducing another method might make the conversation a
lot more complex than it already is?

I have addressed the feedback from Lorenzo for the prctl series, but am
holding back sending it as RFC v4.
The v3 has a NACK on it so I would imagine it would discourage people
from reviewing it. If we are still progressing with sending patches, would it
make sense for me to wait a couple of days to see if there are any more comments
on it and send the RFC v4?




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21  4:21           ` Lorenzo Stoakes
  2025-05-21 16:28             ` Shakeel Butt
@ 2025-05-21 17:32             ` Johannes Weiner
  2025-05-21 18:11               ` Lorenzo Stoakes
                                 ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: Johannes Weiner @ 2025-05-21 17:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> So, something Liam mentioned off-list was the beautifully named
> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> mm-wide modifications.
> 
> With Barry's series doing a prctl() for something similar, and a whole host
> of mm->flags existing for modifying behaviour, it would seem a natural fit.

That's an interesting idea.

So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
might also be a good fit for the coredump stuff and ksm if we wanted
to incorporate them into that (although it would duplicate the
existing proc/prctl knobs). The other MMF_s are internal AFAICS.

I think my main concern would be making something very generic and
versatile without having sufficiently broad/popular usecases for it.

But no strong feelings either way. Like I said, I don't have a strong
dislike for prctl(), but this idea would obviously be cleaner if we
think there is enough of a demand for a new syscall.

> I guess let me work that up so we can see how that looks?

I think it's worth exploring!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 16:49               ` Lorenzo Stoakes
@ 2025-05-21 17:39                 ` Shakeel Butt
  2025-05-22 13:05                   ` David Hildenbrand
  0 siblings, 1 reply; 57+ messages in thread
From: Shakeel Butt @ 2025-05-21 17:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
[...]
> >
> > Please let's first get consensus on this before starting the work.
> 
> With respect Shakeel, I'll work on whatever I want, whenever I want.

I fail to understand why you would respond like that.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 16:57               ` Usama Arif
@ 2025-05-21 17:39                 ` Lorenzo Stoakes
  2025-05-21 18:25                   ` Usama Arif
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 17:39 UTC (permalink / raw)
  To: Usama Arif
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park

On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote:
>
>
> On 21/05/2025 17:28, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> >> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
> >>
> > [...]
> >>
> >> So, something Liam mentioned off-list was the beautifully named
> >> 'mmadvise()'. Idea being that we have a system call _explicitly for_
> >> mm-wide modifications.
> >>
> >> With Barry's series doing a prctl() for something similar, and a whole host
> >> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >>
> >> I could do a respin that does something like this instead.
> >>
> >
> > Please let's first get consensus on this before starting the work.
> >
> > Usama, David, Johannes and others, WDYT?
> >
>
> I would like that. Introducing another method might make the conversation a
> lot more complex than it already is?

The argument is about prctl() vs. another method.

Another method was proposed, arguments were made that it didn't seem
suitable, so an alternative was proposed.

I'm really not sure what complexity this adds?

And what better means of comparing approaches than actual code? Isn't an
entirely theoretical discussion less helpful?

This feels a little like dismissing my input (and I note, Liam's points
remain unanswered), and I have to admit, that is a little upsetting.

But I suppose one has to have a thick skin in the kernel.

>
> I have addressed the feedback from Lorenzo for the prctl series, but am
> holding back sending it as RFC v4.
> The v3 has a NACK on it so I would imagine it would discourage people
> from reviewing it. If we are still progressing with sending patches, would it
> make sense for me to wait a couple of days to see if there are any more comments
> on it and send the RFC v4?

I've no problem with you respinning RFC"s, as I've said more than once. The
NACK has been explained to you, it's regrettable, but necessary I feel when
explicit agreed-upon review has not been actioned (obviously I realise it
was a mistake - but this doesn't make it less important to be clear like
this).

As to stopping and having a conversation about which way forward makes most
sense - I feel like that's what I've been trying to do the whole time? I
would like to think my input is of value, it is a pity that it seems that
it apparently is not.

I am obviously happy to hear other people's input. This is what I've been
working very hard to try to establish, partly by providing _actual code_ so
we can see how things compare.

It seems generally people don't have much of a strong opinion about the
interface, other than Liam (forgive me if I am misinterpreting you Liam and
please correct if so) and myself very strongly disliking prctl() for
numerous aforementioned reasons.

I would suggest that in that case, and since we maintain madvise()
interfaces, it would make sense for us therefore to pursue an alternative
approach.

But for the absolute best means of determining a way forward I'd suggest an
RFC is best.

Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 17:32             ` Johannes Weiner
@ 2025-05-21 18:11               ` Lorenzo Stoakes
  2025-05-22 12:45               ` David Hildenbrand
  2025-05-22 15:32               ` Mike Rapoport
  2 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 18:11 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> >
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
>
> That's an interesting idea.

Thanks!

>
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
>
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.

Ack, the main argument here is to keep mm stuff together without bitrot.

The process_madvise() proposal advocated for the addition of flags that
would be useful in other circumstances such as eliminating the broken
behaviour around gaps etc which fulfilled this requirement naturally.

However it's a fair point that the fork/exec mm-scope stuff is awkwardly
placed (but equally so in prctl()).

It's an absolutely fair point as to specificity - but I would argue that
it's a _general_ thing to want to have mm-level state changes, and while
this might be specific _now_, in future the next specific thing can go
here, and the next etc.

Things that would each have been their own sort of special case in prctl()
can now live somewhere maintained by mm people, using core mm code and
avoiding bitrot.

I realise I (ugh mea culpa) missed a fairly eventful THP meeting, I think
David suggested 'mcontrol()' as a name for this? :)

In any event absolutely love to hear input from there from anybody on that
also!

>
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

I won't belabour the point by repeating the arguments in this area :)
generally I worry about seeing mm code proliferate in non-mm places.

>
> > I guess let me work that up so we can see how that looks?
>
> I think it's worth exploring!

Thanks!

If time permits and there isn't too much push back I"ll try spinning up an
RFC.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 17:39                 ` Lorenzo Stoakes
@ 2025-05-21 18:25                   ` Usama Arif
  2025-05-21 18:40                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 57+ messages in thread
From: Usama Arif @ 2025-05-21 18:25 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park



On 21/05/2025 18:39, Lorenzo Stoakes wrote:
> On Wed, May 21, 2025 at 05:57:48PM +0100, Usama Arif wrote:
>>
>>
>> On 21/05/2025 17:28, Shakeel Butt wrote:
>>> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>>>> On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
>>>>
>>> [...]
>>>>
>>>> So, something Liam mentioned off-list was the beautifully named
>>>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>>>> mm-wide modifications.
>>>>
>>>> With Barry's series doing a prctl() for something similar, and a whole host
>>>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
>>>>
>>>> I could do a respin that does something like this instead.
>>>>
>>>
>>> Please let's first get consensus on this before starting the work.
>>>
>>> Usama, David, Johannes and others, WDYT?
>>>
>>
>> I would like that. Introducing another method might make the conversation a
>> lot more complex than it already is?
> 
> The argument is about prctl() vs. another method.
> 

So there are a few things that have been on the mailing list prctl, process_madvise, 
bpf, cgroup based (although that was shot down). I meant adding another one to all
of them. But please see below.

> Another method was proposed, arguments were made that it didn't seem
> suitable, so an alternative was proposed.
> 
> I'm really not sure what complexity this adds?
> 
> And what better means of comparing approaches than actual code? Isn't an
> entirely theoretical discussion less helpful?

Sounds good!

> This feels a little like dismissing my input (and I note, Liam's points
> remain unanswered), and I have to admit, that is a little upsetting.
> 

The current prctl version is a lot lot better only because of your input.
The previous version had flags2, MMF flags, was breaking vm merge.
All of it was fixed only because of your valuable input.

So I have never dismissed your input, and try my best to reply quickly
to your and everyone elses reviews. We disagree on the best way forward
(whether prctl is appropriate) and I think disagreeing is definitely part
of good engineering. Please always assume I am acting in good faith.

As to not replying to Liams review, if its
https://lore.kernel.org/all/ass5hz6l26jc6xhbtybmgdjf55hmb3v4vvhrxhqampv6ohl67u@qi6iacwzwfk5/
I had already done it several hours ago.
If there was a email bashing prctl on how it is a bitrot and where everything
goes to die, I really dont know how to reply to that anymore.
Otherwise I might have missed it. I really hope one thing is
clear from our interactions over the past week, that I always try
and address feedback.

And for the upsetting part, from all the WTF reactions on my v2,
and all the rest of discussions we have had, I feel you have always
been upset with whatever I have sent (patches/replies), which I really
really don't understand.
We will have disagreements, its part of the dev process. The current version
wasnt going to get any acks, and was not going to get merged in any form,
so I really dont get it. Again, please always assume I am acting in good faith.

> But I suppose one has to have a thick skin in the kernel.
> 
>>
>> I have addressed the feedback from Lorenzo for the prctl series, but am
>> holding back sending it as RFC v4.
>> The v3 has a NACK on it so I would imagine it would discourage people
>> from reviewing it. If we are still progressing with sending patches, would it
>> make sense for me to wait a couple of days to see if there are any more comments
>> on it and send the RFC v4?
> 
> I've no problem with you respinning RFC"s, as I've said more than once. The
> NACK has been explained to you, it's regrettable, but necessary I feel when
> explicit agreed-upon review has not been actioned (obviously I realise it
> was a mistake - but this doesn't make it less important to be clear like
> this).
> 
> As to stopping and having a conversation about which way forward makes most
> sense - I feel like that's what I've been trying to do the whole time? I
> would like to think my input is of value, it is a pity that it seems that
> it apparently is not.
> 
> I am obviously happy to hear other people's input. This is what I've been
> working very hard to try to establish, partly by providing _actual code_ so
> we can see how things compare.
> 
> It seems generally people don't have much of a strong opinion about the
> interface, other than Liam (forgive me if I am misinterpreting you Liam and
> please correct if so) and myself very strongly disliking prctl() for
> numerous aforementioned reasons.
> 
> I would suggest that in that case, and since we maintain madvise()
> interfaces, it would make sense for us therefore to pursue an alternative
> approach.
> 
> But for the absolute best means of determining a way forward I'd suggest an
> RFC is best.

Sounds good to me.
> 
> Thanks.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 18:25                   ` Usama Arif
@ 2025-05-21 18:40                     ` Lorenzo Stoakes
  2025-05-21 18:45                       ` Usama Arif
  0 siblings, 1 reply; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-21 18:40 UTC (permalink / raw)
  To: Usama Arif
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park

I feel this will get circular. So let's not get lost in the weeds here.

Let's see what others think, and if not too much push-back I'll put out
another RFC for the mcontrol() concept and we can compare to your RFC and
use that to reach consensus if that works for you?

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 18:40                     ` Lorenzo Stoakes
@ 2025-05-21 18:45                       ` Usama Arif
  0 siblings, 0 replies; 57+ messages in thread
From: Usama Arif @ 2025-05-21 18:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park



On 21/05/2025 19:40, Lorenzo Stoakes wrote:
> I feel this will get circular. So let's not get lost in the weeds here.
> 
> Let's see what others think, and if not too much push-back I'll put out
> another RFC for the mcontrol() concept and we can compare to your RFC and
> use that to reach consensus if that works for you?
> 
> Thanks, Lorenzo

Sure sounds good, Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
                   ` (7 preceding siblings ...)
  2025-05-20 18:25 ` Shakeel Butt
@ 2025-05-22 12:12 ` Mike Rapoport
  8 siblings, 0 replies; 57+ messages in thread
From: Mike Rapoport @ 2025-05-22 12:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif,
	linux-api

(cc'ing linux-api)

On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
> 
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
> 
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
> 
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> 
> ================
> 
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batvches), we are still constrained by existing madvise() limitations
> and default behaviours.
> 
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexiblity.
> 
> It introduces four flags:
> 
> 1. PMADV_SKIP_ERRORS
> 
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise()), the operation stops where it is and
> returns an error.
> 
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
> 
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.
> 
> The PMADV_SKIP_ERRORS flag implicitly implies PMADV_NO_ERROR_ON_UNMAPPED.
> 
> 2. PMADV_NO_ERROR_ON_UNMAPPED
> 
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
> 
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
> 
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.
> 
> 3. PMADV_SET_FORK_EXEC_DEFAULT
> 
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space default to having an madvise() behaviour established by
> default, in such a fashion as that this persists across fork/exec.
> 
> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> 
> 4. PMADV_ENTIRE_ADDRESS_SPACE
> 
> It can be annoying, should a user wish to apply madvise() to all VMAs in an
> address space, to have to add a singular large entry to the input iovec.
> 
> So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> we expect the user to pass NULL and -1 to the vec and vlen parameters
> respectively so they explicitly acknowledge that these will be ignored,
> e.g.:
> 
> 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
> 
> Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may
> well be the case that incompatible VMAs will be encountered that ought to
> be skipped.
> 
> If this is not set, the PMADV_NO_ERROR_ON_UNMAPPED (which was otherwise
> implicitly implied by PMADV_SKIP_ERRORS) ought to be set as of course, the
> entire address space spans at least some gaps.
> 
> Lorenzo Stoakes (5):
>   mm: madvise: refactor madvise_populate()
>   mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
>   mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
>   mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
>   mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag
> 
>  include/uapi/asm-generic/mman-common.h |   6 +
>  mm/madvise.c                           | 206 +++++++++++++++++++------
>  2 files changed, 168 insertions(+), 44 deletions(-)
> 
> --
> 2.49.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/5] mm: madvise: refactor madvise_populate()
  2025-05-20 10:42       ` David Hildenbrand
@ 2025-05-22 12:32         ` Mike Rapoport
  0 siblings, 0 replies; 57+ messages in thread
From: Mike Rapoport @ 2025-05-22 12:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 12:42:33PM +0200, David Hildenbrand wrote:
> On 20.05.25 12:36, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 12:30:24PM +0200, David Hildenbrand wrote:
> > > On 19.05.25 22:52, Lorenzo Stoakes wrote:
> > > > Use a for-loop rather than a while with the update of the start argument at
> > > > the end of the while-loop.
> > > > 
> > > > This is in preparation for a subsequent commit which modifies this
> > > > function, we therefore separate the refactoring from the actual change
> > > > cleanly by separating the two.
> > > > 
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > ---
> > > >    mm/madvise.c | 39 ++++++++++++++++++++-------------------
> > > >    1 file changed, 20 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 8433ac9b27e0..63cc69daa4c7 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -967,32 +967,33 @@ static long madvise_populate(struct mm_struct *mm, unsigned long start,
> > > >    	int locked = 1;
> > > >    	long pages;
> > > > -	while (start < end) {
> > > > +	for (; start < end; start += pages * PAGE_SIZE) {
> > > >    		/* Populate (prefault) page tables readable/writable. */
> > > >    		pages = faultin_page_range(mm, start, end, write, &locked);
> > > >    		if (!locked) {
> > > >    			mmap_read_lock(mm);
> > > >    			locked = 1;
> > > >    		}
> > > > -		if (pages < 0) {
> > > > -			switch (pages) {
> > > > -			case -EINTR:
> > > > -				return -EINTR;
> > > > -			case -EINVAL: /* Incompatible mappings / permissions. */
> > > > -				return -EINVAL;
> > > > -			case -EHWPOISON:
> > > > -				return -EHWPOISON;
> > > > -			case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> > > > -				return -EFAULT;
> > > > -			default:
> > > > -				pr_warn_once("%s: unhandled return value: %ld\n",
> > > > -					     __func__, pages);
> > > > -				fallthrough;
> > > > -			case -ENOMEM: /* No VMA or out of memory. */
> > > > -				return -ENOMEM;
> > > > -			}
> > > > +
> > > > +		if (pages >= 0)
> > > > +			continue;
> > > > +
> > > > +		switch (pages) {
> > > > +		case -EINTR:
> > > > +			return -EINTR;
> > > > +		case -EINVAL: /* Incompatible mappings / permissions. */
> > > > +			return -EINVAL;
> > > > +		case -EHWPOISON:
> > > > +			return -EHWPOISON;
> > > > +		case -EFAULT: /* VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV */
> > > > +			return -EFAULT;
> > > > +		default:
> > > > +			pr_warn_once("%s: unhandled return value: %ld\n",
> > > > +				     __func__, pages);
> > > > +			fallthrough;
> > > > +		case -ENOMEM: /* No VMA or out of memory. */
> > > > +			return -ENOMEM;
> > > 
> > > Can we limit it to what the patch description says? "Use a for-loop rather
> > > than a while", or will that be a problem for the follow-up patch?
> > 
> > Well, kind of the point is that we can remove a level of indentation also, which
> > then makes life easier in subsequent patch.
> > 
> > Happy to change description or break into two (but that seems a bit over the top
> > maybe? :>)
> 
> Probably just mention it, otherwise it looks a bit like unrelated churn :)

And for refactoring patches it's always useful to mention "no functional
change" ;-)

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 17:32             ` Johannes Weiner
  2025-05-21 18:11               ` Lorenzo Stoakes
@ 2025-05-22 12:45               ` David Hildenbrand
  2025-05-22 13:49                 ` Lorenzo Stoakes
  2025-05-22 15:32               ` Mike Rapoport
  2 siblings, 1 reply; 57+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:45 UTC (permalink / raw)
  To: Johannes Weiner, Lorenzo Stoakes
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On 21.05.25 19:32, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
>> So, something Liam mentioned off-list was the beautifully named
>> 'mmadvise()'. Idea being that we have a system call _explicitly for_
>> mm-wide modifications.

As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like 
that might be better.

... or anything else that doesn't (ab)use the "advise" terminology in an 
interface that will not only consume advises.

>>
>> With Barry's series doing a prctl() for something similar, and a whole host
>> of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> That's an interesting idea.
> 
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> 
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.
> 
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

Same here. I am not 100% sure process_madvise() is really the right 
thing to extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also 
not a big fan of prctl() ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 17:39                 ` Shakeel Butt
@ 2025-05-22 13:05                   ` David Hildenbrand
  2025-05-22 13:21                     ` Lorenzo Stoakes
  2025-05-22 20:53                     ` Shakeel Butt
  0 siblings, 2 replies; 57+ messages in thread
From: David Hildenbrand @ 2025-05-22 13:05 UTC (permalink / raw)
  To: Shakeel Butt, Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On 21.05.25 19:39, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> [...]
>>>
>>> Please let's first get consensus on this before starting the work.
>>
>> With respect Shakeel, I'll work on whatever I want, whenever I want.
> 
> I fail to understand why you would respond like that.

Relax guys ... :) Really nothing to be fighting about.

Lorenzo has a lot of energy to play with things, to see how it would 
look. I wish I would have that much energy, but I have no idea where it 
went ... (well, okay, I have a suspicion) :P

At the same time, I hope (and assume :) ) that Lorenzo will get Usama 
involved in the development once we know what we want.

To summarize my current view:

1) ebpf: most people are are not a fan of that, and I agree, at least
    for this purpose. If we were talking about making better *placement*
    decisions using epbf, it would be a different story.

2) prctl(): the unloved child, and I can understand why. Maybe now is
    the right time to stop adding new MM things that feel weird in there.
    Maybe we should already have done that with the KSM toggle (guess who
    was involved in that ;) ).

3) process_madvise(): I think it's an interesting extension, but
    probably we should just have something that applies to the whole
    address space naturally. At least my take for now.

4) new syscall: worth exploring how it would look. I'm especially
    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
    make them only apply to selected controls.

An API prototype of 4), not necessarily with the code yet, might be 
valuable.

In general, the "always/madvise/never" policies are really horrible. We 
should instead be prioritizing who gets THPs -- and only disable them 
for selected workloads.

Because splitting THPs up because a process is not allowed to use them, 
thereby increasing memory fragmentation, is really absolutely suboptimal.

But we don't have anything better right now.

So I would hope that we can at least turn the "always/VM_HUGEPAGE" into 
a "prioritize for largest (m)THPs possible" in a distant future.

If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize 
for THPs" couldn't break userfaultfd ... :( But maybe that can be worked 
around in the future somehow (e.g., when we detect userfaultfd usage, 
not sure, ...).

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-22 13:05                   ` David Hildenbrand
@ 2025-05-22 13:21                     ` Lorenzo Stoakes
  2025-05-22 20:53                     ` Shakeel Butt
  1 sibling, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-22 13:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Shakeel Butt, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

TL;DR - action item on below is I'll put together a proposed API (without
code) and cc people here when I've done so, so we can take a look at how
mctl() or mmadvise() or whatever we call it might look :)

On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:39, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> > [...]
> > > >
> > > > Please let's first get consensus on this before starting the work.
> > >
> > > With respect Shakeel, I'll work on whatever I want, whenever I want.
> >
> > I fail to understand why you would respond like that.
>
> Relax guys ... :) Really nothing to be fighting about.

Agreed...!

>
> Lorenzo has a lot of energy to play with things, to see how it would look. I
> wish I would have that much energy, but I have no idea where it went ...
> (well, okay, I have a suspicion) :P

We have cats rather than kids which might explain a bit ;)

>
> At the same time, I hope (and assume :) ) that Lorenzo will get Usama
> involved in the development once we know what we want.
>
>
> To summarize my current view:
>
> 1) ebpf: most people are are not a fan of that, and I agree, at least
>    for this purpose. If we were talking about making better *placement*
>    decisions using epbf, it would be a different story.

Yeah, I think overall we have a situation that is _bad_ in terms of
interface. We need something more fine-grained, but it's chicken and egg, and
there are genuine needs users have _now_.

So the whole discussion is about this.

>
> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>    the right time to stop adding new MM things that feel weird in there.
>    Maybe we should already have done that with the KSM toggle (guess who
>    was involved in that ;) ).

I won't belabour this point, at this point I might get a reputation as
prctl()'s biggest hater otherwise :P

But one thing I will say is - systemd makes these things permanent (hey
that KSM thing that breaks VMA merging is literally an option in systemd,
wasn't aware :)

>
> 3) process_madvise(): I think it's an interesting extension, but
>    probably we should just have something that applies to the whole
>    address space naturally. At least my take for now.

Yeah that's the point of view I've come to, I mean the point was to try to
make this more generic in a way that _also_ got us improved control over
madvise() - sort of win/win.

But the 'default the process' thing is, as Shakeel and Liam rightly say,
just really out of band or doesn't quite fit this interface.

I may still put forward a patch to add flags for e.g. not breaking on gaps
but as a separate thing I think, I still think that'd be valuable (but I'll
provide solid at least self tests to make the point).

>
> 4) new syscall: worth exploring how it would look. I'm especially
>    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>    make them only apply to selected controls.

Yeah, this is exactly what I want to play with.

>
>
> An API prototype of 4), not necessarily with the code yet, might be
> valuable.

ACK, though I really find it valuable to code things up because so often
you figure out what works by trying to make it work in practice.

This is how guard regions happened for instance, we had a ton of
conversation like this, loads of back and forth, nobody quite knew, then I
wrote some prototype code and it became apparent that this thing was
doable.

I never intend the RFC to be _the work_ rather it's a 'proof of concept'
for discussion.

However, as we're still fairly vague on the API bit, I think in this case
it'll be valuable to do exactly what you suggest and simply prototype an
API around this without code.

So I'll do that and come up with something as a separate mail, cc'ing
people here.

>
> In general, the "always/madvise/never" policies are really horrible. We
> should instead be prioritizing who gets THPs -- and only disable them for
> selected workloads.

I couldn't agree more.

>
> Because splitting THPs up because a process is not allowed to use them,
> thereby increasing memory fragmentation, is really absolutely suboptimal.

Yes, there's a disconnect here between - a global resource (-ish :P) - and
process requirements.

>
> But we don't have anything better right now.

I feel like all this turmoil brings us closer to longer term solutions, if
perhaps via pain-inspired development (a new programming philosophy I
intend to trademark ;)

>
> So I would hope that we can at least turn the "always/VM_HUGEPAGE" into a
> "prioritize for largest (m)THPs possible" in a distant future.

I suspect we might still require some legacy settings so people don't
panic. Aren't uAPIs fun?

>
> If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize for
> THPs" couldn't break userfaultfd ... :( But maybe that can be worked around
> in the future somehow (e.g., when we detect userfaultfd usage, not sure,
> ...).

I hate how uffd is implemented (I like the concept of what it provides
though!) on multiple levels. It's crept into so much and the idea it's
putting restrictions on core stuff is just horrid.

I do feel though we may want to introduce something new for this though, as
'never' or 'no' suddenly not being no but 'deprioritise' could be pretty
concerning for people.

But on the other hand, this is a resource for the kernel to determine how
to manage as it sees fit so, perhaps we shouldn't care...

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-22 12:45               ` David Hildenbrand
@ 2025-05-22 13:49                 ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-22 13:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Johannes Weiner, Shakeel Butt, Andrew Morton, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Thu, May 22, 2025 at 02:45:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:32, Johannes Weiner wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > > So, something Liam mentioned off-list was the beautifully named
> > > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > > mm-wide modifications.
>
> As stated elsewhere (e.g., THP cabal yesterday): mctrl() or sth like that
> might be better.
>
> ... or anything else that doesn't (ab)use the "advise" terminology in an
> interface that will not only consume advises.

Ack, as per my other reply, will work up some pseudocode/API exploration (not
yet code) and mail it round to participants here so we can explore this idea.

>
> > >
> > > With Barry's series doing a prctl() for something similar, and a whole host
> > > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > That's an interesting idea.
> >
> > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> > might also be a good fit for the coredump stuff and ksm if we wanted
> > to incorporate them into that (although it would duplicate the
> > existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> >
> > I think my main concern would be making something very generic and
> > versatile without having sufficiently broad/popular usecases for it.
> >
> > But no strong feelings either way. Like I said, I don't have a strong
> > dislike for prctl(), but this idea would obviously be cleaner if we
> > think there is enough of a demand for a new syscall.
>
> Same here. I am not 100% sure process_madvise() is really the right thing to
> extend, but I do enjoy the SET_DEFAULT_EXEC option. I am also not a big fan
> of prctl() ...

Yeah agreed on both actually! :)

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-21 17:32             ` Johannes Weiner
  2025-05-21 18:11               ` Lorenzo Stoakes
  2025-05-22 12:45               ` David Hildenbrand
@ 2025-05-22 15:32               ` Mike Rapoport
  2025-05-22 15:47                 ` Lorenzo Stoakes
  2 siblings, 1 reply; 57+ messages in thread
From: Mike Rapoport @ 2025-05-22 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Lorenzo Stoakes, Shakeel Butt, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, linux-mm, linux-arch, linux-kernel,
	SeongJae Park, Usama Arif

On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > So, something Liam mentioned off-list was the beautifully named
> > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > mm-wide modifications.
> > 
> > With Barry's series doing a prctl() for something similar, and a whole host
> > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> 
> That's an interesting idea.
> 
> So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> might also be a good fit for the coredump stuff and ksm if we wanted
> to incorporate them into that (although it would duplicate the
> existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> 
> I think my main concern would be making something very generic and
> versatile without having sufficiently broad/popular usecases for it.
> 
> But no strong feelings either way. Like I said, I don't have a strong
> dislike for prctl(), but this idea would obviously be cleaner if we
> think there is enough of a demand for a new syscall.

To me it seems like having a "global mm control" system call makes much
more sense that adding more arms to prctl or overloading process_madvise().

With a dedicated syscall it's much clearer that the operation targets an mm
and it works for the entire mm.
And two usescase seem enough to me to justify a new syscall.

And it's easier to reason about a dedicated syscall designed for a certain
operation that for multiplexed ioctl() style controls.

> > I guess let me work that up so we can see how that looks?
> 
> I think it's worth exploring!
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-22 15:32               ` Mike Rapoport
@ 2025-05-22 15:47                 ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-22 15:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Johannes Weiner, Shakeel Butt, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, linux-mm, linux-arch, linux-kernel,
	SeongJae Park, Usama Arif

On Thu, May 22, 2025 at 06:32:11PM +0300, Mike Rapoport wrote:
> On Wed, May 21, 2025 at 01:32:00PM -0400, Johannes Weiner wrote:
> > On Wed, May 21, 2025 at 05:21:19AM +0100, Lorenzo Stoakes wrote:
> > > So, something Liam mentioned off-list was the beautifully named
> > > 'mmadvise()'. Idea being that we have a system call _explicitly for_
> > > mm-wide modifications.
> > >
> > > With Barry's series doing a prctl() for something similar, and a whole host
> > > of mm->flags existing for modifying behaviour, it would seem a natural fit.
> >
> > That's an interesting idea.
> >
> > So we'd have THP policies and Barry's FADE_ON_DEATH to start; and it
> > might also be a good fit for the coredump stuff and ksm if we wanted
> > to incorporate them into that (although it would duplicate the
> > existing proc/prctl knobs). The other MMF_s are internal AFAICS.
> >
> > I think my main concern would be making something very generic and
> > versatile without having sufficiently broad/popular usecases for it.
> >
> > But no strong feelings either way. Like I said, I don't have a strong
> > dislike for prctl(), but this idea would obviously be cleaner if we
> > think there is enough of a demand for a new syscall.
>
> To me it seems like having a "global mm control" system call makes much
> more sense that adding more arms to prctl or overloading process_madvise().

Agreed yeah.

>
> With a dedicated syscall it's much clearer that the operation targets an mm
> and it works for the entire mm.
> And two usescase seem enough to me to justify a new syscall.

Yes I think so!

>
> And it's easier to reason about a dedicated syscall designed for a certain
> operation that for multiplexed ioctl() style controls.

Yes!

>
> > > I guess let me work that up so we can see how that looks?
> >
> > I think it's worth exploring!
> >
>
> --
> Sincerely yours,
> Mike.

Thanks, I will distribute a proposed mctl() prototype API to people here (+ cc
you + linux-api also) so we can use that as a basis to assess this approach.

Thanks!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-22 13:05                   ` David Hildenbrand
  2025-05-22 13:21                     ` Lorenzo Stoakes
@ 2025-05-22 20:53                     ` Shakeel Butt
  2025-05-26 12:57                       ` David Hildenbrand
  1 sibling, 1 reply; 57+ messages in thread
From: Shakeel Butt @ 2025-05-22 20:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:39, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> > [...]
> > > > 
> > > > Please let's first get consensus on this before starting the work.
> > > 
> > > With respect Shakeel, I'll work on whatever I want, whenever I want.
> > 
> > I fail to understand why you would respond like that.
> 
> Relax guys ... :) Really nothing to be fighting about.

Agreed.

[...]

> 
> 
> To summarize my current view:
> 
> 1) ebpf: most people are are not a fan of that, and I agree, at least
>    for this purpose. If we were talking about making better *placement*
>    decisions using epbf, it would be a different story.

From placement decisions, do you mean placement between memory
tiers/nodes or something else?

> 
> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>    the right time to stop adding new MM things that feel weird in there.
>    Maybe we should already have done that with the KSM toggle (guess who
>    was involved in that ;) ).

At the moment systemd is the user I know of and I think it would very
easy to migrate it to whatever new thing we decide here.

> 
> 3) process_madvise(): I think it's an interesting extension, but
>    probably we should just have something that applies to the whole
>    address space naturally. At least my take for now.
> 
> 4) new syscall: worth exploring how it would look. I'm especially
>    interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>    make them only apply to selected controls.

Were there any previous discussion on SET_DEFAULT_EXEC? First time I am
hearing about it.

Overall I agree with your assessment and thus I was requesting to at
least discuss the new syscall option as well.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
  2025-05-22 20:53                     ` Shakeel Butt
@ 2025-05-26 12:57                       ` David Hildenbrand
  0 siblings, 0 replies; 57+ messages in thread
From: David Hildenbrand @ 2025-05-26 12:57 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, linux-mm, linux-arch,
	linux-kernel, SeongJae Park, Usama Arif

>>
>> To summarize my current view:
>>
>> 1) ebpf: most people are are not a fan of that, and I agree, at least
>>     for this purpose. If we were talking about making better *placement*
>>     decisions using epbf, it would be a different story.
> 
>  From placement decisions, do you mean placement between memory
> tiers/nodes or something else?

More like: which size to place, but it could be extended to other 
policies, maybe.

Assume we have a page fault and have to decide which size to place.

For a process that we really want to use THPs (VM_HUEPAGE?), we could 
use the largest free folio possible.

For a process that we don't want to spend valuable THPs on (VM_HUEPAGE 
not set?), we could use the smallest free folio possible.

Such a possibly might be encoded in an ebpf program I assume.

The hints (prioritize regions/processes, deprioritize 
regions/processes), such as VM_HUGEPAGE, inputs into such a program.
> 
>>
>> 2) prctl(): the unloved child, and I can understand why. Maybe now is
>>     the right time to stop adding new MM things that feel weird in there.
>>     Maybe we should already have done that with the KSM toggle (guess who
>>     was involved in that ;) ).
> 
> At the moment systemd is the user I know of and I think it would very
> easy to migrate it to whatever new thing we decide here.

Agreed.

> 
>>
>> 3) process_madvise(): I think it's an interesting extension, but
>>     probably we should just have something that applies to the whole
>>     address space naturally. At least my take for now.
>>
>> 4) new syscall: worth exploring how it would look. I'm especially
>>     interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
>>     make them only apply to selected controls.
> 
> Were there any previous discussion on SET_DEFAULT_EXEC? First time I am
> hearing about it.

I think it evolved in the discussion here from PMADV_SET_FORK_EXEC_DEFAULT.

> 
> Overall I agree with your assessment and thus I was requesting to at
> least discuss the new syscall option as well.

Yes.

I am still not sure if having a new "process" [1] mode would be a 
reasonable alternative to setting the VM_HUGEPAGE/VM_NOHUGEPAGE default. 
Assuming we would have a "process" mode, we could (a) set the policy 
per-process using the new syscall we discuss here, and options to (B) 
set the policy to use for the exec child and (c) maybe an option to seal 
the policy (depending on who is allowed to set the policy in the first 
place).

On the + side, we don't lose hints/instructions from the app 
(VM_HUGEPAGE/VM_NOHUGEPAGE) when changing the policy on an already 
running process.

The problem I see with the "process" policy is that people might want 
different "default" policies for processes, which means that we will 
have to add yet another toggle.

How I hate THP toggles. :)

[1] 
https://lore.kernel.org/all/CALOAHbB-KQ4+z-Lupv7RcxArfjX7qtWcrboMDdT4LdpoTXOMyw@mail.gmail.com/

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
  2025-05-20 22:26   ` Johannes Weiner
@ 2025-05-29 14:46     ` Lorenzo Stoakes
  0 siblings, 0 replies; 57+ messages in thread
From: Lorenzo Stoakes @ 2025-05-29 14:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	linux-mm, linux-arch, linux-kernel, SeongJae Park, Usama Arif

On Tue, May 20, 2025 at 06:26:09PM -0400, Johannes Weiner wrote:
> On Mon, May 19, 2025 at 09:52:41PM +0100, Lorenzo Stoakes wrote:
> > We intentionally limit this only to flags that we know should function
> > correctly without issue, and to be conservative about this, so we initially
> > limit ourselves only to MADV_HUGEPAGE, MADV_NOHUGEPAGE, that is - setting
> > the VM_HUGEPAGE, VM_NOHUGEPAGE VMA flags.
>
> Hm, but do we actually expect more to show up? Looking at the current
> list of advisories, we have the conventional ones:
>
>        MADV_NORMAL
>               No special treatment.  This is the default.
>
>        MADV_RANDOM
>               Expect page references in random order.  (Hence, read ahead may be less useful than normally.)
>
>        MADV_SEQUENTIAL
>               Expect page references in sequential order.  (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
>
>        MADV_WILLNEED
>               Expect access in the near future.  (Hence, it might be a good idea to read some pages ahead.)
>
>        MADV_DONTNEED
>               Do not expect access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
>
> where only RANDOM and SEQUENTIAL are actual policies. But since those
> apply to file mappings only, they don't seem to make much sense for a
> process-wide setting.
>
> For Linux-specific advisories we have
>
> 	MADV_REMOVE		- action
> 	MADV_DONTFORK		- policy, but not sure how this could work as a process-wide thing
> 	MADV_HWPOISON		- action
> 	MADV_MERGEABLE		- policy, but we have a prctl() for process-wide settings
> 	MADV_SOFTOFFLINE	- action
> 	MADV_HUGEPAGE		- policy, but we have a prctl() for process-wide disabling
> 	MADV_COLLAPSE		- action
> 	MADV_DONTDUMP		- policy, but there is /proc/<pid>/coredump_filter for process-wide settings
> 	MADV_FREE		- action
> 	MADV_WIPEONFORK		- policy, but similar to DONTFORK, not sure how this would work process-wide
> 	MADV_COLD		- action
> 	MADV_PAGEOUT		- action
> 	MADV_POPULATE_READ	- action
> 	MADV_POPULATE_WRITE	- action
> 	MADV_GUARD_INSTALL	- action
>
> So the vast majority of advisories are either one-off actions, or they
> are policies that naturally only make sense for very specific ranges.
>
> KSM and THP really seem like the notable exception[1], rather than a
> rule. And we already *have* prctl() ops to modify per-process policies
> for these. So I'm reluctant to agree we should drill open and expand
> madvise() to cover them - especially with the goal or the expectation
> that THP per-process policies shouldn't matter that much down the line.
>
> I will admit, I don't hate prctl() as much as you do. It *is* a fairly
> broad interface - setting per-process policies. So it's bound to pick
> odds and ends of various subsystems that don't quite fit elsewhere.
>
> Is it the right choice everywhere? Of course not. And can its
> broadness be abused to avoid real interface design? Absolutely.
>
> I don't think that makes it inherently bad, though. As long as we make
> an honest effort to find the best home for new knobs.
>
> But IMO, in this case it is. The inheritance-along-the-process-tree
> behavior that we want here is an established pattern in prctl().
>
> And *because* that propagation is a security-sensitive pattern
> (although I don't think the THP policy specifically *is* a security
> issue), to me it makes more sense to keep this behavior confined to
> prctl() at least, and not add it to madvise too.
>
> [1] Maybe we should have added sys_andrea() to cover those, as well as
>     the SECCOMP prctl()s ;)

So sorry to have missed you really excellent and well thought-out reply
here Johannes (grr mail etc.).

You make a really good point, and I think this does overall, sadly, suggest
the concept of a 'general' madvise() call while, in a sense, 'neat',
doesn't quite fit.

I do think it aligns well with a dedicated mctl() API call, have sent out
an RFC discussion thread about this so we can determine if this makes
sense.

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2025-05-29 14:46 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-19 20:52 [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Lorenzo Stoakes
2025-05-19 20:52 ` [RFC PATCH 1/5] mm: madvise: refactor madvise_populate() Lorenzo Stoakes
2025-05-20 10:30   ` David Hildenbrand
2025-05-20 10:36     ` Lorenzo Stoakes
2025-05-20 10:42       ` David Hildenbrand
2025-05-22 12:32         ` Mike Rapoport
2025-05-19 20:52 ` [RFC PATCH 2/5] mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag Lorenzo Stoakes
2025-05-19 20:52 ` [RFC PATCH 3/5] mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED " Lorenzo Stoakes
2025-05-19 20:52 ` [RFC PATCH 4/5] mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT " Lorenzo Stoakes
2025-05-20  8:38   ` Pedro Falcato
2025-05-20 10:21     ` Lorenzo Stoakes
2025-05-20 11:41       ` Pedro Falcato
2025-05-20 13:39         ` Lorenzo Stoakes
2025-05-20 16:11     ` Jann Horn
2025-05-20 16:19       ` Lorenzo Stoakes
2025-05-20 16:35         ` David Hildenbrand
2025-05-20 22:26   ` Johannes Weiner
2025-05-29 14:46     ` Lorenzo Stoakes
2025-05-19 20:52 ` [RFC PATCH 5/5] mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE " Lorenzo Stoakes
2025-05-19 21:53 ` [RFC PATCH 0/5] add process_madvise() flags to modify behaviour Jann Horn
2025-05-20  5:35   ` Lorenzo Stoakes
2025-05-20 16:04     ` Jann Horn
2025-05-20 16:14       ` Lorenzo Stoakes
2025-05-20 15:28 ` David Hildenbrand
2025-05-20 17:47   ` Lorenzo Stoakes
2025-05-20 18:24     ` Usama Arif
2025-05-20 19:21       ` Lorenzo Stoakes
2025-05-20 19:42         ` Usama Arif
2025-05-20 20:15           ` Lorenzo Stoakes
2025-05-20 18:25     ` Lorenzo Stoakes
2025-05-20 18:39       ` David Hildenbrand
2025-05-20 18:25 ` Shakeel Butt
2025-05-20 18:45   ` Lorenzo Stoakes
2025-05-20 19:49     ` Shakeel Butt
2025-05-20 20:39       ` Lorenzo Stoakes
2025-05-20 22:02         ` Shakeel Butt
2025-05-21  4:21           ` Lorenzo Stoakes
2025-05-21 16:28             ` Shakeel Butt
2025-05-21 16:49               ` Lorenzo Stoakes
2025-05-21 17:39                 ` Shakeel Butt
2025-05-22 13:05                   ` David Hildenbrand
2025-05-22 13:21                     ` Lorenzo Stoakes
2025-05-22 20:53                     ` Shakeel Butt
2025-05-26 12:57                       ` David Hildenbrand
2025-05-21 16:57               ` Usama Arif
2025-05-21 17:39                 ` Lorenzo Stoakes
2025-05-21 18:25                   ` Usama Arif
2025-05-21 18:40                     ` Lorenzo Stoakes
2025-05-21 18:45                       ` Usama Arif
2025-05-21 17:32             ` Johannes Weiner
2025-05-21 18:11               ` Lorenzo Stoakes
2025-05-22 12:45               ` David Hildenbrand
2025-05-22 13:49                 ` Lorenzo Stoakes
2025-05-22 15:32               ` Mike Rapoport
2025-05-22 15:47                 ` Lorenzo Stoakes
2025-05-21  2:16       ` Liam R. Howlett
2025-05-22 12:12 ` Mike Rapoport

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).