* [PATCH v2 0/5] madvise cleanup
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
This is a series of patches that addresses a number of long-standing
problems in the madvise() implementation:
* Eliminate the visitor pattern and have both the anon_vma_name
  implementation and ordinary madvise() operations use the same
  madvise_vma_behavior() implementation.
* Thread state through the madvise_behavior state object - this object,
  very usefully introduced by SJ, is already used to transmit state
  through operations. This series extends this by having all madvise()
  operations use it, including anon_vma_name (a sketch of the resulting
  state object follows this list).
* Thread range and VMA state through madvise_behavior - this helps avoid
  a lot of the confusing code around range and VMA state and again keeps
  things consistent, with a single 'source of truth'.
* Address the very strange behaviour around the struct vm_area_struct
  **prev pointer passed around - all read-only users do absolutely
  nothing with the prev pointer. The only function that uses it is
  madvise_update_vma(), and in all cases prev is always reset to the
  VMA.
  Fix this by no longer having anything but madvise_update_vma()
  reference prev, and having madvise_walk_vmas() update prev in each
  instance. Additionally, make it clear that the meaningful change in
  VMA state is when madvise_update_vma() potentially merges a VMA, so
  explicitly retrieve the VMA in this case.
* Update and clarify the madvise_walk_vmas() function - this is a source
  of a great deal of confusion, so simplify it, stop using prev = NULL
  to signify that the mmap lock has been dropped (!), make that
  explicit, and add some comments to explain what's going on.
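For reference, here is a rough sketch of the state object as it stands at
the end of the series (abridged from the patches below; the per-field
comments are editorial annotations, not verbatim from the code):

	struct madvise_behavior_range {
		unsigned long start;
		unsigned long end;
	};

	struct madvise_behavior {
		struct mm_struct *mm;            /* target mm (2/5) */
		int behavior;                    /* madvise() behaviour */
		struct mmu_gather *tlb;          /* batched TLB flushing */
		enum madvise_lock_mode lock_mode;
		struct anon_vma_name *anon_name; /* anon VMA name (1/5) */
		/* The range currently being operated on (3/5). */
		struct madvise_behavior_range range;
		/* The VMA and VMA preceding it (if applicable) currently
		   targeted (4/5). */
		struct vm_area_struct *prev, *vma;
	};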
v2:
* Propagated tags (thanks everyone!)
* Don't separate out __MADV_SET_ANON_VMA_NAME and __MADV_SET_CLEAR_VMA_NAME,
just use __MADV_SET_ANON_VMA_NAME as per Zi.
* Eliminate is_anon_vma_name() as no longer necessary, addressing Zi's concern
around naming another way :)
* Put mm_struct abstraction of try_vma_read_lock() into 2/5 from 3/5 as per Zi.
* Added comment about get/put anon_vma_name in madvise_vma_behavior() as per
Vlastimil.
* Renamed have_anon_name to set_new_anon_name to make it clear why we make an
exception to this get/put behaviour in madvise_vma_behavior().
* Reworded the 1/5 commit message to make it clearer what's being done as
  per Vlastimil.
* Avoid comma-separated decls in struct madvise_behavior_range as per Zi and
Vlastimil.
* Put the fix for a silly development bug (range->start compared to end,
  not range->end) in 3/5 rather than 4/5 so as to eliminate it
  altogether, having fixed it during development but not having put the
  fix in the correct place :) as per Vlastimil.
* Renamed end to last_end in madvise_walk_vmas() and added a comment for
  clarity as per Vlastimil.
* Update madvise_walk_vmas() comment to no longer refer to a visitor
function.
* Separated out prev, vma fields in struct madvise_behavior as per
Vlastimil.
* Added assert on not holding VMA lock whenever mmap lock is dropped and
abstracted to mark_mmap_lock_dropped() so we always assert when we do
this, based on discussion with Vlastimil.
* Removed duplicate comment about weird -ENOMEM unmapped error behaviour.
v1:
https://lore.kernel.org/all/cover.1750363557.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (5):
mm/madvise: remove the visitor pattern and thread anon_vma state
mm/madvise: thread mm_struct through madvise_behavior
mm/madvise: thread VMA range state through madvise_behavior
mm/madvise: thread all madvise state through madv_behavior
mm/madvise: eliminate very confusing manipulation of prev VMA
include/linux/huge_mm.h | 9 +-
mm/khugepaged.c | 9 +-
mm/madvise.c | 585 +++++++++++++++++++++-------------------
3 files changed, 313 insertions(+), 290 deletions(-)
--
2.49.0
* [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Now that we have the madvise_behavior helper struct, we no longer need to
mess around with void * pointers in order to propagate anon_vma_name, and
this means we can get rid of the confusing and inconsistent visitor
pattern implementation in madvise_vma_anon_name().
This means we now have a single state object that threads through most of
madvise()'s logic and a single code path which executes the majority of
madvise() behaviour (we maintain separate logic for failure injection and
memory population for the time being).
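As a rough illustration, the single dispatch point now reads (simplified
from the madvise_do_behavior() hunk in the diff below):

	/* Simplified sketch of madvise_do_behavior()'s dispatch. */
	if (is_madvise_populate(behavior))
		error = madvise_populate(mm, start, end, behavior);
	else
		error = madvise_walk_vmas(mm, start, end, madv_behavior);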
We are able to remove the visitor pattern by handling the anon_vma_name
setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME.
This uses a negative value, so it can never collide with any behaviour we
might add to the UAPI.
Additionally, the madvise_behavior_valid() check ensures that
user-specified behaviours are strictly limited to those we permit, and
this flag is of course excluded from that set.
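To illustrate why a negative value is safe (sketch only, not the actual
function body - the real madvise_behavior_valid() enumerates many more
cases):

	static bool madvise_behavior_valid(int behavior)
	{
		switch (behavior) {
		case MADV_DONTNEED:
		case MADV_FREE:
		/* ... all other permitted UAPI behaviours ... */
			return true;
		default:
			/* __MADV_SET_ANON_VMA_NAME (-1) lands here. */
			return false;
		}
	}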
We are able to propagate the anon_vma_name object through use of the
madvise_behavior helper struct.
Doing this results in a can_modify_vma_madv() check for anonymous VMA
name changes; however, this causes no issues, as this operation is not
prohibited.
We can also then reuse more code and drop the redundant
madvise_vma_anon_name() function altogether.
Additionally separate out behaviours that update VMAs from those that do
not.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 166 +++++++++++++++++++++++++--------------------------
1 file changed, 83 insertions(+), 83 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 070132f9842b..93837b980cc2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -37,6 +37,8 @@
#include "internal.h"
#include "swap.h"
+#define __MADV_SET_ANON_VMA_NAME (-1)
+
/*
* Maximum number of attempts we make to install guard pages before we give up
* and return -ERESTARTNOINTR to have userspace try again.
@@ -59,9 +61,13 @@ struct madvise_behavior {
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
+ struct anon_vma_name *anon_name;
};
#ifdef CONFIG_ANON_VMA_NAME
+static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct madvise_behavior *madv_behavior);
+
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
struct anon_vma_name *anon_name;
@@ -112,6 +118,35 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
return 0;
}
+
+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+ unsigned long len_in, struct anon_vma_name *anon_name)
+{
+ unsigned long end;
+ unsigned long len;
+ struct madvise_behavior madv_behavior = {
+ .behavior = __MADV_SET_ANON_VMA_NAME,
+ .lock_mode = MADVISE_MMAP_WRITE_LOCK,
+ .anon_name = anon_name,
+ };
+
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return -EINVAL;
+
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+
+ if (end == start)
+ return 0;
+
+ return madvise_walk_vmas(mm, start, end, &madv_behavior);
+}
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *anon_name)
@@ -1252,13 +1287,13 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- void *behavior_arg)
+ struct madvise_behavior *madv_behavior)
{
- struct madvise_behavior *arg = behavior_arg;
- int behavior = arg->behavior;
- int error;
- struct anon_vma_name *anon_name;
+ int behavior = madv_behavior->behavior;
+ struct anon_vma_name *anon_name = madv_behavior->anon_name;
vm_flags_t new_flags = vma->vm_flags;
+ bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME;
+ int error;
if (unlikely(!can_modify_vma_madv(vma, behavior)))
return -EPERM;
@@ -1275,7 +1310,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end, arg);
+ return madvise_dontneed_free(vma, prev, start, end,
+ madv_behavior);
+ case MADV_COLLAPSE:
+ return madvise_collapse(vma, prev, start, end);
+ case MADV_GUARD_INSTALL:
+ return madvise_guard_install(vma, prev, start, end);
+ case MADV_GUARD_REMOVE:
+ return madvise_guard_remove(vma, prev, start, end);
+
+ /* The below behaviours update VMAs via madvise_update_vma(). */
+
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1325,21 +1370,29 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
if (error)
goto out;
break;
- case MADV_COLLAPSE:
- return madvise_collapse(vma, prev, start, end);
- case MADV_GUARD_INSTALL:
- return madvise_guard_install(vma, prev, start, end);
- case MADV_GUARD_REMOVE:
- return madvise_guard_remove(vma, prev, start, end);
+ case __MADV_SET_ANON_VMA_NAME:
+ /* Only anonymous mappings can be named */
+ if (vma->vm_file && !vma_is_anon_shmem(vma))
+ return -EBADF;
+ break;
}
/* We cannot provide prev in this lock mode. */
- VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
- anon_name = anon_vma_name(vma);
- anon_vma_name_get(anon_name);
+ VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
+
+ /*
+ * madvise_update_vma() might cause a VMA merge which could put an
+ * anon_vma_name, so we must hold an additional reference on the
+ * anon_vma_name so it doesn't disappear from under us.
+ */
+ if (!set_new_anon_name) {
+ anon_name = anon_vma_name(vma);
+ anon_vma_name_get(anon_name);
+ }
error = madvise_update_vma(vma, prev, start, end, new_flags,
anon_name);
- anon_vma_name_put(anon_name);
+ if (!set_new_anon_name)
+ anon_vma_name_put(anon_name);
out:
/*
@@ -1523,20 +1576,17 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
}
/*
- * Walk the vmas in range [start,end), and call the visit function on each one.
- * The visit function will get start and end parameters that cover the overlap
- * between the current vma and the original range. Any unmapped regions in the
- * original range will result in this function returning -ENOMEM while still
- * calling the visit function on all of the existing vmas in the range.
- * Must be called with the mmap_lock held for reading or writing.
+ * Walk the vmas in range [start,end), and call the madvise_vma_behavior
+ * function on each one. The function will get start and end parameters that
+ * cover the overlap between the current vma and the original range. Any
+ * unmapped regions in the original range will result in this function returning
+ * -ENOMEM while still calling the madvise_vma_behavior function on all of the
+ * existing vmas in the range. Must be called with the mmap_lock held for
+ * reading or writing.
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior,
- void *arg,
- int (*visit)(struct vm_area_struct *vma,
- struct vm_area_struct **prev, unsigned long start,
- unsigned long end, void *arg))
+ unsigned long end, struct madvise_behavior *madv_behavior)
{
struct vm_area_struct *vma;
struct vm_area_struct *prev;
@@ -1548,11 +1598,12 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
* If VMA read lock is supported, apply madvise to a single VMA
* tentatively, avoiding walking VMAs.
*/
- if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
vma = try_vma_read_lock(mm, madv_behavior, start, end);
if (vma) {
prev = vma;
- error = visit(vma, &prev, start, end, arg);
+ error = madvise_vma_behavior(vma, &prev, start, end,
+ madv_behavior);
vma_end_read(vma);
return error;
}
@@ -1586,7 +1637,8 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = visit(vma, &prev, start, tmp, arg);
+ error = madvise_vma_behavior(vma, &prev, start, tmp,
+ madv_behavior);
if (error)
return error;
start = tmp;
@@ -1603,57 +1655,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
return unmapped_error;
}
-#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_vma_anon_name(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- void *anon_name)
-{
- int error;
-
- /* Only anonymous mappings can be named */
- if (vma->vm_file && !vma_is_anon_shmem(vma))
- return -EBADF;
-
- error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
- anon_name);
-
- /*
- * madvise() returns EAGAIN if kernel resources, such as
- * slab, are temporarily unavailable.
- */
- if (error == -ENOMEM)
- error = -EAGAIN;
- return error;
-}
-
-int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
- unsigned long len_in, struct anon_vma_name *anon_name)
-{
- unsigned long end;
- unsigned long len;
-
- if (start & ~PAGE_MASK)
- return -EINVAL;
- len = (len_in + ~PAGE_MASK) & PAGE_MASK;
-
- /* Check to see whether len was rounded up from small -ve to zero */
- if (len_in && !len)
- return -EINVAL;
-
- end = start + len;
- if (end < start)
- return -EINVAL;
-
- if (end == start)
- return 0;
-
- return madvise_walk_vmas(mm, start, end, NULL, anon_name,
- madvise_vma_anon_name);
-}
-#endif /* CONFIG_ANON_VMA_NAME */
-
-
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -1845,8 +1846,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_madvise_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end, madv_behavior,
- madv_behavior, madvise_vma_behavior);
+ error = madvise_walk_vmas(mm, start, end, madv_behavior);
blk_finish_plug(&plug);
return error;
}
--
2.49.0
* [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
There's no need to thread a pointer to the mm_struct, nor to have
different function signatures for each behaviour; instead, store state
consistently in the struct madvise_behavior object and use it for all
madvise() actions.
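For example, do_madvise() now initialises the state object once and the
helpers derive the mm from it (abridged from the diff below):

	struct madvise_behavior madv_behavior = {
		.mm = mm,
		.behavior = behavior,
		.tlb = &tlb,
	};

	error = madvise_lock(&madv_behavior);
	...
	madvise_init_tlb(&madv_behavior);
	error = madvise_do_behavior(start, len_in, &madv_behavior);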
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 110 ++++++++++++++++++++++++++-------------------------
1 file changed, 57 insertions(+), 53 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 93837b980cc2..f1109c2c63a4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -58,6 +58,7 @@ enum madvise_lock_mode {
};
struct madvise_behavior {
+ struct mm_struct *mm;
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
@@ -65,8 +66,8 @@ struct madvise_behavior {
};
#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior);
+static int madvise_walk_vmas(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior);
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
@@ -125,6 +126,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
unsigned long end;
unsigned long len;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.behavior = __MADV_SET_ANON_VMA_NAME,
.lock_mode = MADVISE_MMAP_WRITE_LOCK,
.anon_name = anon_name,
@@ -145,7 +147,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(mm, start, end, &madv_behavior);
+ return madvise_walk_vmas(start, end, &madv_behavior);
}
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
@@ -991,10 +993,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
-static long madvise_populate(struct mm_struct *mm, unsigned long start,
- unsigned long end, int behavior)
+static long madvise_populate(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
- const bool write = behavior == MADV_POPULATE_WRITE;
+ struct mm_struct *mm = madv_behavior->mm;
+ const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE;
int locked = 1;
long pages;
@@ -1408,15 +1411,14 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
/*
* Error injection support for memory error handling.
*/
-static int madvise_inject_error(int behavior,
- unsigned long start, unsigned long end)
+static int madvise_inject_error(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
unsigned long size;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
-
for (; start < end; start += size) {
unsigned long pfn;
struct page *page;
@@ -1434,7 +1436,7 @@ static int madvise_inject_error(int behavior,
*/
size = page_size(compound_head(page));
- if (behavior == MADV_SOFT_OFFLINE) {
+ if (madv_behavior->behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
@@ -1453,9 +1455,9 @@ static int madvise_inject_error(int behavior,
return 0;
}
-static bool is_memory_failure(int behavior)
+static bool is_memory_failure(struct madvise_behavior *madv_behavior)
{
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_HWPOISON:
case MADV_SOFT_OFFLINE:
return true;
@@ -1466,13 +1468,13 @@ static bool is_memory_failure(int behavior)
#else
-static int madvise_inject_error(int behavior,
- unsigned long start, unsigned long end)
+static int madvise_inject_error(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
return 0;
}
-static bool is_memory_failure(int behavior)
+static bool is_memory_failure(struct madvise_behavior *madv_behavior)
{
return false;
}
@@ -1549,10 +1551,11 @@ static bool process_madvise_remote_valid(int behavior)
* If a VMA read lock could not be acquired, we return NULL and expect caller to
* fallback to mmap lock behaviour.
*/
-static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
+static struct vm_area_struct *try_vma_read_lock(
struct madvise_behavior *madv_behavior,
unsigned long start, unsigned long end)
{
+ struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
vma = lock_vma_under_rcu(mm, start);
@@ -1585,9 +1588,10 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
* reading or writing.
*/
static
-int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior)
+int madvise_walk_vmas(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
struct vm_area_struct *prev;
unsigned long tmp;
@@ -1599,7 +1603,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
* tentatively, avoiding walking VMAs.
*/
if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
- vma = try_vma_read_lock(mm, madv_behavior, start, end);
+ vma = try_vma_read_lock(madv_behavior, start, end);
if (vma) {
prev = vma;
error = madvise_vma_behavior(vma, &prev, start, end,
@@ -1662,12 +1666,10 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
*/
static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
{
- int behavior = madv_behavior->behavior;
-
- if (is_memory_failure(behavior))
+ if (is_memory_failure(madv_behavior))
return MADVISE_NO_LOCK;
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_COLD:
@@ -1687,9 +1689,9 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
}
}
-static int madvise_lock(struct mm_struct *mm,
- struct madvise_behavior *madv_behavior)
+static int madvise_lock(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
switch (lock_mode) {
@@ -1711,9 +1713,10 @@ static int madvise_lock(struct mm_struct *mm,
return 0;
}
-static void madvise_unlock(struct mm_struct *mm,
- struct madvise_behavior *madv_behavior)
+static void madvise_unlock(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
+
switch (madv_behavior->lock_mode) {
case MADVISE_NO_LOCK:
return;
@@ -1743,11 +1746,10 @@ static bool madvise_batch_tlb_flush(int behavior)
}
}
-static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
- struct mm_struct *mm)
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior)
{
if (madvise_batch_tlb_flush(madv_behavior->behavior))
- tlb_gather_mmu(madv_behavior->tlb, mm);
+ tlb_gather_mmu(madv_behavior->tlb, madv_behavior->mm);
}
static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
@@ -1802,9 +1804,9 @@ static bool madvise_should_skip(unsigned long start, size_t len_in,
return false;
}
-static bool is_madvise_populate(int behavior)
+static bool is_madvise_populate(struct madvise_behavior *madv_behavior)
{
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
return true;
@@ -1828,25 +1830,26 @@ static inline unsigned long get_untagged_addr(struct mm_struct *mm,
untagged_addr_remote(mm, start);
}
-static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in,
+static int madvise_do_behavior(unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
{
- int behavior = madv_behavior->behavior;
struct blk_plug plug;
unsigned long end;
int error;
- if (is_memory_failure(behavior))
- return madvise_inject_error(behavior, start, start + len_in);
- start = get_untagged_addr(mm, start);
+ if (is_memory_failure(madv_behavior)) {
+ end = start + len_in;
+ return madvise_inject_error(start, end, madv_behavior);
+ }
+
+ start = get_untagged_addr(madv_behavior->mm, start);
end = start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
- if (is_madvise_populate(behavior))
- error = madvise_populate(mm, start, end, behavior);
+ if (is_madvise_populate(madv_behavior))
+ error = madvise_populate(start, end, madv_behavior);
else
- error = madvise_walk_vmas(mm, start, end, madv_behavior);
+ error = madvise_walk_vmas(start, end, madv_behavior);
blk_finish_plug(&plug);
return error;
}
@@ -1928,19 +1931,20 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
int error;
struct mmu_gather tlb;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.behavior = behavior,
.tlb = &tlb,
};
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
- error = madvise_lock(mm, &madv_behavior);
+ error = madvise_lock(&madv_behavior);
if (error)
return error;
- madvise_init_tlb(&madv_behavior, mm);
- error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+ madvise_init_tlb(&madv_behavior);
+ error = madvise_do_behavior(start, len_in, &madv_behavior);
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
return error;
}
@@ -1958,16 +1962,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
size_t total_len;
struct mmu_gather tlb;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.behavior = behavior,
.tlb = &tlb,
};
total_len = iov_iter_count(iter);
- ret = madvise_lock(mm, &madv_behavior);
+ ret = madvise_lock(&madv_behavior);
if (ret)
return ret;
- madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior);
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1977,8 +1982,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in,
- &madv_behavior);
+ ret = madvise_do_behavior(start, len_in, &madv_behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
@@ -1997,11 +2001,11 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
/* Drop and reacquire lock to unwind race. */
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
- ret = madvise_lock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
+ ret = madvise_lock(&madv_behavior);
if (ret)
goto out;
- madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior);
continue;
}
if (ret < 0)
@@ -2009,7 +2013,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
iov_iter_advance(iter, iter_iov_len(iter));
}
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
out:
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.49.0
* [PATCH v2 3/5] mm/madvise: thread VMA range state through madvise_behavior
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Rather than updating start and a confusing local variable 'tmp' in
madvise_walk_vmas(), instead store the current range being operated upon
as a range pair in the struct madvise_behavior helper object, and use
this consistently in all operations.
This makes it clearer what is going on and opens the door to further
cleanup, now that we store state regarding what is currently being
operated upon here.
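In sketch form, the walk loop then becomes (simplified from the diff
below - prev handling and the unmapped-range bookkeeping are omitted):

	unsigned long last_end = range->end;

	for (;;) {
		/* Clamp the current range to this VMA. */
		range->end = min(vma->vm_end, last_end);
		error = madvise_vma_behavior(vma, &prev, madv_behavior);
		if (error)
			return error;
		/* Advance to the next VMA. */
		range->start = range->end;
		if (range->start >= last_end)
			break;
		vma = find_vma(mm, range->start);
	}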
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 103 ++++++++++++++++++++++++++++-----------------------
1 file changed, 57 insertions(+), 46 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index f1109c2c63a4..764a532ff62a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -57,17 +57,27 @@ enum madvise_lock_mode {
MADVISE_VMA_READ_LOCK,
};
+struct madvise_behavior_range {
+ unsigned long start;
+ unsigned long end;
+};
+
struct madvise_behavior {
struct mm_struct *mm;
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
struct anon_vma_name *anon_name;
+
+ /*
+ * The range over which the behaviour is currently being applied. If
+ * traversing multiple VMAs, this is updated for each.
+ */
+ struct madvise_behavior_range range;
};
#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_walk_vmas(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior);
+static int madvise_walk_vmas(struct madvise_behavior *madv_behavior);
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
@@ -147,7 +157,9 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
if (end == start)
return 0;
- return madvise_walk_vmas(start, end, &madv_behavior);
+ madv_behavior.range.start = start;
+ madv_behavior.range.end = end;
+ return madvise_walk_vmas(&madv_behavior);
}
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
@@ -993,12 +1005,13 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
-static long madvise_populate(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static long madvise_populate(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE;
int locked = 1;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
long pages;
while (start < end) {
@@ -1289,12 +1302,13 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
*/
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
struct madvise_behavior *madv_behavior)
{
int behavior = madv_behavior->behavior;
struct anon_vma_name *anon_name = madv_behavior->anon_name;
vm_flags_t new_flags = vma->vm_flags;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME;
int error;
@@ -1411,10 +1425,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
/*
* Error injection support for memory error handling.
*/
-static int madvise_inject_error(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static int madvise_inject_error(struct madvise_behavior *madv_behavior)
{
unsigned long size;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -1468,8 +1483,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior)
#else
-static int madvise_inject_error(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static int madvise_inject_error(struct madvise_behavior *madv_behavior)
{
return 0;
}
@@ -1551,21 +1565,20 @@ static bool process_madvise_remote_valid(int behavior)
* If a VMA read lock could not be acquired, we return NULL and expect caller to
* fallback to mmap lock behaviour.
*/
-static struct vm_area_struct *try_vma_read_lock(
- struct madvise_behavior *madv_behavior,
- unsigned long start, unsigned long end)
+static
+struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
- vma = lock_vma_under_rcu(mm, start);
+ vma = lock_vma_under_rcu(mm, madv_behavior->range.start);
if (!vma)
goto take_mmap_read_lock;
/*
* Must span only a single VMA; uffd and remote processes are
* unsupported.
*/
- if (end > vma->vm_end || current->mm != mm ||
+ if (madv_behavior->range.end > vma->vm_end || current->mm != mm ||
userfaultfd_armed(vma)) {
vma_end_read(vma);
goto take_mmap_read_lock;
@@ -1588,13 +1601,14 @@ static struct vm_area_struct *try_vma_read_lock(
* reading or writing.
*/
static
-int madvise_walk_vmas(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+ /* range is updated to span each VMA, so store end of entire range. */
+ unsigned long last_end = range->end;
struct vm_area_struct *vma;
struct vm_area_struct *prev;
- unsigned long tmp;
int unmapped_error = 0;
int error;
@@ -1603,11 +1617,10 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
* tentatively, avoiding walking VMAs.
*/
if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
- vma = try_vma_read_lock(madv_behavior, start, end);
+ vma = try_vma_read_lock(madv_behavior);
if (vma) {
prev = vma;
- error = madvise_vma_behavior(vma, &prev, start, end,
- madv_behavior);
+ error = madvise_vma_behavior(vma, &prev, madv_behavior);
vma_end_read(vma);
return error;
}
@@ -1618,8 +1631,8 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(mm, start, &prev);
- if (vma && start > vma->vm_start)
+ vma = find_vma_prev(mm, range->start, &prev);
+ if (vma && range->start > vma->vm_start)
prev = vma;
for (;;) {
@@ -1627,33 +1640,30 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
if (!vma)
return -ENOMEM;
- /* Here start < (end|vma->vm_end). */
- if (start < vma->vm_start) {
+ /* Here start < (last_end|vma->vm_end). */
+ if (range->start < vma->vm_start) {
unmapped_error = -ENOMEM;
- start = vma->vm_start;
- if (start >= end)
+ range->start = vma->vm_start;
+ if (range->start >= last_end)
break;
}
- /* Here vma->vm_start <= start < (end|vma->vm_end) */
- tmp = vma->vm_end;
- if (end < tmp)
- tmp = end;
+ /* Here vma->vm_start <= range->start < (last_end|vma->vm_end) */
+ range->end = min(vma->vm_end, last_end);
- /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma_behavior(vma, &prev, start, tmp,
- madv_behavior);
+ /* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */
+ error = madvise_vma_behavior(vma, &prev, madv_behavior);
if (error)
return error;
- start = tmp;
- if (prev && start < prev->vm_end)
- start = prev->vm_end;
- if (start >= end)
+ range->start = range->end;
+ if (prev && range->start < prev->vm_end)
+ range->start = prev->vm_end;
+ if (range->start >= last_end)
break;
if (prev)
vma = find_vma(mm, prev->vm_end);
else /* madvise_remove dropped mmap_lock */
- vma = find_vma(mm, start);
+ vma = find_vma(mm, range->start);
}
return unmapped_error;
@@ -1834,22 +1844,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
{
struct blk_plug plug;
- unsigned long end;
int error;
+ struct madvise_behavior_range *range = &madv_behavior->range;
if (is_memory_failure(madv_behavior)) {
- end = start + len_in;
- return madvise_inject_error(start, end, madv_behavior);
+ range->start = start;
+ range->end = start + len_in;
+ return madvise_inject_error(madv_behavior);
}
- start = get_untagged_addr(madv_behavior->mm, start);
- end = start + PAGE_ALIGN(len_in);
+ range->start = get_untagged_addr(madv_behavior->mm, start);
+ range->end = range->start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
if (is_madvise_populate(madv_behavior))
- error = madvise_populate(start, end, madv_behavior);
+ error = madvise_populate(madv_behavior);
else
- error = madvise_walk_vmas(start, end, madv_behavior);
+ error = madvise_walk_vmas(madv_behavior);
blk_finish_plug(&plug);
return error;
}
--
2.49.0
* [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Doing so means we can get rid of all the weird struct vm_area_struct
**prev stuff; everything becomes consistent, and in future, if we want to
change behaviour, there's a single place where all relevant state is
stored.
This also allows us to update try_vma_read_lock() to be a little more
succinct and set up state for us, as well as cleaning up
madvise_update_vma().
We also update the debug assertion prior to madvise_update_vma() to
assert that this is a write operation, as correctly pointed out by Barry
in the relevant thread.
We can't reasonably update the madvise functions that live outside of
mm/madvise.c so we leave those as-is.
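As an example of the resulting shape, each per-VMA helper now takes only
the state object (excerpted and abridged from the diff below):

	static long madvise_cold(struct madvise_behavior *madv_behavior)
	{
		struct vm_area_struct *vma = madv_behavior->vma;
		...
		madv_behavior->prev = vma;
		...
	}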
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 281 ++++++++++++++++++++++++++-------------------------
1 file changed, 145 insertions(+), 136 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 764a532ff62a..f04b8165e2ab 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -74,6 +74,8 @@ struct madvise_behavior {
* traversing multiple VMAs, this is updated for each.
*/
struct madvise_behavior_range range;
+ /* The VMA and VMA preceding it (if applicable) currently targeted. */
+ struct vm_area_struct *prev, *vma;
};
#ifdef CONFIG_ANON_VMA_NAME
@@ -177,26 +179,27 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
* Caller should ensure anon_name stability by raising its refcount even when
* anon_name belongs to a valid vma because this function might free that vma.
*/
-static int madvise_update_vma(struct vm_area_struct *vma,
- struct vm_area_struct **prev, unsigned long start,
- unsigned long end, vm_flags_t new_flags,
- struct anon_vma_name *anon_name)
+static int madvise_update_vma(vm_flags_t new_flags,
+ struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
int error;
- VMA_ITERATOR(vmi, mm, start);
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+ struct anon_vma_name *anon_name = madv_behavior->anon_name;
+ VMA_ITERATOR(vmi, madv_behavior->mm, range->start);
if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) {
- *prev = vma;
+ madv_behavior->prev = vma;
return 0;
}
- vma = vma_modify_flags_name(&vmi, *prev, vma, start, end, new_flags,
- anon_name);
+ vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma,
+ range->start, range->end, new_flags, anon_name);
if (IS_ERR(vma))
return PTR_ERR(vma);
- *prev = vma;
+ madv_behavior->vma = vma;
+ madv_behavior->prev = vma;
/* vm_flags is protected by the mmap_lock held in write mode. */
vma_start_write(vma);
@@ -301,15 +304,16 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
-static long madvise_willneed(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_willneed(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct mm_struct *mm = madv_behavior->mm;
struct file *file = vma->vm_file;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
loff_t offset;
- *prev = vma;
+ madv_behavior->prev = vma;
#ifdef CONFIG_SWAP
if (!file) {
walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma);
@@ -338,7 +342,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
* vma's reference to the file) can go away as soon as we drop
* mmap_lock.
*/
- *prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
get_file(file);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -603,16 +607,19 @@ static const struct mm_walk_ops cold_walk_ops = {
};
static void madvise_cold_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct madvise_behavior *madv_behavior)
+
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
struct madvise_walk_private walk_private = {
.pageout = false,
.tlb = tlb,
};
tlb_start_vma(tlb, vma);
- walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private);
+ walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops,
+ &walk_private);
tlb_end_vma(tlb, vma);
}
@@ -621,28 +628,26 @@ static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP|VM_HUGETLB));
}
-static long madvise_cold(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_cold(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
struct mmu_gather tlb;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_gather_mmu(&tlb, madv_behavior->mm);
+ madvise_cold_page_range(&tlb, madv_behavior);
tlb_finish_mmu(&tlb);
return 0;
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct vm_area_struct *vma,
+ struct madvise_behavior_range *range)
{
struct madvise_walk_private walk_private = {
.pageout = true,
@@ -650,18 +655,17 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
};
tlb_start_vma(tlb, vma);
- walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private);
+ walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops,
+ &walk_private);
tlb_end_vma(tlb, vma);
}
-static long madvise_pageout(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
+ struct vm_area_struct *vma = madv_behavior->vma;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -676,8 +680,8 @@ static long madvise_pageout(struct vm_area_struct *vma,
return 0;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_gather_mmu(&tlb, madv_behavior->mm);
+ madvise_pageout_page_range(&tlb, vma, &madv_behavior->range);
tlb_finish_mmu(&tlb);
return 0;
@@ -840,11 +844,12 @@ static inline enum page_walk_lock get_walk_lock(enum madvise_lock_mode mode)
}
}
-static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
- struct vm_area_struct *vma,
- unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = madv_behavior->mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ unsigned long start_addr = madv_behavior->range.start;
+ unsigned long end_addr = madv_behavior->range.end;
struct mmu_notifier_range range;
struct mmu_gather *tlb = madv_behavior->tlb;
struct mm_walk_ops walk_ops = {
@@ -896,25 +901,28 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior,
- struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
+
{
+ struct madvise_behavior_range *range = &madv_behavior->range;
struct zap_details details = {
.reclaim_pt = true,
.even_cows = true,
};
zap_page_range_single_batched(
- madv_behavior->tlb, vma, start, end - start, &details);
+ madv_behavior->tlb, madv_behavior->vma, range->start,
+ range->end - range->start, &details);
return 0;
}
-static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long *end,
- int behavior)
+static
+bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior)
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ int behavior = madv_behavior->behavior;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+
if (!is_vm_hugetlb_page(vma)) {
unsigned int forbidden = VM_PFNMAP;
@@ -926,7 +934,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
if (behavior != MADV_DONTNEED && behavior != MADV_DONTNEED_LOCKED)
return false;
- if (start & ~huge_page_mask(hstate_vma(vma)))
+ if (range->start & ~huge_page_mask(hstate_vma(vma)))
return false;
/*
@@ -935,41 +943,40 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
* Avoid unexpected data loss by rounding down the number of
* huge pages freed.
*/
- *end = ALIGN_DOWN(*end, huge_page_size(hstate_vma(vma)));
+ range->end = ALIGN_DOWN(range->end, huge_page_size(hstate_vma(vma)));
return true;
}
-static long madvise_dontneed_free(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
+ struct madvise_behavior_range *range = &madv_behavior->range;
int behavior = madv_behavior->behavior;
- struct mm_struct *mm = vma->vm_mm;
- *prev = vma;
- if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior))
+ madv_behavior->prev = madv_behavior->vma;
+ if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
- if (start == end)
+ if (range->start == range->end)
return 0;
- if (!userfaultfd_remove(vma, start, end)) {
- *prev = NULL; /* mmap_lock has been dropped, prev is stale */
+ if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
+ struct vm_area_struct *vma;
+
+ madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
mmap_read_lock(mm);
- vma = vma_lookup(mm, start);
+ madv_behavior->vma = vma = vma_lookup(mm, range->start);
if (!vma)
return -ENOMEM;
/*
* Potential end adjustment for hugetlb vma is OK as
* the check below keeps end within vma.
*/
- if (!madvise_dontneed_free_valid_vma(vma, start, &end,
- behavior))
+ if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
- if (end > vma->vm_end) {
+ if (range->end > vma->vm_end) {
/*
* Don't fail if end > vma->vm_end. If the old
* vma was split while the mmap_lock was
@@ -982,7 +989,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* end-vma->vm_end range, but the manager can
* handle a repetition fine.
*/
- end = vma->vm_end;
+ range->end = vma->vm_end;
}
/*
* If the memory region between start and end was
@@ -991,16 +998,15 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* the adjustment for hugetlb vma above may have rounded
* end down to the start address.
*/
- if (start == end)
+ if (range->start == range->end)
return 0;
- VM_WARN_ON(start > end);
+ VM_WARN_ON(range->start > range->end);
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(
- madv_behavior, vma, start, end);
+ return madvise_dontneed_single_vma(madv_behavior);
else if (behavior == MADV_FREE)
- return madvise_free_single_vma(madv_behavior, vma, start, end);
+ return madvise_free_single_vma(madv_behavior);
else
return -EINVAL;
}
@@ -1048,16 +1054,17 @@ static long madvise_populate(struct madvise_behavior *madv_behavior)
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
-static long madvise_remove(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_remove(struct madvise_behavior *madv_behavior)
{
loff_t offset;
int error;
struct file *f;
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = madv_behavior->mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
- *prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1169,14 +1176,14 @@ static const struct mm_walk_ops guard_install_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static long madvise_guard_install(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_guard_install(struct madvise_behavior *madv_behavior)
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
long err;
int i;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!is_valid_guard_vma(vma, /* allow_locked = */false))
return -EINVAL;
@@ -1207,13 +1214,14 @@ static long madvise_guard_install(struct vm_area_struct *vma,
unsigned long nr_pages = 0;
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
- err = walk_page_range_mm(vma->vm_mm, start, end,
+ err = walk_page_range_mm(vma->vm_mm, range->start, range->end,
&guard_install_walk_ops, &nr_pages);
if (err < 0)
return err;
if (err == 0) {
- unsigned long nr_expected_pages = PHYS_PFN(end - start);
+ unsigned long nr_expected_pages =
+ PHYS_PFN(range->end - range->start);
VM_WARN_ON(nr_pages != nr_expected_pages);
return 0;
@@ -1223,7 +1231,8 @@ static long madvise_guard_install(struct vm_area_struct *vma,
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
- zap_page_range_single(vma, start, end - start, NULL);
+ zap_page_range_single(vma, range->start,
+ range->end - range->start, NULL);
}
/*
@@ -1279,11 +1288,12 @@ static const struct mm_walk_ops guard_remove_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static long madvise_guard_remove(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
{
- *prev = vma;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+
+ madv_behavior->prev = vma;
/*
* We're ok with removing guards in mlock()'d ranges, as this is a
* non-destructive action.
@@ -1291,7 +1301,7 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
if (!is_valid_guard_vma(vma, /* allow_locked = */true))
return -EINVAL;
- return walk_page_range_vma(vma, start, end,
+ return walk_page_range_vma(vma, range->start, range->end,
&guard_remove_walk_ops, NULL);
}
@@ -1300,41 +1310,38 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
* will handle splitting a vm area into separate areas, each area with its own
* behavior.
*/
-static int madvise_vma_behavior(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- struct madvise_behavior *madv_behavior)
+static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
{
int behavior = madv_behavior->behavior;
- struct anon_vma_name *anon_name = madv_behavior->anon_name;
+ struct vm_area_struct *vma = madv_behavior->vma;
vm_flags_t new_flags = vma->vm_flags;
- unsigned long start = madv_behavior->range.start;
- unsigned long end = madv_behavior->range.end;
bool set_new_anon_name = behavior == __MADV_SET_ANON_VMA_NAME;
+ struct madvise_behavior_range *range = &madv_behavior->range;
int error;
- if (unlikely(!can_modify_vma_madv(vma, behavior)))
+ if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
return -EPERM;
switch (behavior) {
case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
+ return madvise_remove(madv_behavior);
case MADV_WILLNEED:
- return madvise_willneed(vma, prev, start, end);
+ return madvise_willneed(madv_behavior);
case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
+ return madvise_cold(madv_behavior);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(madv_behavior);
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end,
- madv_behavior);
+ return madvise_dontneed_free(madv_behavior);
case MADV_COLLAPSE:
- return madvise_collapse(vma, prev, start, end);
+ return madvise_collapse(vma, &madv_behavior->prev,
+ range->start, range->end);
case MADV_GUARD_INSTALL:
- return madvise_guard_install(vma, prev, start, end);
+ return madvise_guard_install(madv_behavior);
case MADV_GUARD_REMOVE:
- return madvise_guard_remove(vma, prev, start, end);
+ return madvise_guard_remove(madv_behavior);
/* The below behaviours update VMAs via madvise_update_vma(). */
@@ -1351,18 +1358,18 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
new_flags |= VM_DONTCOPY;
break;
case MADV_DOFORK:
- if (vma->vm_flags & VM_IO)
+ if (new_flags & VM_IO)
return -EINVAL;
new_flags &= ~VM_DONTCOPY;
break;
case MADV_WIPEONFORK:
/* MADV_WIPEONFORK is only supported on anonymous memory. */
- if (vma->vm_file || vma->vm_flags & VM_SHARED)
+ if (vma->vm_file || new_flags & VM_SHARED)
return -EINVAL;
new_flags |= VM_WIPEONFORK;
break;
case MADV_KEEPONFORK:
- if (vma->vm_flags & VM_DROPPABLE)
+ if (new_flags & VM_DROPPABLE)
return -EINVAL;
new_flags &= ~VM_WIPEONFORK;
break;
@@ -1370,14 +1377,15 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
new_flags |= VM_DONTDUMP;
break;
case MADV_DODUMP:
- if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
- (vma->vm_flags & VM_DROPPABLE))
+ if ((!is_vm_hugetlb_page(vma) && (new_flags & VM_SPECIAL)) ||
+ (new_flags & VM_DROPPABLE))
return -EINVAL;
new_flags &= ~VM_DONTDUMP;
break;
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
- error = ksm_madvise(vma, start, end, behavior, &new_flags);
+ error = ksm_madvise(vma, range->start, range->end,
+ behavior, &new_flags);
if (error)
goto out;
break;
@@ -1394,8 +1402,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
break;
}
- /* We cannot provide prev in this lock mode. */
- VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
+ /* This is a write operation. */
+ VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
/*
* madvise_update_vma() might cause a VMA merge which could put an
@@ -1403,14 +1411,12 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
* anon_vma_name so it doesn't disappear from under us.
*/
if (!set_new_anon_name) {
- anon_name = anon_vma_name(vma);
- anon_vma_name_get(anon_name);
+ madv_behavior->anon_name = anon_vma_name(vma);
+ anon_vma_name_get(madv_behavior->anon_name);
}
- error = madvise_update_vma(vma, prev, start, end, new_flags,
- anon_name);
+ error = madvise_update_vma(new_flags, madv_behavior);
if (!set_new_anon_name)
- anon_vma_name_put(anon_name);
-
+ anon_vma_name_put(madv_behavior->anon_name);
out:
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -1560,13 +1566,13 @@ static bool process_madvise_remote_valid(int behavior)
* span either partially or fully.
*
* This function always returns with an appropriate lock held. If a VMA read
- * lock could be acquired, we return the locked VMA.
+ * lock could be acquired, we return true and set madv_behavior state
+ * accordingly.
*
- * If a VMA read lock could not be acquired, we return NULL and expect caller to
+ * If a VMA read lock could not be acquired, we return false and expect caller to
* fallback to mmap lock behaviour.
*/
-static
-struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
+static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
@@ -1583,12 +1589,14 @@ struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
vma_end_read(vma);
goto take_mmap_read_lock;
}
- return vma;
+ madv_behavior->prev = vma; /* Not currently required. */
+ madv_behavior->vma = vma;
+ return true;
take_mmap_read_lock:
mmap_read_lock(mm);
madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
- return NULL;
+ return false;
}
/*
@@ -1607,23 +1615,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
struct madvise_behavior_range *range = &madv_behavior->range;
/* range is updated to span each VMA, so store end of entire range. */
unsigned long last_end = range->end;
- struct vm_area_struct *vma;
- struct vm_area_struct *prev;
int unmapped_error = 0;
int error;
+ struct vm_area_struct *vma;
/*
* If VMA read lock is supported, apply madvise to a single VMA
* tentatively, avoiding walking VMAs.
*/
- if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
- vma = try_vma_read_lock(madv_behavior);
- if (vma) {
- prev = vma;
- error = madvise_vma_behavior(vma, &prev, madv_behavior);
- vma_end_read(vma);
- return error;
- }
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
+ try_vma_read_lock(madv_behavior)) {
+ error = madvise_vma_behavior(madv_behavior);
+ vma_end_read(madv_behavior->vma);
+ return error;
}
/*
@@ -1631,11 +1635,13 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(mm, range->start, &prev);
+ vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
if (vma && range->start > vma->vm_start)
- prev = vma;
+ madv_behavior->prev = vma;
for (;;) {
+ struct vm_area_struct *prev;
+
/* Still start < end. */
if (!vma)
return -ENOMEM;
@@ -1652,9 +1658,12 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
range->end = min(vma->vm_end, last_end);
/* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */
- error = madvise_vma_behavior(vma, &prev, madv_behavior);
+ madv_behavior->vma = vma;
+ error = madvise_vma_behavior(madv_behavior);
if (error)
return error;
+ prev = madv_behavior->prev;
+
range->start = range->end;
if (prev && range->start < prev->vm_end)
range->start = prev->vm_end;
--
2.49.0
* [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-20 15:33 [PATCH v2 0/5] madvise cleanup Lorenzo Stoakes
` (3 preceding siblings ...)
2025-06-20 15:33 ` [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
@ 2025-06-20 15:33 ` Lorenzo Stoakes
2025-06-20 17:13 ` Zi Yan
` (2 more replies)
2025-06-20 17:21 ` [PATCH v2 0/5] madvise cleanup SeongJae Park
5 siblings, 3 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 15:33 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
The madvise code has for the longest time had very confusing code around
the 'prev' VMA pointer passed around various functions which, in all cases
except madvise_update_vma(), is unused and instead simply updated as soon
as the function is invoked.
To compound the confusion, the prev pointer is also used to indicate to the
caller that the mmap lock has been dropped and that we can therefore not
safely access the end of the current VMA (which might have been updated by
madvise_update_vma()).
Clear up this confusion by not setting prev = vma anywhere except in
madvise_walk_vmas(), updating all references to prev (which will always be
equal to vma after madvise_vma_behavior() is invoked), and adding a flag to
indicate that the lock has been dropped, making this explicit.
Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
not appropriate here, but we leave existing code as-is).
We finally adjust the madvise_walk_vmas() logic to be a little clearer -
delaying the assignment of the end of the range to the start of the new
range until the last moment and handling the lock being dropped scenario
immediately.
Additionally add some explanatory comments.
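To make the new convention concrete, the heart of the walk loop ends up
taking the following shape (a condensed sketch of the code in this patch,
with the surrounding range handling elided):

        madv_behavior->prev = prev;
        madv_behavior->vma = vma;
        error = madvise_vma_behavior(madv_behavior);
        if (error)
                return error;

        if (madv_behavior->lock_dropped) {
                /* Lock drop is now signalled explicitly, not via prev == NULL. */
                prev = NULL;
                vma = NULL;
                madv_behavior->lock_dropped = false;
        } else {
                prev = vma;
                vma = madv_behavior->vma;
        }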
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/huge_mm.h | 9 +++--
mm/khugepaged.c | 9 ++---
mm/madvise.c | 77 +++++++++++++++++++++--------------------
3 files changed, 47 insertions(+), 48 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8f1b15213f61..4d5bb67dc4ec 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -433,9 +433,8 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
int advice);
-int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end);
+int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, bool *lock_dropped);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct vm_area_struct *next);
spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -596,8 +595,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
}
static inline int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end, bool *lock_dropped)
{
return -EINVAL;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3495a20cef5e..1aa7ca67c756 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2727,8 +2727,8 @@ static int madvise_collapse_errno(enum scan_result r)
}
}
-int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, bool *lock_dropped)
{
struct collapse_control *cc;
struct mm_struct *mm = vma->vm_mm;
@@ -2739,8 +2739,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- *prev = vma;
-
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
return -EINVAL;
@@ -2788,7 +2786,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
&mmap_locked, cc);
}
if (!mmap_locked)
- *prev = NULL; /* Tell caller we dropped mmap_lock */
+ *lock_dropped = true;
handle_result:
switch (result) {
@@ -2798,7 +2796,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
break;
case SCAN_PTE_MAPPED_HUGEPAGE:
BUG_ON(mmap_locked);
- BUG_ON(*prev);
mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
mmap_read_unlock(mm);
diff --git a/mm/madvise.c b/mm/madvise.c
index f04b8165e2ab..4491bf080f55 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -75,7 +75,9 @@ struct madvise_behavior {
*/
struct madvise_behavior_range range;
/* The VMA and VMA preceding it (if applicable) currently targeted. */
- struct vm_area_struct *prev, *vma;
+ struct vm_area_struct *prev;
+ struct vm_area_struct *vma;
+ bool lock_dropped;
};
#ifdef CONFIG_ANON_VMA_NAME
@@ -188,10 +190,8 @@ static int madvise_update_vma(vm_flags_t new_flags,
struct anon_vma_name *anon_name = madv_behavior->anon_name;
VMA_ITERATOR(vmi, madv_behavior->mm, range->start);
- if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) {
- madv_behavior->prev = vma;
+ if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name))
return 0;
- }
vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma,
range->start, range->end, new_flags, anon_name);
@@ -199,7 +199,6 @@ static int madvise_update_vma(vm_flags_t new_flags,
return PTR_ERR(vma);
madv_behavior->vma = vma;
- madv_behavior->prev = vma;
/* vm_flags is protected by the mmap_lock held in write mode. */
vma_start_write(vma);
@@ -301,6 +300,12 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
}
#endif /* CONFIG_SWAP */
+static void mark_mmap_lock_dropped(struct madvise_behavior *madv_behavior)
+{
+ VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
+ madv_behavior->lock_dropped = true;
+}
+
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
@@ -313,7 +318,6 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
unsigned long end = madv_behavior->range.end;
loff_t offset;
- madv_behavior->prev = vma;
#ifdef CONFIG_SWAP
if (!file) {
walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma);
@@ -342,7 +346,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
* vma's reference to the file) can go away as soon as we drop
* mmap_lock.
*/
- madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ mark_mmap_lock_dropped(madv_behavior);
get_file(file);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -633,7 +637,6 @@ static long madvise_cold(struct madvise_behavior *madv_behavior)
struct vm_area_struct *vma = madv_behavior->vma;
struct mmu_gather tlb;
- madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -665,7 +668,6 @@ static long madvise_pageout(struct madvise_behavior *madv_behavior)
struct mmu_gather tlb;
struct vm_area_struct *vma = madv_behavior->vma;
- madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -954,7 +956,6 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
struct madvise_behavior_range *range = &madv_behavior->range;
int behavior = madv_behavior->behavior;
- madv_behavior->prev = madv_behavior->vma;
if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
@@ -964,8 +965,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
struct vm_area_struct *vma;
- madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
-
+ mark_mmap_lock_dropped(madv_behavior);
mmap_read_lock(mm);
madv_behavior->vma = vma = vma_lookup(mm, range->start);
if (!vma)
@@ -1064,7 +1064,7 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
unsigned long start = madv_behavior->range.start;
unsigned long end = madv_behavior->range.end;
- madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ mark_mmap_lock_dropped(madv_behavior);
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1183,7 +1183,6 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
long err;
int i;
- madv_behavior->prev = vma;
if (!is_valid_guard_vma(vma, /* allow_locked = */false))
return -EINVAL;
@@ -1293,7 +1292,6 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
struct vm_area_struct *vma = madv_behavior->vma;
struct madvise_behavior_range *range = &madv_behavior->range;
- madv_behavior->prev = vma;
/*
* We're ok with removing guards in mlock()'d ranges, as this is a
* non-destructive action.
@@ -1336,8 +1334,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
case MADV_DONTNEED_LOCKED:
return madvise_dontneed_free(madv_behavior);
case MADV_COLLAPSE:
- return madvise_collapse(vma, &madv_behavior->prev,
- range->start, range->end);
+ return madvise_collapse(vma, range->start, range->end,
+ &madv_behavior->lock_dropped);
case MADV_GUARD_INSTALL:
return madvise_guard_install(madv_behavior);
case MADV_GUARD_REMOVE:
@@ -1589,7 +1587,6 @@ static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
vma_end_read(vma);
goto take_mmap_read_lock;
}
- madv_behavior->prev = vma; /* Not currently required. */
madv_behavior->vma = vma;
return true;
@@ -1617,7 +1614,7 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
unsigned long last_end = range->end;
int unmapped_error = 0;
int error;
- struct vm_area_struct *vma;
+ struct vm_area_struct *prev, *vma;
/*
* If VMA read lock is supported, apply madvise to a single VMA
@@ -1630,24 +1627,23 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
return error;
}
- /*
- * If the interval [start,end) covers some unmapped address
- * ranges, just ignore them, but return -ENOMEM at the end.
- * - different from the way of handling in mlock etc.
- */
- vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
+ vma = find_vma_prev(mm, range->start, &prev);
if (vma && range->start > vma->vm_start)
- madv_behavior->prev = vma;
+ prev = vma;
for (;;) {
- struct vm_area_struct *prev;
-
/* Still start < end. */
if (!vma)
return -ENOMEM;
/* Here start < (last_end|vma->vm_end). */
if (range->start < vma->vm_start) {
+ /*
+ * This indicates a gap between VMAs in the input
+ * range. This does not cause the operation to abort,
+ * rather we simply return -ENOMEM to indicate that this
+ * has happened, but carry on.
+ */
unmapped_error = -ENOMEM;
range->start = vma->vm_start;
if (range->start >= last_end)
@@ -1658,21 +1654,28 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
range->end = min(vma->vm_end, last_end);
/* Here vma->vm_start <= range->start < range->end <= (last_end|vma->vm_end). */
+ madv_behavior->prev = prev;
madv_behavior->vma = vma;
error = madvise_vma_behavior(madv_behavior);
if (error)
return error;
- prev = madv_behavior->prev;
+ if (madv_behavior->lock_dropped) {
+ /* We dropped the mmap lock, we can't ref the VMA. */
+ prev = NULL;
+ vma = NULL;
+ madv_behavior->lock_dropped = false;
+ } else {
+ prev = vma;
+ vma = madv_behavior->vma;
+ }
- range->start = range->end;
- if (prev && range->start < prev->vm_end)
- range->start = prev->vm_end;
- if (range->start >= last_end)
+ if (vma && range->end < vma->vm_end)
+ range->end = vma->vm_end;
+ if (range->end >= last_end)
break;
- if (prev)
- vma = find_vma(mm, prev->vm_end);
- else /* madvise_remove dropped mmap_lock */
- vma = find_vma(mm, range->start);
+
+ vma = find_vma(mm, vma ? vma->vm_end : range->end);
+ range->start = range->end;
}
return unmapped_error;
--
2.49.0
* Re: [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 15:33 ` [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
@ 2025-06-20 16:57 ` Zi Yan
2025-06-20 17:12 ` SeongJae Park
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-06-20 16:57 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 20 Jun 2025, at 11:33, Lorenzo Stoakes wrote:
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> We are able to remove the visitor pattern by handling the anon_vma_name
> setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This
> uses a negative value, so there is no prospect of it ever being added as
> a UAPI flag.
>
> Additionally, the madvise_behavior_valid() check ensures that
> user-specified behaviours are strictly only those we permit which, of
> course, this flag will be excluded from.
>
> We are able to propagate the anon_vma_name object through use of the
> madvise_behavior helper struct.
>
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes; however, this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 166 +++++++++++++++++++++++++--------------------------
> 1 file changed, 83 insertions(+), 83 deletions(-)
>
LGTM. Thanks. Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
@ 2025-06-20 16:59 ` Zi Yan
2025-06-20 17:25 ` SeongJae Park
2025-06-23 22:42 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-06-20 16:59 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 20 Jun 2025, at 11:33, Lorenzo Stoakes wrote:
> There's no need to thread a pointer to the mm_struct nor have different
> function signatures for each behaviour; instead, store state in the struct
> madvise_behavior object consistently and use it for all madvise() actions.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 110 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 57 insertions(+), 53 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
@ 2025-06-20 17:02 ` Zi Yan
2025-06-20 17:33 ` SeongJae Park
2025-06-24 1:05 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-06-20 17:02 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 20 Jun 2025, at 11:33, Lorenzo Stoakes wrote:
> Rather than updating start and a confusing local parameter 'tmp' in
> madvise_walk_vmas(), instead store the current range being operated upon in
> the struct madvise_behavior helper object in a range pair and use this
> consistently in all operations.
>
> This makes it clearer what is going on and opens the door to further
> cleanup now we store state regarding what is currently being operated upon
> here.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 103 ++++++++++++++++++++++++++++-----------------------
> 1 file changed, 57 insertions(+), 46 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 15:33 ` [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
2025-06-20 16:57 ` Zi Yan
@ 2025-06-20 17:12 ` SeongJae Park
2025-06-23 22:38 ` Barry Song
2025-06-25 7:43 ` David Hildenbrand
3 siblings, 0 replies; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 17:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:01 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> We are able to remove the visitor pattern by handling the anon_vma_name
> setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This
> uses a negative value, so there is no prospect of it ever being added as
> a UAPI flag.
>
> Additionally, the madvise_behavior_valid() check ensures that
> user-specified behaviours are strictly only those we permit which, of
> course, this flag will be excluded from.
>
> We are able to propagate the anon_vma_name object through use of the
> madvise_behavior helper struct.
>
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes; however, this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
Nice work!
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-20 15:33 ` [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
@ 2025-06-20 17:13 ` Zi Yan
2025-06-20 18:10 ` SeongJae Park
2025-06-24 13:16 ` Lorenzo Stoakes
2 siblings, 0 replies; 26+ messages in thread
From: Zi Yan @ 2025-06-20 17:13 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 20 Jun 2025, at 11:33, Lorenzo Stoakes wrote:
> The madvise code has for the longest time had very confusing code around
> the 'prev' VMA pointer passed around various functions which, in all cases
> except madvise_update_vma(), is unused and instead simply updated as soon
> as the function is invoked.
>
> To compound the confusion, the prev pointer is also used to indicate to the
> caller that the mmap lock has been dropped and that we can therefore not
> safely access the end of the current VMA (which might have been updated by
> madvise_update_vma()).
>
> Clear up this confusion by not setting prev = vma anywhere except in
> madvise_walk_vmas(), updating all references to prev (which will always be
> equal to vma after madvise_vma_behavior() is invoked), and adding a flag to
> indicate that the lock has been dropped, making this explicit.
>
> Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
> simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
> not appropriate here, but we leave existing code as-is).
>
> We finally adjust the madvise_walk_vmas() logic to be a little clearer -
> delaying the assignment of the end of the range to the start of the new
> range until the last moment and handling the lock being dropped scenario
> immediately.
>
> Additionally add some explanatory comments.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/huge_mm.h | 9 +++--
> mm/khugepaged.c | 9 ++---
> mm/madvise.c | 77 +++++++++++++++++++++--------------------
> 3 files changed, 47 insertions(+), 48 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2 0/5] madvise cleanup
2025-06-20 15:33 [PATCH v2 0/5] madvise cleanup Lorenzo Stoakes
` (4 preceding siblings ...)
2025-06-20 15:33 ` [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
@ 2025-06-20 17:21 ` SeongJae Park
2025-06-20 17:33 ` Lorenzo Stoakes
5 siblings, 1 reply; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 17:21 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:00 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> This is a series of patches that helps address a number of historic
> problems in the madvise() implementation:
>
> * Eliminate the visitor pattern and having the code which is implemented
> for both the anon_vma_name implementation and ordinary madvise()
> operations use the same madvise_vma_behavior() implementation.
>
> * Thread state through the madvise_behavior state object - this object,
> very usefully introduced by SJ, is already used to transmit state through
> operations. This series extends this by having all madvise() operations
> use this, including anon_vma_name.
>
> * Thread range, VMA state through madvise_behavior - This helps avoid a lot
> of the confusing code around range and VMA state and again keeps things
> consistent and with a single 'source of truth'.
>
> * Addressing the very strange behaviour around the passed around struct
> vm_area_struct **prev pointer - all read-only users do absolutely nothing
> with the prev pointer. The only function that uses it is
> madvise_update_vma(), and in all cases prev is always reset to
> VMA.
>
> > Fix this by no longer having anything but madvise_update_vma() reference
> prev, and having madvise_walk_vmas() update prev in each
> instance. Additionally make it clear that the meaningful change in vma
> state is when madvise_update_vma() potentially merges a VMA, so
> explicitly retrieve the VMA in this case.
>
> * Update and clarify the madvise_walk_vmas() function - this is a source of
> a great deal of confusion, so simplify, stop using prev = NULL to signify
> that the mmap lock has been dropped (!) and make that explicit, and add
> some comments to explain what's going on.
>
> v2:
> * Propagated tags (thanks everyone!)
> * Don't separate out __MADV_SET_ANON_VMA_NAME and __MADV_SET_CLEAR_VMA_NAME,
FWIW. If this cover letter is added to the first patch, like Andrew usually
does, as-is, checkpatch.pl may warn like below.
WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
Obviously no real problem and I don't really care. I just found this since my
tool (hkml) runs checkpatch.pl after adding the cover letter to the first
patch, and hence this is just FWIW.
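For reference, the warning can be reproduced by running checkpatch directly
over the assembled patch, along the lines of the below (the patch file name
here is only illustrative):

    $ ./scripts/checkpatch.pl 0001-mm-madvise-remove-the-visitor-pattern.patch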
Thanks,
SJ
[...]
* Re: [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
2025-06-20 16:59 ` Zi Yan
@ 2025-06-20 17:25 ` SeongJae Park
2025-06-23 22:42 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 17:25 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:02 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> There's no need to thread a pointer to the mm_struct nor have different
> function signatures for each behaviour; instead, store state in the struct
> madvise_behavior object consistently and use it for all madvise() actions.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH v2 0/5] madvise cleanup
2025-06-20 17:21 ` [PATCH v2 0/5] madvise cleanup SeongJae Park
@ 2025-06-20 17:33 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 17:33 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Lance Yang,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 10:21:08AM -0700, SeongJae Park wrote:
> On Fri, 20 Jun 2025 16:33:00 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > This is a series of patches that helps address a number of historic
> > problems in the madvise() implementation:
> >
> > * Eliminate the visitor pattern and having the code which is implemented
> > for both the anon_vma_name implementation and ordinary madvise()
> > operations use the same madvise_vma_behavior() implementation.
> >
> > * Thread state through the madvise_behavior state object - this object,
> > very usefully introduced by SJ, is already used to transmit state through
> > operations. This series extends this by having all madvise() operations
> > use this, including anon_vma_name.
> >
> > * Thread range, VMA state through madvise_behavior - This helps avoid a lot
> > of the confusing code around range and VMA state and again keeps things
> > consistent and with a single 'source of truth'.
> >
> > * Addressing the very strange behaviour around the passed around struct
> > vm_area_struct **prev pointer - all read-only users do absolutely nothing
> > with the prev pointer. The only function that uses it is
> > madvise_update_vma(), and in all cases prev is always reset to
> > VMA.
> >
> > > Fix this by no longer having anything but madvise_update_vma() reference
> > prev, and having madvise_walk_vmas() update prev in each
> > instance. Additionally make it clear that the meaningful change in vma
> > state is when madvise_update_vma() potentially merges a VMA, so
> > explicitly retrieve the VMA in this case.
> >
> > * Update and clarify the madvise_walk_vmas() function - this is a source of
> > a great deal of confusion, so simplify, stop using prev = NULL to signify
> > that the mmap lock has been dropped (!) and make that explicit, and add
> > some comments to explain what's going on.
> >
> > v2:
> > * Propagated tags (thanks everyone!)
> > * Don't separate out __MADV_SET_ANON_VMA_NAME and __MADV_SET_CLEAR_VMA_NAME,
>
> FWIW. If this cover letter is added to the first patch, like Andrew usually
> does, as-is, checkpatch.pl may warn like below.
>
> WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
>
> Obviously no real problem and I don't really care. I just found this since my
> tool (hkml) runs checkpatch.pl after adding the cover letter to the first
> patch, and hence this is just FWIW.
Yeah, sorry, this is because I didn't 'compress' the v2 revision log, which
won't be reproduced in patches anyway so should be ok :>)
>
>
> Thanks,
> SJ
>
> [...]
* Re: [PATCH v2 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
2025-06-20 17:02 ` Zi Yan
@ 2025-06-20 17:33 ` SeongJae Park
2025-06-24 1:05 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 17:33 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:03 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Rather than updating start and a confusing local parameter 'tmp' in
> madvise_walk_vmas(), instead store the current range being operated upon in
> the struct madvise_behavior helper object in a range pair and use this
> consistently in all operations.
>
> This makes it clearer what is going on and opens the door to further
> cleanup now we store state regarding what is currently being operated upon
> here.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 15:33 ` [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
@ 2025-06-20 17:56 ` SeongJae Park
2025-06-20 18:01 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 17:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:04 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> Doing so means we can get rid of all the weird struct vm_area_struct **prev
> stuff; everything becomes consistent and, in future, if we want to make
> changes to behaviour there's a single place where all relevant state is
> stored.
>
> This also allows us to update try_vma_read_lock() to be a little more
> succinct and set up state for us, as well as cleaning up
> madvise_update_vma().
>
> We also update the debug assertion prior to madvise_update_vma() to assert
> that this is a write operation as correctly pointed out by Barry in the
> relevant thread.
>
> We can't reasonably update the madvise functions that live outside of
> mm/madvise.c so we leave those as-is.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Found a very trivial nit below. Other than that,
Reviewed-by: SeongJae Park <sj@kernel.org>
[...]
> @@ -1607,23 +1615,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> struct madvise_behavior_range *range = &madv_behavior->range;
> /* range is updated to span each VMA, so store end of entire range. */
> unsigned long last_end = range->end;
> - struct vm_area_struct *vma;
> - struct vm_area_struct *prev;
> int unmapped_error = 0;
> int error;
> + struct vm_area_struct *vma;
A very trivial nit. We could just keep old 'struct vm_area_struct *vma'
declaration.
Thanks,
SJ
[...]
* Re: [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 17:56 ` SeongJae Park
@ 2025-06-20 18:01 ` Lorenzo Stoakes
2025-06-20 18:12 ` SeongJae Park
0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 18:01 UTC (permalink / raw)
To: SeongJae Park
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Lance Yang,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 10:56:22AM -0700, SeongJae Park wrote:
> On Fri, 20 Jun 2025 16:33:04 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > Doing so means we can get rid of all the weird struct vm_area_struct **prev
> > stuff; everything becomes consistent and, in future, if we want to make
> > changes to behaviour there's a single place where all relevant state is
> > stored.
> >
> > This also allows us to update try_vma_read_lock() to be a little more
> > succinct and set up state for us, as well as cleaning up
> > madvise_update_vma().
> >
> > We also update the debug assertion prior to madvise_update_vma() to assert
> > that this is a write operation as correctly pointed out by Barry in the
> > relevant thread.
> >
> > We can't reasonably update the madvise functions that live outside of
> > mm/madvise.c so we leave those as-is.
> >
> > Acked-by: Zi Yan <ziy@nvidia.com>
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Found a very trivial nit below. Other than that,
>
> Reviewed-by: SeongJae Park <sj@kernel.org>
Thanks! And also for other tags of course :)
>
> [...]
> > @@ -1607,23 +1615,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> > struct madvise_behavior_range *range = &madv_behavior->range;
> > /* range is updated to span each VMA, so store end of entire range. */
> > unsigned long last_end = range->end;
> > - struct vm_area_struct *vma;
> > - struct vm_area_struct *prev;
> > int unmapped_error = 0;
> > int error;
> > + struct vm_area_struct *vma;
>
> A very trivial nit. We could just keep old 'struct vm_area_struct *vma'
> declaration.
Haha yes that's a good point! I think probably it's because I initially thought
I should drop both, then decided to keep vma around.
Anyway, I'll fix up if I need to respin but I don't think it's necessary for a
fix-patch at this point :>)
>
>
> Thanks,
> SJ
>
> [...]
* Re: [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-20 15:33 ` [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
2025-06-20 17:13 ` Zi Yan
@ 2025-06-20 18:10 ` SeongJae Park
2025-06-24 13:16 ` Lorenzo Stoakes
2 siblings, 0 replies; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 18:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 16:33:05 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> The madvise code has for the longest time had very confusing code around
> the 'prev' VMA pointer passed around various functions which, in all cases
> except madvise_update_vma(), is unused and instead simply updated as soon
> as the function is invoked.
>
> To compound the confusion, the prev pointer is also used to indicate to the
> caller that the mmap lock has been dropped and that we can therefore not
> safely access the end of the current VMA (which might have been updated by
> madvise_update_vma()).
>
> Clear up this confusion by not setting prev = vma anywhere except in
> madvise_walk_vmas(), updating all references to prev (which will always be
> equal to vma after madvise_vma_behavior() is invoked), and adding a flag to
> indicate that the lock has been dropped, making this explicit.
>
> Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
> simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
> not appropriate here, but we leave existing code as-is).
>
> We finally adjust the madvise_walk_vmas() logic to be a little clearer -
> delaying the assignment of the end of the range to the start of the new
> range until the last moment and handling the lock being dropped scenario
> immediately.
>
> Additionally add some explanatory comments.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH v2 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 18:01 ` Lorenzo Stoakes
@ 2025-06-20 18:12 ` SeongJae Park
0 siblings, 0 replies; 26+ messages in thread
From: SeongJae Park @ 2025-06-20 18:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Vlastimil Babka, Jann Horn, linux-mm, linux-kernel,
Lance Yang, Suren Baghdasaryan
On Fri, 20 Jun 2025 19:01:19 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jun 20, 2025 at 10:56:22AM -0700, SeongJae Park wrote:
> > On Fri, 20 Jun 2025 16:33:04 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
[...]
> Anyway, I'll fix up if I need to respin but I don't think it's necessary for a
> fix-patch at this point :>)
Makes sense, and thank you for this nice work! :)
Thanks,
SJ
[...]
* Re: [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 15:33 ` [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
2025-06-20 16:57 ` Zi Yan
2025-06-20 17:12 ` SeongJae Park
@ 2025-06-23 22:38 ` Barry Song
2025-06-25 7:43 ` David Hildenbrand
3 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-06-23 22:38 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Lance Yang,
SeongJae Park, Suren Baghdasaryan
On Sat, Jun 21, 2025 at 3:33 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> We are able to remove the visitor pattern by handling the anon_vma_name
> setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This
> uses a negative value, so there is no prospect of it ever being added as
> a UAPI flag.
>
> Additionally, the madvise_behavior_valid() check ensures that
> user-specified behaviours are strictly only those we permit which, of
> course, this flag will be excluded from.
>
> We are able to propagate the anon_vma_name object through use of the
> madvise_behavior helper struct.
>
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes; however, this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
LGTM,
Reviewed-by: Barry Song <baohua@kernel.org>
>
> /* We cannot provide prev in this lock mode. */
> - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> - anon_name = anon_vma_name(vma);
> - anon_vma_name_get(anon_name);
> + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
> +
> + /*
I’m wondering whether it would be more accurate to:
VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
Thanks
Barry
* Re: [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
2025-06-20 16:59 ` Zi Yan
2025-06-20 17:25 ` SeongJae Park
@ 2025-06-23 22:42 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-06-23 22:42 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Lance Yang,
SeongJae Park, Suren Baghdasaryan
On Sat, Jun 21, 2025 at 3:33 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> There's no need to thread a pointer to the mm_struct nor have different
> function signatures for each behaviour; instead, store state in the struct
> madvise_behavior object consistently and use it for all madvise() actions.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 110 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 57 insertions(+), 53 deletions(-)
Reviewed-by: Barry Song <baohua@kernel.org>
Best Regards
Barry
* Re: [PATCH v2 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 15:33 ` [PATCH v2 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
2025-06-20 17:02 ` Zi Yan
2025-06-20 17:33 ` SeongJae Park
@ 2025-06-24 1:05 ` Barry Song
2 siblings, 0 replies; 26+ messages in thread
From: Barry Song @ 2025-06-24 1:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Vlastimil Babka, Jann Horn, linux-mm, linux-kernel, Lance Yang,
SeongJae Park, Suren Baghdasaryan
On Sat, Jun 21, 2025 at 3:33 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Rather than updating start and a confusing local parameter 'tmp' in
> madvise_walk_vmas(), instead store the current range being operated upon in
> the struct madvise_behavior helper object in a range pair and use this
> consistently in all operations.
>
> This makes it clearer what is going on and opens the door to further
> cleanup now we store state regarding what is currently being operated upon
> here.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 103 ++++++++++++++++++++++++++++-----------------------
> 1 file changed, 57 insertions(+), 46 deletions(-)
>
Reviewed-by: Barry Song <baohua@kernel.org>
Best Regards
Barry
* Re: [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-20 15:33 ` [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
2025-06-20 17:13 ` Zi Yan
2025-06-20 18:10 ` SeongJae Park
@ 2025-06-24 13:16 ` Lorenzo Stoakes
2025-06-24 17:57 ` Vlastimil Babka
2 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-24 13:16 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Hi Andrew,
stress-ng picked up a very subtle bug here. Please apply this fix-patch to
correct it!
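For anyone wanting to reproduce this, the stressor in question can be run
along these lines (the worker count and duration here are just a guess):

    $ stress-ng --madvise 8 --timeout 60s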
Cheers, Lorenzo
----8<----
From a04de12e1deb50e708392b0e716612f26c6b386a Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Tue, 24 Jun 2025 14:14:19 +0100
Subject: [PATCH] mm/madvise: fix very subtle bug
With thanks to stress-ng --madvise :)
vma may have been updated (by the modify call in madvise_update_vma()), so
we can't simply assign prev = vma; we must first re-read vma from
madv_behavior->vma to account for this, before setting prev = vma.
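That is, the corrected loop epilogue must read the (possibly replaced) VMA
back out of the state object before deriving prev from it (a condensed
sketch; see the diff below for the real context):

        error = madvise_vma_behavior(madv_behavior);
        ...
        /*
         * madvise_update_vma() may have merged the VMA, with the result
         * stored in madv_behavior->vma, so refresh the stale local
         * pointer before prev is derived from it.
         */
        vma = madv_behavior->vma;
        prev = vma;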
---
mm/madvise.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 4491bf080f55..c467ee42596f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1665,8 +1665,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
vma = NULL;
madv_behavior->lock_dropped = false;
} else {
- prev = vma;
vma = madv_behavior->vma;
+ prev = vma;
}
if (vma && range->end < vma->vm_end)
--
2.50.0
* Re: [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-24 13:16 ` Lorenzo Stoakes
@ 2025-06-24 17:57 ` Vlastimil Babka
2025-06-24 18:01 ` Lorenzo Stoakes
0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2025-06-24 17:57 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/24/25 15:16, Lorenzo Stoakes wrote:
> Hi Andrew,
>
> stress-ng picked up a very subtle bug here. Please apply this fix-patch to
> correct it!
>
> Cheers, Lorenzo
>
> ----8<----
> From a04de12e1deb50e708392b0e716612f26c6b386a Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Tue, 24 Jun 2025 14:14:19 +0100
> Subject: [PATCH] mm/madvise: fix very subtle bug
>
> With thanks to stress-ng --madvise :)
>
> vma may have been updated (by the modify call in madvise_update_vma()), so
> we can't simply assign prev = vma; we must first re-read vma from
> madv_behavior->vma to account for this, before setting prev = vma.
So glad we eliminated very confusing manipulation of prev VMA :) sigh...
LGTM (not worth much it seems...)
> ---
> mm/madvise.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4491bf080f55..c467ee42596f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1665,8 +1665,8 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> vma = NULL;
> madv_behavior->lock_dropped = false;
> } else {
> - prev = vma;
> vma = madv_behavior->vma;
> + prev = vma;
> }
>
> if (vma && range->end < vma->vm_end)
> --
> 2.50.0
* Re: [PATCH v2 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-24 17:57 ` Vlastimil Babka
@ 2025-06-24 18:01 ` Lorenzo Stoakes
0 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-06-24 18:01 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Tue, Jun 24, 2025 at 07:57:09PM +0200, Vlastimil Babka wrote:
> On 6/24/25 15:16, Lorenzo Stoakes wrote:
> > vma may have been updated (by the modify call in madvise_update_vma()), so
> > we can't simply assign prev = vma; we must first re-read vma from
> > madv_behavior->vma to account for this, before setting prev = vma.
>
> So glad we eliminated very confusing manipulation of prev VMA :) sigh...
>
> LGTM (not worth much it seems...)
Appreciated thanks! (this was really quite subtle and we all missed it, so don't
worry :)
* Re: [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 15:33 ` [PATCH v2 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
` (2 preceding siblings ...)
2025-06-23 22:38 ` Barry Song
@ 2025-06-25 7:43 ` David Hildenbrand
3 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-25 7:43 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Vlastimil Babka, Jann Horn, linux-mm,
linux-kernel, Lance Yang, SeongJae Park, Suren Baghdasaryan
On 20.06.25 17:33, Lorenzo Stoakes wrote:
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> We are able to remove the visitor pattern by handling the anon_vma_name
> setting logic via an internal madvise flag - __MADV_SET_ANON_VMA_NAME. This
> uses a negative value, so there is no prospect of it ever being added as
> a UAPI flag.
>
> Additionally, the madvise_behavior_valid() check ensures that
> user-specified behaviours are strictly only those we permit which, of
> course, this flag will be excluded from.
>
> We are able to propagate the anon_vma_name object through use of the
> madvise_behavior helper struct.
>
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes; however, this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
Oh, that is very nice
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb