* [PATCH 0/5] madvise cleanup
@ 2025-06-19 20:26 Lorenzo Stoakes
2025-06-19 20:26 ` [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
` (4 more replies)
0 siblings, 5 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
This is a series of patches that helps address a number of historic
problems in the madvise() implementation:
* Eliminate the visitor pattern, so that both the anon_vma_name
implementation and ordinary madvise() operations use the same
madvise_vma_behavior() implementation.
* Thread state through the madvise_behavior state object - this object,
very usefully introduced by SJ, is already used to transmit state through
operations. This series extends this by having all madvise() operations
use this, including anon_vma_name.
* Thread range, VMA state through madvise_behavior - This helps avoid a lot
of the confusing code around range and VMA state and again keeps things
consistent and with a single 'source of truth'.
* Address the very strange behaviour around the passed-around struct
vm_area_struct **prev pointer - all read-only users do absolutely nothing
with the prev pointer. The only function that uses it is
madvise_update_vma(), and in all cases prev is always reset to the
VMA.
Fix this by no longer having anything but madvise_update_vma() reference
prev, and having madvise_walk_vmas() update prev in each
instance. Additionally make it clear that the meaningful change in vma
state is when madvise_update_vma() potentially merges a VMA, so
explicitly retrieve the VMA in this case.
* Update and clarify the madvise_walk_vmas() function - this is a source of
a great deal of confusion, so simplify it, stop using prev = NULL to signify
that the mmap lock has been dropped (!), make that explicit, and add
some comments to explain what's going on.
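The state-threading direction the series takes can be sketched in miniature:
rather than juggling start/end/tmp locals and out-parameters, a single state
object carries the current range through the walk, updated per VMA. This is a
hypothetical, simplified model - the struct and function names below are
illustrative stubs, not the kernel's types.

```c
#include <assert.h>

struct behavior_range { unsigned long start, end; };

/* Mirrors the role of struct madvise_behavior: one object holds all
 * walk state, acting as the single source of truth. */
struct behavior_state {
	int behavior;                  /* MADV_*-style opcode */
	struct behavior_range range;   /* range currently being applied */
	int visited;                   /* demo: per-"VMA" applications */
};

/* One code path handles each sub-range, reading everything from state. */
static int apply_one(struct behavior_state *state)
{
	state->visited++;
	return 0;
}

/* The walker clamps state->range to each "VMA" chunk in turn, rather
 * than tracking a separate local like 'tmp'. */
static int walk(struct behavior_state *state, unsigned long chunk)
{
	unsigned long end = state->range.end;

	while (state->range.start < end) {
		state->range.end = state->range.start + chunk < end ?
				   state->range.start + chunk : end;
		int err = apply_one(state);
		if (err)
			return err;
		state->range.start = state->range.end;
		state->range.end = end;
	}
	return 0;
}
```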
Lorenzo Stoakes (5):
mm/madvise: remove the visitor pattern and thread anon_vma state
mm/madvise: thread mm_struct through madvise_behavior
mm/madvise: thread VMA range state through madvise_behavior
mm/madvise: thread all madvise state through madv_behavior
mm/madvise: eliminate very confusing manipulation of prev VMA
include/linux/huge_mm.h | 9 +-
mm/khugepaged.c | 9 +-
mm/madvise.c | 569 +++++++++++++++++++++-------------------
3 files changed, 309 insertions(+), 278 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-19 20:26 [PATCH 0/5] madvise cleanup Lorenzo Stoakes
@ 2025-06-19 20:26 ` Lorenzo Stoakes
2025-06-20 1:32 ` Zi Yan
2025-06-20 13:05 ` Vlastimil Babka
2025-06-19 20:26 ` [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
` (3 subsequent siblings)
4 siblings, 2 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Now that we have the madvise_behavior helper struct, we no longer need to
mess around with void * pointers in order to propagate anon_vma_name, which
means we can get rid of the confusing and inconsistent visitor pattern
implementation in madvise_vma_anon_name().
This means we now have a single state object that threads through most of
madvise()'s logic and a single code path which executes the majority of
madvise() behaviour (we maintain separate logic for failure injection and
memory population for the time being).
Note that users cannot inadvertently trigger the new internal
__MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME behaviour values, as
madvise_behavior_valid() would reject them.
Doing this results in a can_modify_vma_madv() check being performed for
anonymous VMA name changes; however, this causes no issues, as this
operation is not prohibited.
We can also then reuse more code and drop the redundant
madvise_vma_anon_name() function altogether.
Additionally separate out behaviours that update VMAs from those that do
not.
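The core idea - internal behaviour values replacing the void *-based visitor -
can be modelled minimally. This is a hypothetical sketch mirroring the patch's
__MADV_SET_ANON_VMA_NAME / __MADV_CLEAR_ANON_VMA_NAME and its
is_anon_vma_name() helper; behavior_valid() below is an illustrative stand-in
for madvise_behavior_valid(), not the real check.

```c
#include <assert.h>

/* Internal-only behaviour values, negative so they can never collide
 * with userspace-visible MADV_* values (which are non-negative). */
#define SET_ANON_NAME   (-1)
#define CLEAR_ANON_NAME (-2)

/* Mirrors is_anon_vma_name() in the patch: lets the common
 * madvise_vma_behavior() path special-case the naming operations. */
static int is_anon_name_op(int behavior)
{
	switch (behavior) {
	case SET_ANON_NAME:
	case CLEAR_ANON_NAME:
		return 1;
	default:
		return 0;
	}
}

/* Stand-in for madvise_behavior_valid(): userspace-supplied values are
 * non-negative, so the internal negatives are unreachable from the
 * madvise() syscall itself. */
static int behavior_valid(int behavior)
{
	return behavior >= 0;
}
```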
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 167 +++++++++++++++++++++++++++------------------------
1 file changed, 90 insertions(+), 77 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 070132f9842b..9dd935d64692 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -37,6 +37,9 @@
#include "internal.h"
#include "swap.h"
+#define __MADV_SET_ANON_VMA_NAME (-1)
+#define __MADV_CLEAR_ANON_VMA_NAME (-2)
+
/*
* Maximum number of attempts we make to install guard pages before we give up
* and return -ERESTARTNOINTR to have userspace try again.
@@ -59,9 +62,13 @@ struct madvise_behavior {
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
+ struct anon_vma_name *anon_name;
};
#ifdef CONFIG_ANON_VMA_NAME
+static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct madvise_behavior *madv_behavior);
+
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
struct anon_vma_name *anon_name;
@@ -112,6 +119,48 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
return 0;
}
+
+int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
+ unsigned long len_in, struct anon_vma_name *anon_name)
+{
+ unsigned long end;
+ unsigned long len;
+ struct madvise_behavior madv_behavior = {
+ .lock_mode = MADVISE_MMAP_WRITE_LOCK,
+ .anon_name = anon_name,
+ };
+
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return -EINVAL;
+
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+
+ if (end == start)
+ return 0;
+
+ madv_behavior.behavior =
+ anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
+
+ return madvise_walk_vmas(mm, start, end, &madv_behavior);
+}
+
+static bool is_anon_vma_name(int behavior)
+{
+ switch (behavior) {
+ case __MADV_SET_ANON_VMA_NAME:
+ case __MADV_CLEAR_ANON_VMA_NAME:
+ return true;
+ default:
+ return false;
+ }
+}
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *anon_name)
@@ -121,6 +170,11 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
return 0;
}
+
+static bool is_anon_vma_name(int behavior)
+{
+ return false;
+}
#endif /* CONFIG_ANON_VMA_NAME */
/*
* Update the vm_flags on region of a vma, splitting it or merging it as
@@ -1252,13 +1306,12 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
- void *behavior_arg)
+ struct madvise_behavior *madv_behavior)
{
- struct madvise_behavior *arg = behavior_arg;
- int behavior = arg->behavior;
- int error;
- struct anon_vma_name *anon_name;
+ int behavior = madv_behavior->behavior;
+ struct anon_vma_name *anon_name = madv_behavior->anon_name;
vm_flags_t new_flags = vma->vm_flags;
+ int error;
if (unlikely(!can_modify_vma_madv(vma, behavior)))
return -EPERM;
@@ -1275,7 +1328,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end, arg);
+ return madvise_dontneed_free(vma, prev, start, end,
+ madv_behavior);
+ case MADV_COLLAPSE:
+ return madvise_collapse(vma, prev, start, end);
+ case MADV_GUARD_INSTALL:
+ return madvise_guard_install(vma, prev, start, end);
+ case MADV_GUARD_REMOVE:
+ return madvise_guard_remove(vma, prev, start, end);
+
+ /* The below behaviours update VMAs via madvise_update_vma(). */
+
case MADV_NORMAL:
new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
break;
@@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
if (error)
goto out;
break;
- case MADV_COLLAPSE:
- return madvise_collapse(vma, prev, start, end);
- case MADV_GUARD_INSTALL:
- return madvise_guard_install(vma, prev, start, end);
- case MADV_GUARD_REMOVE:
- return madvise_guard_remove(vma, prev, start, end);
+ case __MADV_SET_ANON_VMA_NAME:
+ case __MADV_CLEAR_ANON_VMA_NAME:
+ /* Only anonymous mappings can be named */
+ if (vma->vm_file && !vma_is_anon_shmem(vma))
+ return -EBADF;
+ break;
}
/* We cannot provide prev in this lock mode. */
- VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
- anon_name = anon_vma_name(vma);
- anon_vma_name_get(anon_name);
+ VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
+
+ if (!is_anon_vma_name(behavior)) {
+ anon_name = anon_vma_name(vma);
+ anon_vma_name_get(anon_name);
+ }
error = madvise_update_vma(vma, prev, start, end, new_flags,
anon_name);
- anon_vma_name_put(anon_name);
+ if (!is_anon_vma_name(behavior))
+ anon_vma_name_put(anon_name);
out:
/*
@@ -1532,11 +1599,7 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
*/
static
int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior,
- void *arg,
- int (*visit)(struct vm_area_struct *vma,
- struct vm_area_struct **prev, unsigned long start,
- unsigned long end, void *arg))
+ unsigned long end, struct madvise_behavior *madv_behavior)
{
struct vm_area_struct *vma;
struct vm_area_struct *prev;
@@ -1548,11 +1611,12 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
* If VMA read lock is supported, apply madvise to a single VMA
* tentatively, avoiding walking VMAs.
*/
- if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
vma = try_vma_read_lock(mm, madv_behavior, start, end);
if (vma) {
prev = vma;
- error = visit(vma, &prev, start, end, arg);
+ error = madvise_vma_behavior(vma, &prev, start, end,
+ madv_behavior);
vma_end_read(vma);
return error;
}
@@ -1586,7 +1650,8 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = visit(vma, &prev, start, tmp, arg);
+ error = madvise_vma_behavior(vma, &prev, start, tmp,
+ madv_behavior);
if (error)
return error;
start = tmp;
@@ -1603,57 +1668,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
return unmapped_error;
}
-#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_vma_anon_name(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- void *anon_name)
-{
- int error;
-
- /* Only anonymous mappings can be named */
- if (vma->vm_file && !vma_is_anon_shmem(vma))
- return -EBADF;
-
- error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
- anon_name);
-
- /*
- * madvise() returns EAGAIN if kernel resources, such as
- * slab, are temporarily unavailable.
- */
- if (error == -ENOMEM)
- error = -EAGAIN;
- return error;
-}
-
-int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
- unsigned long len_in, struct anon_vma_name *anon_name)
-{
- unsigned long end;
- unsigned long len;
-
- if (start & ~PAGE_MASK)
- return -EINVAL;
- len = (len_in + ~PAGE_MASK) & PAGE_MASK;
-
- /* Check to see whether len was rounded up from small -ve to zero */
- if (len_in && !len)
- return -EINVAL;
-
- end = start + len;
- if (end < start)
- return -EINVAL;
-
- if (end == start)
- return 0;
-
- return madvise_walk_vmas(mm, start, end, NULL, anon_name,
- madvise_vma_anon_name);
-}
-#endif /* CONFIG_ANON_VMA_NAME */
-
-
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_lock for writing. Others, which simply traverse vmas, need
@@ -1845,8 +1859,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
if (is_madvise_populate(behavior))
error = madvise_populate(mm, start, end, behavior);
else
- error = madvise_walk_vmas(mm, start, end, madv_behavior,
- madv_behavior, madvise_vma_behavior);
+ error = madvise_walk_vmas(mm, start, end, madv_behavior);
blk_finish_plug(&plug);
return error;
}
--
2.49.0
* [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-19 20:26 [PATCH 0/5] madvise cleanup Lorenzo Stoakes
2025-06-19 20:26 ` [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
@ 2025-06-19 20:26 ` Lorenzo Stoakes
2025-06-20 1:40 ` Zi Yan
2025-06-20 13:19 ` Vlastimil Babka
2025-06-19 20:26 ` [PATCH 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
` (2 subsequent siblings)
4 siblings, 2 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
There's no need to thread a pointer to the mm_struct, nor to have different
function signatures for each behaviour; instead, store state in the struct
madvise_behavior object consistently and use it for all madvise() actions.
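The signature convergence this performs can be shown with illustrative stubs
(these are not the kernel's types): once mm lives inside the state object,
helpers no longer need it as a separate parameter.

```c
#include <assert.h>

struct mm { int id; };

/* Stand-in for struct madvise_behavior after this patch: the mm
 * pointer is carried inside the state object. */
struct behavior_state {
	struct mm *mm;
	int behavior;
};

/* Before (sketch): mm threaded alongside the state in every helper,
 * e.g. madvise_lock(mm, madv_behavior). */
static int do_lock_old(struct mm *mm, struct behavior_state *state)
{
	(void)state;
	return mm->id;
}

/* After (sketch): a single parameter carries everything,
 * e.g. madvise_lock(madv_behavior). */
static int do_lock_new(struct behavior_state *state)
{
	return state->mm->id;
}
```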
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 105 ++++++++++++++++++++++++++-------------------------
1 file changed, 54 insertions(+), 51 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9dd935d64692..47485653c2a1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@ enum madvise_lock_mode {
};
struct madvise_behavior {
+ struct mm_struct *mm;
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
@@ -66,8 +67,8 @@ struct madvise_behavior {
};
#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior);
+static int madvise_walk_vmas(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior);
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
@@ -126,6 +127,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
unsigned long end;
unsigned long len;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.lock_mode = MADVISE_MMAP_WRITE_LOCK,
.anon_name = anon_name,
};
@@ -148,7 +150,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
madv_behavior.behavior =
anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
- return madvise_walk_vmas(mm, start, end, &madv_behavior);
+ return madvise_walk_vmas(start, end, &madv_behavior);
}
static bool is_anon_vma_name(int behavior)
@@ -1010,10 +1012,11 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
-static long madvise_populate(struct mm_struct *mm, unsigned long start,
- unsigned long end, int behavior)
+static long madvise_populate(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
- const bool write = behavior == MADV_POPULATE_WRITE;
+ struct mm_struct *mm = madv_behavior->mm;
+ const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE;
int locked = 1;
long pages;
@@ -1422,15 +1425,14 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
/*
* Error injection support for memory error handling.
*/
-static int madvise_inject_error(int behavior,
- unsigned long start, unsigned long end)
+static int madvise_inject_error(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
unsigned long size;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
-
for (; start < end; start += size) {
unsigned long pfn;
struct page *page;
@@ -1448,7 +1450,7 @@ static int madvise_inject_error(int behavior,
*/
size = page_size(compound_head(page));
- if (behavior == MADV_SOFT_OFFLINE) {
+ if (madv_behavior->behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
pfn, start);
ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
@@ -1467,9 +1469,9 @@ static int madvise_inject_error(int behavior,
return 0;
}
-static bool is_memory_failure(int behavior)
+static bool is_memory_failure(struct madvise_behavior *madv_behavior)
{
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_HWPOISON:
case MADV_SOFT_OFFLINE:
return true;
@@ -1480,13 +1482,13 @@ static bool is_memory_failure(int behavior)
#else
-static int madvise_inject_error(int behavior,
- unsigned long start, unsigned long end)
+static int madvise_inject_error(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
return 0;
}
-static bool is_memory_failure(int behavior)
+static bool is_memory_failure(struct madvise_behavior *madv_behavior)
{
return false;
}
@@ -1598,9 +1600,10 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
* Must be called with the mmap_lock held for reading or writing.
*/
static
-int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
- unsigned long end, struct madvise_behavior *madv_behavior)
+int madvise_walk_vmas(unsigned long start, unsigned long end,
+ struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
struct vm_area_struct *prev;
unsigned long tmp;
@@ -1675,12 +1678,10 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
*/
static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
{
- int behavior = madv_behavior->behavior;
-
- if (is_memory_failure(behavior))
+ if (is_memory_failure(madv_behavior))
return MADVISE_NO_LOCK;
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_COLD:
@@ -1700,9 +1701,9 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
}
}
-static int madvise_lock(struct mm_struct *mm,
- struct madvise_behavior *madv_behavior)
+static int madvise_lock(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
switch (lock_mode) {
@@ -1724,9 +1725,10 @@ static int madvise_lock(struct mm_struct *mm,
return 0;
}
-static void madvise_unlock(struct mm_struct *mm,
- struct madvise_behavior *madv_behavior)
+static void madvise_unlock(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
+
switch (madv_behavior->lock_mode) {
case MADVISE_NO_LOCK:
return;
@@ -1756,11 +1758,10 @@ static bool madvise_batch_tlb_flush(int behavior)
}
}
-static void madvise_init_tlb(struct madvise_behavior *madv_behavior,
- struct mm_struct *mm)
+static void madvise_init_tlb(struct madvise_behavior *madv_behavior)
{
if (madvise_batch_tlb_flush(madv_behavior->behavior))
- tlb_gather_mmu(madv_behavior->tlb, mm);
+ tlb_gather_mmu(madv_behavior->tlb, madv_behavior->mm);
}
static void madvise_finish_tlb(struct madvise_behavior *madv_behavior)
@@ -1815,9 +1816,9 @@ static bool madvise_should_skip(unsigned long start, size_t len_in,
return false;
}
-static bool is_madvise_populate(int behavior)
+static bool is_madvise_populate(struct madvise_behavior *madv_behavior)
{
- switch (behavior) {
+ switch (madv_behavior->behavior) {
case MADV_POPULATE_READ:
case MADV_POPULATE_WRITE:
return true;
@@ -1841,25 +1842,26 @@ static inline unsigned long get_untagged_addr(struct mm_struct *mm,
untagged_addr_remote(mm, start);
}
-static int madvise_do_behavior(struct mm_struct *mm,
- unsigned long start, size_t len_in,
+static int madvise_do_behavior(unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
{
- int behavior = madv_behavior->behavior;
struct blk_plug plug;
unsigned long end;
int error;
- if (is_memory_failure(behavior))
- return madvise_inject_error(behavior, start, start + len_in);
- start = get_untagged_addr(mm, start);
+ if (is_memory_failure(madv_behavior)) {
+ end = start + len_in;
+ return madvise_inject_error(start, end, madv_behavior);
+ }
+
+ start = get_untagged_addr(madv_behavior->mm, start);
end = start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
- if (is_madvise_populate(behavior))
- error = madvise_populate(mm, start, end, behavior);
+ if (is_madvise_populate(madv_behavior))
+ error = madvise_populate(start, end, madv_behavior);
else
- error = madvise_walk_vmas(mm, start, end, madv_behavior);
+ error = madvise_walk_vmas(start, end, madv_behavior);
blk_finish_plug(&plug);
return error;
}
@@ -1941,19 +1943,20 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
int error;
struct mmu_gather tlb;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.behavior = behavior,
.tlb = &tlb,
};
if (madvise_should_skip(start, len_in, behavior, &error))
return error;
- error = madvise_lock(mm, &madv_behavior);
+ error = madvise_lock(&madv_behavior);
if (error)
return error;
- madvise_init_tlb(&madv_behavior, mm);
- error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
+ madvise_init_tlb(&madv_behavior);
+ error = madvise_do_behavior(start, len_in, &madv_behavior);
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
return error;
}
@@ -1971,16 +1974,17 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
size_t total_len;
struct mmu_gather tlb;
struct madvise_behavior madv_behavior = {
+ .mm = mm,
.behavior = behavior,
.tlb = &tlb,
};
total_len = iov_iter_count(iter);
- ret = madvise_lock(mm, &madv_behavior);
+ ret = madvise_lock(&madv_behavior);
if (ret)
return ret;
- madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior);
while (iov_iter_count(iter)) {
unsigned long start = (unsigned long)iter_iov_addr(iter);
@@ -1990,8 +1994,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
if (madvise_should_skip(start, len_in, behavior, &error))
ret = error;
else
- ret = madvise_do_behavior(mm, start, len_in,
- &madv_behavior);
+ ret = madvise_do_behavior(start, len_in, &madv_behavior);
/*
* An madvise operation is attempting to restart the syscall,
* but we cannot proceed as it would not be correct to repeat
@@ -2010,11 +2013,11 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
/* Drop and reacquire lock to unwind race. */
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
- ret = madvise_lock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
+ ret = madvise_lock(&madv_behavior);
if (ret)
goto out;
- madvise_init_tlb(&madv_behavior, mm);
+ madvise_init_tlb(&madv_behavior);
continue;
}
if (ret < 0)
@@ -2022,7 +2025,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
iov_iter_advance(iter, iter_iov_len(iter));
}
madvise_finish_tlb(&madv_behavior);
- madvise_unlock(mm, &madv_behavior);
+ madvise_unlock(&madv_behavior);
out:
ret = (total_len - iov_iter_count(iter)) ? : ret;
--
2.49.0
* [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-19 20:26 [PATCH 0/5] madvise cleanup Lorenzo Stoakes
2025-06-19 20:26 ` [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
2025-06-19 20:26 ` [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
@ 2025-06-19 20:26 ` Lorenzo Stoakes
2025-06-20 1:54 ` Zi Yan
2025-06-20 13:49 ` Vlastimil Babka
2025-06-19 20:26 ` [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
2025-06-19 20:26 ` [PATCH 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
4 siblings, 2 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Rather than updating start and a confusing local variable 'tmp' in
madvise_walk_vmas(), store the current range being operated upon in the
struct madvise_behavior helper object as a range pair, and use this
consistently in all operations.
This makes it clearer what is going on, and opens the door to further
cleanup now that we store state regarding what is currently being operated
upon here.
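The clamping step the patch rewrites - range->end = min(vma->vm_end, end)
replacing the 'tmp' juggling - can be sketched with stubs. vm_end and
walk_end below are illustrative stand-ins for vma->vm_end and the overall
walk end; this is a model, not the kernel code.

```c
#include <assert.h>

struct behavior_range { unsigned long start, end; };

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Per-VMA sub-range is recorded in the shared range state rather than
 * in a local variable, so madvise_vma_behavior() can read it from the
 * state object alone. */
static void clamp_range_to_vma(struct behavior_range *range,
			       unsigned long vm_end, unsigned long walk_end)
{
	/* Equivalent of the patch's: range->end = min(vma->vm_end, end); */
	range->end = min_ul(vm_end, walk_end);
}
```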
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
1 file changed, 55 insertions(+), 46 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 47485653c2a1..6faa38b92111 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -58,17 +58,26 @@ enum madvise_lock_mode {
MADVISE_VMA_READ_LOCK,
};
+struct madvise_behavior_range {
+ unsigned long start, end;
+};
+
struct madvise_behavior {
struct mm_struct *mm;
int behavior;
struct mmu_gather *tlb;
enum madvise_lock_mode lock_mode;
struct anon_vma_name *anon_name;
+
+ /*
+ * The range over which the behaviour is currently being applied. If
+ * traversing multiple VMAs, this is updated for each.
+ */
+ struct madvise_behavior_range range;
};
#ifdef CONFIG_ANON_VMA_NAME
-static int madvise_walk_vmas(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior);
+static int madvise_walk_vmas(struct madvise_behavior *madv_behavior);
struct anon_vma_name *anon_vma_name_alloc(const char *name)
{
@@ -149,8 +158,9 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
madv_behavior.behavior =
anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
-
- return madvise_walk_vmas(start, end, &madv_behavior);
+ madv_behavior.range.start = start;
+ madv_behavior.range.end = end;
+ return madvise_walk_vmas(&madv_behavior);
}
static bool is_anon_vma_name(int behavior)
@@ -1012,12 +1022,13 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
return -EINVAL;
}
-static long madvise_populate(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static long madvise_populate(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
const bool write = madv_behavior->behavior == MADV_POPULATE_WRITE;
int locked = 1;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
long pages;
while (start < end) {
@@ -1308,12 +1319,13 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
*/
static int madvise_vma_behavior(struct vm_area_struct *vma,
struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
struct madvise_behavior *madv_behavior)
{
int behavior = madv_behavior->behavior;
struct anon_vma_name *anon_name = madv_behavior->anon_name;
vm_flags_t new_flags = vma->vm_flags;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
int error;
if (unlikely(!can_modify_vma_madv(vma, behavior)))
@@ -1425,10 +1437,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
/*
* Error injection support for memory error handling.
*/
-static int madvise_inject_error(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static int madvise_inject_error(struct madvise_behavior *madv_behavior)
{
unsigned long size;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -1482,8 +1495,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior)
#else
-static int madvise_inject_error(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static int madvise_inject_error(struct madvise_behavior *madv_behavior)
{
return 0;
}
@@ -1565,20 +1577,20 @@ static bool process_madvise_remote_valid(int behavior)
* If a VMA read lock could not be acquired, we return NULL and expect caller to
* fallback to mmap lock behaviour.
*/
-static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
- struct madvise_behavior *madv_behavior,
- unsigned long start, unsigned long end)
+static
+struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
- vma = lock_vma_under_rcu(mm, start);
+ vma = lock_vma_under_rcu(mm, madv_behavior->range.start);
if (!vma)
goto take_mmap_read_lock;
/*
* Must span only a single VMA; uffd and remote processes are
* unsupported.
*/
- if (end > vma->vm_end || current->mm != mm ||
+ if (madv_behavior->range.end > vma->vm_end || current->mm != mm ||
userfaultfd_armed(vma)) {
vma_end_read(vma);
goto take_mmap_read_lock;
@@ -1600,13 +1612,13 @@ static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
* Must be called with the mmap_lock held for reading or writing.
*/
static
-int madvise_walk_vmas(unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+ unsigned long end = range->end;
struct vm_area_struct *vma;
struct vm_area_struct *prev;
- unsigned long tmp;
int unmapped_error = 0;
int error;
@@ -1615,11 +1627,10 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
* tentatively, avoiding walking VMAs.
*/
if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
- vma = try_vma_read_lock(mm, madv_behavior, start, end);
+ vma = try_vma_read_lock(madv_behavior);
if (vma) {
prev = vma;
- error = madvise_vma_behavior(vma, &prev, start, end,
- madv_behavior);
+ error = madvise_vma_behavior(vma, &prev, madv_behavior);
vma_end_read(vma);
return error;
}
@@ -1630,8 +1641,8 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(mm, start, &prev);
- if (vma && start > vma->vm_start)
+ vma = find_vma_prev(mm, range->start, &prev);
+ if (vma && range->start > vma->vm_start)
prev = vma;
for (;;) {
@@ -1640,32 +1651,29 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
return -ENOMEM;
/* Here start < (end|vma->vm_end). */
- if (start < vma->vm_start) {
+ if (range->start < vma->vm_start) {
unmapped_error = -ENOMEM;
- start = vma->vm_start;
- if (start >= end)
+ range->start = vma->vm_start;
+ if (range->start >= end)
break;
}
- /* Here vma->vm_start <= start < (end|vma->vm_end) */
- tmp = vma->vm_end;
- if (end < tmp)
- tmp = end;
+ /* Here vma->vm_start <= range->start < (end|vma->vm_end) */
+ range->end = min(vma->vm_end, end);
- /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma_behavior(vma, &prev, start, tmp,
- madv_behavior);
+ /* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
+ error = madvise_vma_behavior(vma, &prev, madv_behavior);
if (error)
return error;
- start = tmp;
- if (prev && start < prev->vm_end)
- start = prev->vm_end;
- if (start >= end)
+ range->start = range->end;
+ if (prev && range->start < prev->vm_end)
+ range->start = prev->vm_end;
+ if (range->start >= range->end)
break;
if (prev)
vma = find_vma(mm, prev->vm_end);
else /* madvise_remove dropped mmap_lock */
- vma = find_vma(mm, start);
+ vma = find_vma(mm, range->start);
}
return unmapped_error;
@@ -1846,22 +1854,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
struct madvise_behavior *madv_behavior)
{
struct blk_plug plug;
- unsigned long end;
int error;
+ struct madvise_behavior_range *range = &madv_behavior->range;
if (is_memory_failure(madv_behavior)) {
- end = start + len_in;
- return madvise_inject_error(start, end, madv_behavior);
+ range->start = start;
+ range->end = start + len_in;
+ return madvise_inject_error(madv_behavior);
}
- start = get_untagged_addr(madv_behavior->mm, start);
- end = start + PAGE_ALIGN(len_in);
+ range->start = get_untagged_addr(madv_behavior->mm, start);
+ range->end = range->start + PAGE_ALIGN(len_in);
blk_start_plug(&plug);
if (is_madvise_populate(madv_behavior))
- error = madvise_populate(start, end, madv_behavior);
+ error = madvise_populate(madv_behavior);
else
- error = madvise_walk_vmas(start, end, madv_behavior);
+ error = madvise_walk_vmas(madv_behavior);
blk_finish_plug(&plug);
return error;
}
--
2.49.0
* [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-19 20:26 [PATCH 0/5] madvise cleanup Lorenzo Stoakes
` (2 preceding siblings ...)
2025-06-19 20:26 ` [PATCH 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
@ 2025-06-19 20:26 ` Lorenzo Stoakes
2025-06-20 2:20 ` Zi Yan
` (2 more replies)
2025-06-19 20:26 ` [PATCH 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
4 siblings, 3 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
Doing so means we can get rid of all the weird struct vm_area_struct **prev
stuff; everything becomes consistent, and in future, if we want to make
changes to behaviour, there is a single place where all relevant state is
stored.
This also allows us to update try_vma_read_lock() to be a little more
succinct and to set up state for us, as well as to clean up
madvise_update_vma().
We also update the debug assertion prior to madvise_update_vma() to assert
that this is a write operation, as correctly pointed out by Barry in the
relevant thread.
We can't reasonably update the madvise functions that live outside of
mm/madvise.c so we leave those as-is.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/madvise.c | 283 ++++++++++++++++++++++++++-------------------------
1 file changed, 146 insertions(+), 137 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 6faa38b92111..86fe04aa7c88 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -74,6 +74,8 @@ struct madvise_behavior {
* traversing multiple VMAs, this is updated for each.
*/
struct madvise_behavior_range range;
+ /* The VMA and VMA preceding it (if applicable) currently targeted. */
+ struct vm_area_struct *prev, *vma;
};
#ifdef CONFIG_ANON_VMA_NAME
@@ -194,26 +196,27 @@ static bool is_anon_vma_name(int behavior)
* Caller should ensure anon_name stability by raising its refcount even when
* anon_name belongs to a valid vma because this function might free that vma.
*/
-static int madvise_update_vma(struct vm_area_struct *vma,
- struct vm_area_struct **prev, unsigned long start,
- unsigned long end, vm_flags_t new_flags,
- struct anon_vma_name *anon_name)
+static int madvise_update_vma(vm_flags_t new_flags,
+ struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
int error;
- VMA_ITERATOR(vmi, mm, start);
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+ struct anon_vma_name *anon_name = madv_behavior->anon_name;
+ VMA_ITERATOR(vmi, madv_behavior->mm, range->start);
if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) {
- *prev = vma;
+ madv_behavior->prev = vma;
return 0;
}
- vma = vma_modify_flags_name(&vmi, *prev, vma, start, end, new_flags,
- anon_name);
+ vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma,
+ range->start, range->end, new_flags, anon_name);
if (IS_ERR(vma))
return PTR_ERR(vma);
- *prev = vma;
+ madv_behavior->vma = vma;
+ madv_behavior->prev = vma;
/* vm_flags is protected by the mmap_lock held in write mode. */
vma_start_write(vma);
@@ -318,15 +321,16 @@ static void shmem_swapin_range(struct vm_area_struct *vma,
/*
* Schedule all required I/O operations. Do not wait for completion.
*/
-static long madvise_willneed(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_willneed(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct mm_struct *mm = madv_behavior->mm;
struct file *file = vma->vm_file;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
loff_t offset;
- *prev = vma;
+ madv_behavior->prev = vma;
#ifdef CONFIG_SWAP
if (!file) {
walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma);
@@ -355,7 +359,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
* vma's reference to the file) can go away as soon as we drop
* mmap_lock.
*/
- *prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
get_file(file);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -620,16 +624,19 @@ static const struct mm_walk_ops cold_walk_ops = {
};
static void madvise_cold_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct madvise_behavior *madv_behavior)
+
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
struct madvise_walk_private walk_private = {
.pageout = false,
.tlb = tlb,
};
tlb_start_vma(tlb, vma);
- walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private);
+ walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops,
+ &walk_private);
tlb_end_vma(tlb, vma);
}
@@ -638,28 +645,26 @@ static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP|VM_HUGETLB));
}
-static long madvise_cold(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_cold(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
struct mmu_gather tlb;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_gather_mmu(&tlb, madv_behavior->mm);
+ madvise_cold_page_range(&tlb, madv_behavior);
tlb_finish_mmu(&tlb);
return 0;
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct vm_area_struct *vma,
+ struct madvise_behavior_range *range)
{
struct madvise_walk_private walk_private = {
.pageout = true,
@@ -667,18 +672,17 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
};
tlb_start_vma(tlb, vma);
- walk_page_range_vma(vma, addr, end, &cold_walk_ops, &walk_private);
+ walk_page_range_vma(vma, range->start, range->end, &cold_walk_ops,
+ &walk_private);
tlb_end_vma(tlb, vma);
}
-static long madvise_pageout(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
+ struct vm_area_struct *vma = madv_behavior->vma;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -693,8 +697,8 @@ static long madvise_pageout(struct vm_area_struct *vma,
return 0;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_gather_mmu(&tlb, madv_behavior->mm);
+ madvise_pageout_page_range(&tlb, vma, &madv_behavior->range);
tlb_finish_mmu(&tlb);
return 0;
@@ -857,11 +861,12 @@ static inline enum page_walk_lock get_walk_lock(enum madvise_lock_mode mode)
}
}
-static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
- struct vm_area_struct *vma,
- unsigned long start_addr, unsigned long end_addr)
+static int madvise_free_single_vma(struct madvise_behavior *madv_behavior)
{
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = madv_behavior->mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ unsigned long start_addr = madv_behavior->range.start;
+ unsigned long end_addr = madv_behavior->range.end;
struct mmu_notifier_range range;
struct mmu_gather *tlb = madv_behavior->tlb;
struct mm_walk_ops walk_ops = {
@@ -913,25 +918,28 @@ static int madvise_free_single_vma(struct madvise_behavior *madv_behavior,
* An interface that causes the system to free clean pages and flush
* dirty pages is already available as msync(MS_INVALIDATE).
*/
-static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior,
- struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static long madvise_dontneed_single_vma(struct madvise_behavior *madv_behavior)
+
{
+ struct madvise_behavior_range *range = &madv_behavior->range;
struct zap_details details = {
.reclaim_pt = true,
.even_cows = true,
};
zap_page_range_single_batched(
- madv_behavior->tlb, vma, start, end - start, &details);
+ madv_behavior->tlb, madv_behavior->vma, range->start,
+ range->end - range->start, &details);
return 0;
}
-static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long *end,
- int behavior)
+static
+bool madvise_dontneed_free_valid_vma(struct madvise_behavior *madv_behavior)
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ int behavior = madv_behavior->behavior;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+
if (!is_vm_hugetlb_page(vma)) {
unsigned int forbidden = VM_PFNMAP;
@@ -943,7 +951,7 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
if (behavior != MADV_DONTNEED && behavior != MADV_DONTNEED_LOCKED)
return false;
- if (start & ~huge_page_mask(hstate_vma(vma)))
+ if (range->start & ~huge_page_mask(hstate_vma(vma)))
return false;
/*
@@ -952,41 +960,40 @@ static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
* Avoid unexpected data loss by rounding down the number of
* huge pages freed.
*/
- *end = ALIGN_DOWN(*end, huge_page_size(hstate_vma(vma)));
+ range->end = ALIGN_DOWN(range->end, huge_page_size(hstate_vma(vma)));
return true;
}
-static long madvise_dontneed_free(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end,
- struct madvise_behavior *madv_behavior)
+static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
{
+ struct mm_struct *mm = madv_behavior->mm;
+ struct madvise_behavior_range *range = &madv_behavior->range;
int behavior = madv_behavior->behavior;
- struct mm_struct *mm = vma->vm_mm;
- *prev = vma;
- if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior))
+ madv_behavior->prev = madv_behavior->vma;
+ if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
- if (start == end)
+ if (range->start == range->end)
return 0;
- if (!userfaultfd_remove(vma, start, end)) {
- *prev = NULL; /* mmap_lock has been dropped, prev is stale */
+ if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
+ struct vm_area_struct *vma;
+
+ madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
mmap_read_lock(mm);
- vma = vma_lookup(mm, start);
+ madv_behavior->vma = vma = vma_lookup(mm, range->start);
if (!vma)
return -ENOMEM;
/*
* Potential end adjustment for hugetlb vma is OK as
* the check below keeps end within vma.
*/
- if (!madvise_dontneed_free_valid_vma(vma, start, &end,
- behavior))
+ if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
- if (end > vma->vm_end) {
+ if (range->end > vma->vm_end) {
/*
* Don't fail if end > vma->vm_end. If the old
* vma was split while the mmap_lock was
@@ -999,7 +1006,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* end-vma->vm_end range, but the manager can
* handle a repetition fine.
*/
- end = vma->vm_end;
+ range->end = vma->vm_end;
}
/*
* If the memory region between start and end was
@@ -1008,16 +1015,15 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* the adjustment for hugetlb vma above may have rounded
* end down to the start address.
*/
- if (start == end)
+ if (range->start == range->end)
return 0;
- VM_WARN_ON(start > end);
+ VM_WARN_ON(range->start > range->end);
}
if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED)
- return madvise_dontneed_single_vma(
- madv_behavior, vma, start, end);
+ return madvise_dontneed_single_vma(madv_behavior);
else if (behavior == MADV_FREE)
- return madvise_free_single_vma(madv_behavior, vma, start, end);
+ return madvise_free_single_vma(madv_behavior);
else
return -EINVAL;
}
@@ -1065,16 +1071,17 @@ static long madvise_populate(struct madvise_behavior *madv_behavior)
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
-static long madvise_remove(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_remove(struct madvise_behavior *madv_behavior)
{
loff_t offset;
int error;
struct file *f;
- struct mm_struct *mm = vma->vm_mm;
+ struct mm_struct *mm = madv_behavior->mm;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ unsigned long start = madv_behavior->range.start;
+ unsigned long end = madv_behavior->range.end;
- *prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1186,14 +1193,14 @@ static const struct mm_walk_ops guard_install_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static long madvise_guard_install(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_guard_install(struct madvise_behavior *madv_behavior)
{
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
long err;
int i;
- *prev = vma;
+ madv_behavior->prev = vma;
if (!is_valid_guard_vma(vma, /* allow_locked = */false))
return -EINVAL;
@@ -1224,13 +1231,14 @@ static long madvise_guard_install(struct vm_area_struct *vma,
unsigned long nr_pages = 0;
/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
- err = walk_page_range_mm(vma->vm_mm, start, end,
+ err = walk_page_range_mm(vma->vm_mm, range->start, range->end,
&guard_install_walk_ops, &nr_pages);
if (err < 0)
return err;
if (err == 0) {
- unsigned long nr_expected_pages = PHYS_PFN(end - start);
+ unsigned long nr_expected_pages =
+ PHYS_PFN(range->end - range->start);
VM_WARN_ON(nr_pages != nr_expected_pages);
return 0;
@@ -1240,7 +1248,8 @@ static long madvise_guard_install(struct vm_area_struct *vma,
* OK some of the range have non-guard pages mapped, zap
* them. This leaves existing guard pages in place.
*/
- zap_page_range_single(vma, start, end - start, NULL);
+ zap_page_range_single(vma, range->start,
+ range->end - range->start, NULL);
}
/*
@@ -1296,11 +1305,12 @@ static const struct mm_walk_ops guard_remove_walk_ops = {
.walk_lock = PGWALK_RDLOCK,
};
-static long madvise_guard_remove(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
{
- *prev = vma;
+ struct vm_area_struct *vma = madv_behavior->vma;
+ struct madvise_behavior_range *range = &madv_behavior->range;
+
+ madv_behavior->prev = vma;
/*
* We're ok with removing guards in mlock()'d ranges, as this is a
* non-destructive action.
@@ -1308,7 +1318,7 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
if (!is_valid_guard_vma(vma, /* allow_locked = */true))
return -EINVAL;
- return walk_page_range_vma(vma, start, end,
+ return walk_page_range_vma(vma, range->start, range->end,
&guard_remove_walk_ops, NULL);
}
@@ -1317,40 +1327,37 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
* will handle splitting a vm area into separate areas, each area with its own
* behavior.
*/
-static int madvise_vma_behavior(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- struct madvise_behavior *madv_behavior)
+static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
{
int behavior = madv_behavior->behavior;
- struct anon_vma_name *anon_name = madv_behavior->anon_name;
+ struct vm_area_struct *vma = madv_behavior->vma;
vm_flags_t new_flags = vma->vm_flags;
- unsigned long start = madv_behavior->range.start;
- unsigned long end = madv_behavior->range.end;
+ struct madvise_behavior_range *range = &madv_behavior->range;
int error;
- if (unlikely(!can_modify_vma_madv(vma, behavior)))
+ if (unlikely(!can_modify_vma_madv(madv_behavior->vma, behavior)))
return -EPERM;
switch (behavior) {
case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
+ return madvise_remove(madv_behavior);
case MADV_WILLNEED:
- return madvise_willneed(vma, prev, start, end);
+ return madvise_willneed(madv_behavior);
case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
+ return madvise_cold(madv_behavior);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(madv_behavior);
case MADV_FREE:
case MADV_DONTNEED:
case MADV_DONTNEED_LOCKED:
- return madvise_dontneed_free(vma, prev, start, end,
- madv_behavior);
+ return madvise_dontneed_free(madv_behavior);
case MADV_COLLAPSE:
- return madvise_collapse(vma, prev, start, end);
+ return madvise_collapse(vma, &madv_behavior->prev,
+ range->start, range->end);
case MADV_GUARD_INSTALL:
- return madvise_guard_install(vma, prev, start, end);
+ return madvise_guard_install(madv_behavior);
case MADV_GUARD_REMOVE:
- return madvise_guard_remove(vma, prev, start, end);
+ return madvise_guard_remove(madv_behavior);
/* The below behaviours update VMAs via madvise_update_vma(). */
@@ -1367,18 +1374,18 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
new_flags |= VM_DONTCOPY;
break;
case MADV_DOFORK:
- if (vma->vm_flags & VM_IO)
+ if (new_flags & VM_IO)
return -EINVAL;
new_flags &= ~VM_DONTCOPY;
break;
case MADV_WIPEONFORK:
/* MADV_WIPEONFORK is only supported on anonymous memory. */
- if (vma->vm_file || vma->vm_flags & VM_SHARED)
+ if (vma->vm_file || new_flags & VM_SHARED)
return -EINVAL;
new_flags |= VM_WIPEONFORK;
break;
case MADV_KEEPONFORK:
- if (vma->vm_flags & VM_DROPPABLE)
+ if (new_flags & VM_DROPPABLE)
return -EINVAL;
new_flags &= ~VM_WIPEONFORK;
break;
@@ -1386,14 +1393,15 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
new_flags |= VM_DONTDUMP;
break;
case MADV_DODUMP:
- if ((!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) ||
- (vma->vm_flags & VM_DROPPABLE))
+ if ((!is_vm_hugetlb_page(vma) && (new_flags & VM_SPECIAL)) ||
+ (new_flags & VM_DROPPABLE))
return -EINVAL;
new_flags &= ~VM_DONTDUMP;
break;
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
- error = ksm_madvise(vma, start, end, behavior, &new_flags);
+ error = ksm_madvise(vma, range->start, range->end,
+ behavior, &new_flags);
if (error)
goto out;
break;
@@ -1411,18 +1419,16 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
break;
}
- /* We cannot provide prev in this lock mode. */
- VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
+ /* This is a write operation. */
+ VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
if (!is_anon_vma_name(behavior)) {
- anon_name = anon_vma_name(vma);
- anon_vma_name_get(anon_name);
+ madv_behavior->anon_name = anon_vma_name(vma);
+ anon_vma_name_get(madv_behavior->anon_name);
}
- error = madvise_update_vma(vma, prev, start, end, new_flags,
- anon_name);
+ error = madvise_update_vma(new_flags, madv_behavior);
if (!is_anon_vma_name(behavior))
- anon_vma_name_put(anon_name);
-
+ anon_vma_name_put(madv_behavior->anon_name);
out:
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -1572,13 +1578,13 @@ static bool process_madvise_remote_valid(int behavior)
* span either partially or fully.
*
* This function always returns with an appropriate lock held. If a VMA read
- * lock could be acquired, we return the locked VMA.
+ * lock could be acquired, we return true and set madv_behavior state
+ * accordingly.
*
- * If a VMA read lock could not be acquired, we return NULL and expect caller to
+ * If a VMA read lock could not be acquired, we return false and expect caller to
* fallback to mmap lock behaviour.
*/
-static
-struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
+static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
{
struct mm_struct *mm = madv_behavior->mm;
struct vm_area_struct *vma;
@@ -1595,12 +1601,14 @@ struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
vma_end_read(vma);
goto take_mmap_read_lock;
}
- return vma;
+ madv_behavior->prev = vma; /* Not currently required. */
+ madv_behavior->vma = vma;
+ return true;
take_mmap_read_lock:
mmap_read_lock(mm);
madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
- return NULL;
+ return false;
}
/*
@@ -1617,23 +1625,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
struct mm_struct *mm = madv_behavior->mm;
struct madvise_behavior_range *range = &madv_behavior->range;
unsigned long end = range->end;
- struct vm_area_struct *vma;
- struct vm_area_struct *prev;
int unmapped_error = 0;
int error;
+ struct vm_area_struct *vma;
/*
* If VMA read lock is supported, apply madvise to a single VMA
* tentatively, avoiding walking VMAs.
*/
- if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
- vma = try_vma_read_lock(madv_behavior);
- if (vma) {
- prev = vma;
- error = madvise_vma_behavior(vma, &prev, madv_behavior);
- vma_end_read(vma);
- return error;
- }
+ if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
+ try_vma_read_lock(madv_behavior)) {
+ error = madvise_vma_behavior(madv_behavior);
+ vma_end_read(madv_behavior->vma);
+ return error;
}
/*
@@ -1641,11 +1645,13 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(mm, range->start, &prev);
+ vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
if (vma && range->start > vma->vm_start)
- prev = vma;
+ madv_behavior->prev = vma;
for (;;) {
+ struct vm_area_struct *prev;
+
/* Still start < end. */
if (!vma)
return -ENOMEM;
@@ -1662,13 +1668,16 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
range->end = min(vma->vm_end, end);
/* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
- error = madvise_vma_behavior(vma, &prev, madv_behavior);
+ madv_behavior->vma = vma;
+ error = madvise_vma_behavior(madv_behavior);
if (error)
return error;
+ prev = madv_behavior->prev;
+
range->start = range->end;
if (prev && range->start < prev->vm_end)
range->start = prev->vm_end;
- if (range->start >= range->end)
+ if (range->start >= end)
break;
if (prev)
vma = find_vma(mm, prev->vm_end);
--
2.49.0
* [PATCH 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-19 20:26 [PATCH 0/5] madvise cleanup Lorenzo Stoakes
` (3 preceding siblings ...)
2025-06-19 20:26 ` [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
@ 2025-06-19 20:26 ` Lorenzo Stoakes
2025-06-20 15:02 ` Vlastimil Babka
4 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-19 20:26 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
The madvise code has long had very confusing handling of the 'prev' VMA
pointer passed around various functions which, in all cases except
madvise_update_vma(), is unused and instead simply updated as soon as the
function is invoked.
To compound the confusion, the prev pointer is also used to indicate to the
caller that the mmap lock has been dropped and that we can therefore not
safely access the end of the current VMA (which might have been updated by
madvise_update_vma()).
Clear up this confusion by not setting prev = vma anywhere except in
madvise_walk_vmas(), updating all references to prev (which will always be
equal to vma after madvise_vma_behavior() is invoked), and adding a flag to
make it explicit that the lock has been dropped.
Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
not appropriate here, but we leave existing code as-is).
Finally, we adjust the madvise_walk_vmas() logic to be a little clearer: we
delay assigning the end of the range to the start of the new range until the
last moment, and handle the lock-dropped scenario immediately.
We additionally add some explanatory comments.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/huge_mm.h | 9 +++---
mm/khugepaged.c | 9 ++----
mm/madvise.c | 63 +++++++++++++++++++++--------------------
3 files changed, 39 insertions(+), 42 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8f1b15213f61..4d5bb67dc4ec 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -433,9 +433,8 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
int advice);
-int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end);
+int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, bool *lock_dropped);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
unsigned long end, struct vm_area_struct *next);
spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -596,8 +595,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
}
static inline int madvise_collapse(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+ unsigned long start,
+ unsigned long end, bool *lock_dropped)
{
return -EINVAL;
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3495a20cef5e..1aa7ca67c756 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2727,8 +2727,8 @@ static int madvise_collapse_errno(enum scan_result r)
}
}
-int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
- unsigned long start, unsigned long end)
+int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, bool *lock_dropped)
{
struct collapse_control *cc;
struct mm_struct *mm = vma->vm_mm;
@@ -2739,8 +2739,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
BUG_ON(vma->vm_start > start);
BUG_ON(vma->vm_end < end);
- *prev = vma;
-
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
return -EINVAL;
@@ -2788,7 +2786,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
&mmap_locked, cc);
}
if (!mmap_locked)
- *prev = NULL; /* Tell caller we dropped mmap_lock */
+ *lock_dropped = true;
handle_result:
switch (result) {
@@ -2798,7 +2796,6 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
break;
case SCAN_PTE_MAPPED_HUGEPAGE:
BUG_ON(mmap_locked);
- BUG_ON(*prev);
mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
mmap_read_unlock(mm);
diff --git a/mm/madvise.c b/mm/madvise.c
index 86fe04aa7c88..53c3a46d7bf6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -76,6 +76,7 @@ struct madvise_behavior {
struct madvise_behavior_range range;
/* The VMA and VMA preceding it (if applicable) currently targeted. */
struct vm_area_struct *prev, *vma;
+ bool lock_dropped;
};
#ifdef CONFIG_ANON_VMA_NAME
@@ -205,10 +206,8 @@ static int madvise_update_vma(vm_flags_t new_flags,
struct anon_vma_name *anon_name = madv_behavior->anon_name;
VMA_ITERATOR(vmi, madv_behavior->mm, range->start);
- if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name)) {
- madv_behavior->prev = vma;
+ if (new_flags == vma->vm_flags && anon_vma_name_eq(anon_vma_name(vma), anon_name))
return 0;
- }
vma = vma_modify_flags_name(&vmi, madv_behavior->prev, vma,
range->start, range->end, new_flags, anon_name);
@@ -216,7 +215,6 @@ static int madvise_update_vma(vm_flags_t new_flags,
return PTR_ERR(vma);
madv_behavior->vma = vma;
- madv_behavior->prev = vma;
/* vm_flags is protected by the mmap_lock held in write mode. */
vma_start_write(vma);
@@ -330,7 +328,6 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
unsigned long end = madv_behavior->range.end;
loff_t offset;
- madv_behavior->prev = vma;
#ifdef CONFIG_SWAP
if (!file) {
walk_page_range_vma(vma, start, end, &swapin_walk_ops, vma);
@@ -359,7 +356,7 @@ static long madvise_willneed(struct madvise_behavior *madv_behavior)
* vma's reference to the file) can go away as soon as we drop
* mmap_lock.
*/
- madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->lock_dropped = true;
get_file(file);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -650,7 +647,6 @@ static long madvise_cold(struct madvise_behavior *madv_behavior)
struct vm_area_struct *vma = madv_behavior->vma;
struct mmu_gather tlb;
- madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -682,7 +678,6 @@ static long madvise_pageout(struct madvise_behavior *madv_behavior)
struct mmu_gather tlb;
struct vm_area_struct *vma = madv_behavior->vma;
- madv_behavior->prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -971,7 +966,6 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
struct madvise_behavior_range *range = &madv_behavior->range;
int behavior = madv_behavior->behavior;
- madv_behavior->prev = madv_behavior->vma;
if (!madvise_dontneed_free_valid_vma(madv_behavior))
return -EINVAL;
@@ -981,8 +975,7 @@ static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
struct vm_area_struct *vma;
- madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
-
+ madv_behavior->lock_dropped = true;
mmap_read_lock(mm);
madv_behavior->vma = vma = vma_lookup(mm, range->start);
if (!vma)
@@ -1081,7 +1074,7 @@ static long madvise_remove(struct madvise_behavior *madv_behavior)
unsigned long start = madv_behavior->range.start;
unsigned long end = madv_behavior->range.end;
- madv_behavior->prev = NULL; /* tell sys_madvise we drop mmap_lock */
+ madv_behavior->lock_dropped = true;
if (vma->vm_flags & VM_LOCKED)
return -EINVAL;
@@ -1200,7 +1193,6 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
long err;
int i;
- madv_behavior->prev = vma;
if (!is_valid_guard_vma(vma, /* allow_locked = */false))
return -EINVAL;
@@ -1310,7 +1302,6 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
struct vm_area_struct *vma = madv_behavior->vma;
struct madvise_behavior_range *range = &madv_behavior->range;
- madv_behavior->prev = vma;
/*
* We're ok with removing guards in mlock()'d ranges, as this is a
* non-destructive action.
@@ -1352,8 +1343,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
case MADV_DONTNEED_LOCKED:
return madvise_dontneed_free(madv_behavior);
case MADV_COLLAPSE:
- return madvise_collapse(vma, &madv_behavior->prev,
- range->start, range->end);
+ return madvise_collapse(vma, range->start, range->end,
+ &madv_behavior->lock_dropped);
case MADV_GUARD_INSTALL:
return madvise_guard_install(madv_behavior);
case MADV_GUARD_REMOVE:
@@ -1601,7 +1592,6 @@ static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
vma_end_read(vma);
goto take_mmap_read_lock;
}
- madv_behavior->prev = vma; /* Not currently required. */
madv_behavior->vma = vma;
return true;
@@ -1627,7 +1617,7 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
unsigned long end = range->end;
int unmapped_error = 0;
int error;
- struct vm_area_struct *vma;
+ struct vm_area_struct *prev, *vma;
/*
* If VMA read lock is supported, apply madvise to a single VMA
@@ -1645,19 +1635,23 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
+ vma = find_vma_prev(mm, range->start, &prev);
if (vma && range->start > vma->vm_start)
- madv_behavior->prev = vma;
+ prev = vma;
for (;;) {
- struct vm_area_struct *prev;
-
/* Still start < end. */
if (!vma)
return -ENOMEM;
/* Here start < (end|vma->vm_end). */
if (range->start < vma->vm_start) {
+ /*
+ * This indicates a gap between VMAs in the input
+ * range. This does not cause the operation to abort,
+ * rather we simply return -ENOMEM to indicate that this
+ * has happened, but carry on.
+ */
unmapped_error = -ENOMEM;
range->start = vma->vm_start;
if (range->start >= end)
@@ -1668,21 +1662,28 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
range->end = min(vma->vm_end, end);
/* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
+ madv_behavior->prev = prev;
madv_behavior->vma = vma;
error = madvise_vma_behavior(madv_behavior);
if (error)
return error;
- prev = madv_behavior->prev;
+ if (madv_behavior->lock_dropped) {
+ /* We dropped the mmap lock, we can't ref the VMA. */
+ prev = NULL;
+ vma = NULL;
+ madv_behavior->lock_dropped = false;
+ } else {
+ prev = vma;
+ vma = madv_behavior->vma;
+ }
- range->start = range->end;
- if (prev && range->start < prev->vm_end)
- range->start = prev->vm_end;
- if (range->start >= end)
+ if (vma && range->end < vma->vm_end)
+ range->end = vma->vm_end;
+ if (range->end >= end)
break;
- if (prev)
- vma = find_vma(mm, prev->vm_end);
- else /* madvise_remove dropped mmap_lock */
- vma = find_vma(mm, range->start);
+
+ vma = find_vma(mm, vma ? vma->vm_end : range->end);
+ range->start = range->end;
}
return unmapped_error;
--
2.49.0
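[Editor's note: to make the reworked madvise_walk_vmas() loop in the hunk above easier to follow, here is a hypothetical userspace model — not kernel code. The struct names, the array standing in for the maple tree behind find_vma(), and walk_model() itself are all invented for illustration; only the control flow (gaps accumulate -ENOMEM without aborting, each step is clamped to one VMA) mirrors the patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model: a "VMA" is just a [start, end)
 * interval, and lookup scans a sorted array (a stand-in for the
 * maple tree behind find_vma()). */
struct vma { unsigned long start, end; };

#define ENOMEM_ERR (-12)

/* Return the first interval whose end > addr, like find_vma(). */
static const struct vma *find_vma_model(const struct vma *vmas, size_t n,
					unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (vmas[i].end > addr)
			return &vmas[i];
	return NULL;
}

/* Walk [start, end), clamping each step to one interval. A gap in the
 * input range is noted as -ENOMEM but does not stop the walk, matching
 * the comment added in the patch above. */
static int walk_model(const struct vma *vmas, size_t n,
		      unsigned long start, unsigned long end,
		      unsigned long *covered)
{
	int unmapped_error = 0;
	const struct vma *vma = find_vma_model(vmas, n, start);

	for (;;) {
		unsigned long step_end;

		if (!vma)
			return ENOMEM_ERR;
		if (start < vma->start) {
			unmapped_error = ENOMEM_ERR; /* gap: note it, carry on */
			start = vma->start;
			if (start >= end)
				break;
		}
		step_end = vma->end < end ? vma->end : end;
		*covered += step_end - start; /* "apply" to [start, step_end) */
		if (vma->end >= end)
			break;
		vma = find_vma_model(vmas, n, vma->end);
		start = step_end;
	}
	return unmapped_error;
}
```

The model omits the lock-dropped path (where the kernel must set vma to NULL and re-look-up by range->end), but the range-advance arithmetic is the same.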
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-19 20:26 ` [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
@ 2025-06-20 1:32 ` Zi Yan
2025-06-20 2:43 ` Lance Yang
2025-06-20 5:08 ` Lorenzo Stoakes
2025-06-20 13:05 ` Vlastimil Babka
1 sibling, 2 replies; 29+ messages in thread
From: Zi Yan @ 2025-06-20 1:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> Note that users cannot inadvertently cause this behaviour to occur, as
> madvise_behavior_valid() would reject it.
>
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes, however this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 167 +++++++++++++++++++++++++++------------------------
> 1 file changed, 90 insertions(+), 77 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 070132f9842b..9dd935d64692 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -37,6 +37,9 @@
> #include "internal.h"
> #include "swap.h"
>
> +#define __MADV_SET_ANON_VMA_NAME (-1)
> +#define __MADV_CLEAR_ANON_VMA_NAME (-2)
> +
These are stored in the madvise_behavior.behavior field and used
internally. You could at least add a comment in mman-common.h
about this use, in case someone uses these values for MADV_*.
Yes, these are negative values that are very unlikely to be used,
but who knows whether someone will use them. :)
> /*
> * Maximum number of attempts we make to install guard pages before we give up
> * and return -ERESTARTNOINTR to have userspace try again.
> @@ -59,9 +62,13 @@ struct madvise_behavior {
> int behavior;
> struct mmu_gather *tlb;
> enum madvise_lock_mode lock_mode;
> + struct anon_vma_name *anon_name;
> };
>
> #ifdef CONFIG_ANON_VMA_NAME
> +static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct madvise_behavior *madv_behavior);
> +
> struct anon_vma_name *anon_vma_name_alloc(const char *name)
> {
> struct anon_vma_name *anon_name;
> @@ -112,6 +119,48 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
>
> return 0;
> }
> +
> +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> + unsigned long len_in, struct anon_vma_name *anon_name)
> +{
> + unsigned long end;
> + unsigned long len;
> + struct madvise_behavior madv_behavior = {
> + .lock_mode = MADVISE_MMAP_WRITE_LOCK,
> + .anon_name = anon_name,
> + };
> +
> + if (start & ~PAGE_MASK)
> + return -EINVAL;
> + len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> +
> + /* Check to see whether len was rounded up from small -ve to zero */
> + if (len_in && !len)
> + return -EINVAL;
> +
> + end = start + len;
> + if (end < start)
> + return -EINVAL;
> +
> + if (end == start)
> + return 0;
> +
> + madv_behavior.behavior =
> + anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
How are __MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME used?
It seems to me that madvise_vma_behavior() treats them the same.
Why does it matter whether anon_name is NULL or not?
> +
> + return madvise_walk_vmas(mm, start, end, &madv_behavior);
> +}
> +
> +static bool is_anon_vma_name(int behavior)
Maybe update_anon_vma_name()? Otherwise the function reads as if
the behavior itself could be an anon_vma_name.
> +{
> + switch (behavior) {
> + case __MADV_SET_ANON_VMA_NAME:
> + case __MADV_CLEAR_ANON_VMA_NAME:
> + return true;
> + default:
> + return false;
> + }
> +}
> #else /* CONFIG_ANON_VMA_NAME */
> static int replace_anon_vma_name(struct vm_area_struct *vma,
> struct anon_vma_name *anon_name)
> @@ -121,6 +170,11 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
>
> return 0;
> }
> +
> +static bool is_anon_vma_name(int behavior)
> +{
> + return false;
> +}
> #endif /* CONFIG_ANON_VMA_NAME */
> /*
> * Update the vm_flags on region of a vma, splitting it or merging it as
> @@ -1252,13 +1306,12 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
> static int madvise_vma_behavior(struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end,
> - void *behavior_arg)
> + struct madvise_behavior *madv_behavior)
> {
> - struct madvise_behavior *arg = behavior_arg;
> - int behavior = arg->behavior;
> - int error;
> - struct anon_vma_name *anon_name;
> + int behavior = madv_behavior->behavior;
> + struct anon_vma_name *anon_name = madv_behavior->anon_name;
> vm_flags_t new_flags = vma->vm_flags;
> + int error;
>
> if (unlikely(!can_modify_vma_madv(vma, behavior)))
> return -EPERM;
> @@ -1275,7 +1328,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> case MADV_FREE:
> case MADV_DONTNEED:
> case MADV_DONTNEED_LOCKED:
> - return madvise_dontneed_free(vma, prev, start, end, arg);
> + return madvise_dontneed_free(vma, prev, start, end,
> + madv_behavior);
> + case MADV_COLLAPSE:
> + return madvise_collapse(vma, prev, start, end);
> + case MADV_GUARD_INSTALL:
> + return madvise_guard_install(vma, prev, start, end);
> + case MADV_GUARD_REMOVE:
> + return madvise_guard_remove(vma, prev, start, end);
> +
> + /* The below behaviours update VMAs via madvise_update_vma(). */
> +
Great comment and code move!
> case MADV_NORMAL:
> new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> break;
> @@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> if (error)
> goto out;
> break;
> - case MADV_COLLAPSE:
> - return madvise_collapse(vma, prev, start, end);
> - case MADV_GUARD_INSTALL:
> - return madvise_guard_install(vma, prev, start, end);
> - case MADV_GUARD_REMOVE:
> - return madvise_guard_remove(vma, prev, start, end);
> + case __MADV_SET_ANON_VMA_NAME:
> + case __MADV_CLEAR_ANON_VMA_NAME:
> + /* Only anonymous mappings can be named */
> + if (vma->vm_file && !vma_is_anon_shmem(vma))
> + return -EBADF;
> + break;
> }
__MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME are
used here and in the code below. I do not see any functional difference
between them. I understand a NULL anon_name means clearing the name,
but that is just setting the name to NULL.
>
> /* We cannot provide prev in this lock mode. */
> - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> - anon_name = anon_vma_name(vma);
> - anon_vma_name_get(anon_name);
> + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
> +
> + if (!is_anon_vma_name(behavior)) {
> + anon_name = anon_vma_name(vma);
> + anon_vma_name_get(anon_name);
> + }
> error = madvise_update_vma(vma, prev, start, end, new_flags,
> anon_name);
> - anon_vma_name_put(anon_name);
> + if (!is_anon_vma_name(behavior))
> + anon_vma_name_put(anon_name);
Otherwise, the rest looks good to me.
--
Best Regards,
Yan, Zi
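[Editor's note: the argument checks in the quoted madvise_set_anon_name() can be exercised in isolation. The sketch below is a hypothetical userspace model assuming a 4 KiB page size; check_range() and its return convention are invented for illustration and are not kernel API.]

```c
#include <assert.h>

/* Userspace model of the range validation quoted above,
 * assuming a 4 KiB page size. */
#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define EINVAL_ERR (-22)

static int check_range(unsigned long start, unsigned long len_in)
{
	unsigned long len, end;

	if (start & ~PAGE_MASK)		/* start must be page aligned */
		return EINVAL_ERR;
	len = (len_in + ~PAGE_MASK) & PAGE_MASK; /* round len up to pages */

	/* A small negative len_in (e.g. (unsigned long)-1) wraps to zero. */
	if (len_in && !len)
		return EINVAL_ERR;

	end = start + len;
	if (end < start)		/* overflow check */
		return EINVAL_ERR;
	return end == start ? 1 : 0;	/* 1: empty range, nothing to do */
}
```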
* Re: [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-19 20:26 ` [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
@ 2025-06-20 1:40 ` Zi Yan
2025-06-20 5:12 ` Lorenzo Stoakes
2025-06-20 13:19 ` Vlastimil Babka
1 sibling, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-06-20 1:40 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
> There's no need to thread a pointer to the mm_struct nor have different
> functions signatures for each behaviour, instead store state in the struct
> madvise_behavior object consistently and use it for all madvise() actions.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 105 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 54 insertions(+), 51 deletions(-)
>
<snip>
> @@ -1422,15 +1425,14 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> /*
> * Error injection support for memory error handling.
> */
> -static int madvise_inject_error(int behavior,
> - unsigned long start, unsigned long end)
> +static int madvise_inject_error(unsigned long start, unsigned long end,
> + struct madvise_behavior *madv_behavior)
> {
> unsigned long size;
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
>
> -
> for (; start < end; start += size) {
> unsigned long pfn;
> struct page *page;
> @@ -1448,7 +1450,7 @@ static int madvise_inject_error(int behavior,
> */
> size = page_size(compound_head(page));
>
> - if (behavior == MADV_SOFT_OFFLINE) {
> + if (madv_behavior->behavior == MADV_SOFT_OFFLINE) {
> pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> pfn, start);
> ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
> @@ -1467,9 +1469,9 @@ static int madvise_inject_error(int behavior,
> return 0;
> }
>
Is this necessary? madvise_inject_error() only cares about behavior.
> -static bool is_memory_failure(int behavior)
> +static bool is_memory_failure(struct madvise_behavior *madv_behavior)
> {
> - switch (behavior) {
> + switch (madv_behavior->behavior) {
> case MADV_HWPOISON:
> case MADV_SOFT_OFFLINE:
> return true;
> @@ -1480,13 +1482,13 @@ static bool is_memory_failure(int behavior)
>
> #else
>
> -static int madvise_inject_error(int behavior,
> - unsigned long start, unsigned long end)
> +static int madvise_inject_error(unsigned long start, unsigned long end,
> + struct madvise_behavior *madv_behavior)
> {
> return 0;
> }
>
> -static bool is_memory_failure(int behavior)
> +static bool is_memory_failure(struct madvise_behavior *madv_behavior)
> {
> return false;
> }
Same here. Your is_anon_vma_name() still takes an int behavior; why
would is_memory_failure() take a struct madvise_behavior?
<snip>
> -static bool is_madvise_populate(int behavior)
> +static bool is_madvise_populate(struct madvise_behavior *madv_behavior)
> {
> - switch (behavior) {
> + switch (madv_behavior->behavior) {
> case MADV_POPULATE_READ:
> case MADV_POPULATE_WRITE:
> return true;
Ditto.
The rest looks good to me.
--
Best Regards,
Yan, Zi
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-19 20:26 ` [PATCH 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
@ 2025-06-20 1:54 ` Zi Yan
2025-06-20 2:13 ` Zi Yan
2025-06-20 5:17 ` Lorenzo Stoakes
2025-06-20 13:49 ` Vlastimil Babka
1 sibling, 2 replies; 29+ messages in thread
From: Zi Yan @ 2025-06-20 1:54 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
> Rather than updating start and a confusing local parameter 'tmp' in
> madvise_walk_vmas(), instead store the current range being operated upon in
> the struct madvise_behavior helper object in a range pair and use this
> consistently in all operations.
>
> This makes it clearer what is going on and opens the door to further
> cleanup now we store state regarding what is currently being operated upon
> here.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
> 1 file changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 47485653c2a1..6faa38b92111 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -58,17 +58,26 @@ enum madvise_lock_mode {
> MADVISE_VMA_READ_LOCK,
> };
>
> +struct madvise_behavior_range {
> + unsigned long start, end;
> +};
> +
Declare members separately?
<snip>
> @@ -1425,10 +1437,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> /*
> * Error injection support for memory error handling.
> */
> -static int madvise_inject_error(unsigned long start, unsigned long end,
> - struct madvise_behavior *madv_behavior)
> +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
> {
> unsigned long size;
> + unsigned long start = madv_behavior->range.start;
> + unsigned long end = madv_behavior->range.end;
>
> if (!capable(CAP_SYS_ADMIN))
> return -EPERM;
> @@ -1482,8 +1495,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior)
>
> #else
>
> -static int madvise_inject_error(unsigned long start, unsigned long end,
> - struct madvise_behavior *madv_behavior)
> +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
> {
> return 0;
> }
OK, now I get why you pass struct madvise_behavior to madvise_inject_error()
in Patch 2. The changes make sense to me now. Maybe delay that conversion
until this patch.
> @@ -1565,20 +1577,20 @@ static bool process_madvise_remote_valid(int behavior)
> * If a VMA read lock could not be acquired, we return NULL and expect caller to
> * fallback to mmap lock behaviour.
> */
> -static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> - struct madvise_behavior *madv_behavior,
> - unsigned long start, unsigned long end)
> +static
> +struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
> {
> + struct mm_struct *mm = madv_behavior->mm;
Was the struct mm_struct parameter removal missed in Patch 2?
<snip>
> @@ -1846,22 +1854,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
> struct madvise_behavior *madv_behavior)
> {
> struct blk_plug plug;
> - unsigned long end;
> int error;
> + struct madvise_behavior_range *range = &madv_behavior->range;
>
> if (is_memory_failure(madv_behavior)) {
> - end = start + len_in;
> - return madvise_inject_error(start, end, madv_behavior);
> + range->start = start;
> + range->end = start + len_in;
> + return madvise_inject_error(madv_behavior);
> }
>
> - start = get_untagged_addr(madv_behavior->mm, start);
> - end = start + PAGE_ALIGN(len_in);
> + range->start = get_untagged_addr(madv_behavior->mm, start);
> + range->end = range->start + PAGE_ALIGN(len_in);
>
> blk_start_plug(&plug);
> if (is_madvise_populate(madv_behavior))
> - error = madvise_populate(start, end, madv_behavior);
> + error = madvise_populate(madv_behavior);
> else
> - error = madvise_walk_vmas(start, end, madv_behavior);
> + error = madvise_walk_vmas(madv_behavior);
> blk_finish_plug(&plug);
> return error;
> }
We could almost pass just struct madvise_behavior to madvise_do_behavior().
I wonder why the memory_failure path behaves differently.
--
Best Regards,
Yan, Zi
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 1:54 ` Zi Yan
@ 2025-06-20 2:13 ` Zi Yan
2025-06-20 5:21 ` Lorenzo Stoakes
2025-06-20 5:17 ` Lorenzo Stoakes
1 sibling, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-06-20 2:13 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan, Kirill A . Shutemov
On 19 Jun 2025, at 21:54, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
>> Rather than updating start and a confusing local parameter 'tmp' in
>> madvise_walk_vmas(), instead store the current range being operated upon in
>> the struct madvise_behavior helper object in a range pair and use this
>> consistently in all operations.
>>
>> This makes it clearer what is going on and opens the door to further
>> cleanup now we store state regarding what is currently being operated upon
>> here.
>>
>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> ---
>> mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
>> 1 file changed, 55 insertions(+), 46 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 47485653c2a1..6faa38b92111 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -58,17 +58,26 @@ enum madvise_lock_mode {
>> MADVISE_VMA_READ_LOCK,
>> };
>>
>> +struct madvise_behavior_range {
>> + unsigned long start, end;
>> +};
>> +
>
> Declare members separately?
>
> <snip>
>
>> @@ -1425,10 +1437,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>> /*
>> * Error injection support for memory error handling.
>> */
>> -static int madvise_inject_error(unsigned long start, unsigned long end,
>> - struct madvise_behavior *madv_behavior)
>> +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
>> {
>> unsigned long size;
>> + unsigned long start = madv_behavior->range.start;
>> + unsigned long end = madv_behavior->range.end;
>>
>> if (!capable(CAP_SYS_ADMIN))
>> return -EPERM;
>> @@ -1482,8 +1495,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior)
>>
>> #else
>>
>> -static int madvise_inject_error(unsigned long start, unsigned long end,
>> - struct madvise_behavior *madv_behavior)
>> +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
>> {
>> return 0;
>> }
>
> OK, now I get why you pass struct madvise_behavior to madvise_inject_error()
> in Patch 2. The changes make sense to me now. Maybe delay that conversation
> in this one.
>
>
>
>> @@ -1565,20 +1577,20 @@ static bool process_madvise_remote_valid(int behavior)
>> * If a VMA read lock could not be acquired, we return NULL and expect caller to
>> * fallback to mmap lock behaviour.
>> */
>> -static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
>> - struct madvise_behavior *madv_behavior,
>> - unsigned long start, unsigned long end)
>> +static
>> +struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
>> {
>> + struct mm_struct *mm = madv_behavior->mm;
>
> Is the struct mm_struct removal missed in Patch 2?
>
>
> <snip>
>
>> @@ -1846,22 +1854,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
>> struct madvise_behavior *madv_behavior)
>> {
>> struct blk_plug plug;
>> - unsigned long end;
>> int error;
>> + struct madvise_behavior_range *range = &madv_behavior->range;
>>
>> if (is_memory_failure(madv_behavior)) {
>> - end = start + len_in;
>> - return madvise_inject_error(start, end, madv_behavior);
>> + range->start = start;
>> + range->end = start + len_in;
>> + return madvise_inject_error(madv_behavior);
>> }
>>
>> - start = get_untagged_addr(madv_behavior->mm, start);
>> - end = start + PAGE_ALIGN(len_in);
>> + range->start = get_untagged_addr(madv_behavior->mm, start);
>> + range->end = range->start + PAGE_ALIGN(len_in);
>>
>> blk_start_plug(&plug);
>> if (is_madvise_populate(madv_behavior))
>> - error = madvise_populate(start, end, madv_behavior);
>> + error = madvise_populate(madv_behavior);
>> else
>> - error = madvise_walk_vmas(start, end, madv_behavior);
>> + error = madvise_walk_vmas(madv_behavior);
>> blk_finish_plug(&plug);
>> return error;
>> }
>
> We almost can pass just struct madvise_behavior to madvise_do_behavior().
> I wonder why memory_failure behaves differently.
Based on git history, it seems that no one paid attention to
madvise_inject_error(), and the [start, start + len_in] range has never
been changed since it was added back in 2009.
OK, it seems that Kirill (cc'd) moved start = untagged_addr(start); from
before madvise_inject_error() to after it in commit 428e106ae1ad
("mm: Introduce untagged_addr_remote()"). That changed the code's behavior.
So memory_failure should get the same range as others, meaning
madvise_do_behavior() can just take struct madvise_behavior
and the range can be set at the call sites.
--
Best Regards,
Yan, Zi
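[Editor's note: a brief sketch of why the ordering discussed above matters. With top-byte address tagging (e.g. arm64 TBI), a tagged pointer only matches the real mapping once the tag is masked off, so whether untagging runs before or after the error-injection path decides which address that path sees. The masking below is a hypothetical model, not the kernel's untagged_addr() implementation.]

```c
#include <assert.h>

/* Model of top-byte tag removal: the upper 8 bits carry the tag and
 * must be cleared before the address can be compared against VMAs. */
#define TAG_SHIFT 56
#define TAG_MASK  (0xffUL << TAG_SHIFT)

static unsigned long untag(unsigned long addr)
{
	return addr & ~TAG_MASK; /* drop the top-byte tag */
}
```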
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-19 20:26 ` [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
@ 2025-06-20 2:20 ` Zi Yan
2025-06-20 5:21 ` Lorenzo Stoakes
2025-06-20 14:16 ` Vlastimil Babka
2025-06-20 14:32 ` Vlastimil Babka
2 siblings, 1 reply; 29+ messages in thread
From: Zi Yan @ 2025-06-20 2:20 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
> Doing so means we can get rid of all the weird struct vm_area_struct **prev
> stuff, everything becomes consistent and in future if we want to make
> change to behaviour there's a single place where all relevant state is
> stored.
>
> This also allows us to update try_vma_read_lock() to be a little more
> succinct and set up state for us, as well as cleaning up
> madvise_update_vma().
>
> We also update the debug assertion prior to madvise_update_vma() to assert
> that this is a write operation as correctly pointed out by Barry in the
> relevant thread.
>
> We can't reasonably update the madvise functions that live outside of
> mm/madvise.c so we leave those as-is.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/madvise.c | 283 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 146 insertions(+), 137 deletions(-)
>
Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 1:32 ` Zi Yan
@ 2025-06-20 2:43 ` Lance Yang
2025-06-20 5:10 ` Lorenzo Stoakes
2025-06-20 5:08 ` Lorenzo Stoakes
1 sibling, 1 reply; 29+ messages in thread
From: Lance Yang @ 2025-06-20 2:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan, Zi Yan
On 2025/6/20 09:32, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
>> Now we have the madvise_behavior helper struct we no longer need to mess
>> around with void* pointers in order to propagate anon_vma_name, and this
>> means we can get rid of the confusing and inconsistent visitor pattern
>> implementation in madvise_vma_anon_name().
>>
>> This means we now have a single state object that threads through most of
>> madvise()'s logic and a single code path which executes the majority of
>> madvise() behaviour (we maintain separate logic for failure injection and
>> memory population for the time being).
>>
>> Note that users cannot inadvertently cause this behaviour to occur, as
>> madvise_behavior_valid() would reject it.
>>
>> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
>> changes, however this will cause no issues as this operation is not
>> prohibited.
>>
>> We can also then reuse more code and drop the redundant
>> madvise_vma_anon_name() function altogether.
>>
>> Additionally separate out behaviours that update VMAs from those that do
>> not.
>>
>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> ---
>> mm/madvise.c | 167 +++++++++++++++++++++++++++------------------------
>> 1 file changed, 90 insertions(+), 77 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 070132f9842b..9dd935d64692 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -37,6 +37,9 @@
>> #include "internal.h"
>> #include "swap.h"
>>
>> +#define __MADV_SET_ANON_VMA_NAME (-1)
>> +#define __MADV_CLEAR_ANON_VMA_NAME (-2)
>> +
>
> These are stored in madvise_behavior.behavior field and used
> internally. At least you could add a comment in mman-common.h
> about this use, in case someone uses these values for MADV_*.
> Yes, these are large values that are very unlikely to be used,
> but who knows whether one will use them. :)
>
>> /*
>> * Maximum number of attempts we make to install guard pages before we give up
>> * and return -ERESTARTNOINTR to have userspace try again.
>> @@ -59,9 +62,13 @@ struct madvise_behavior {
>> int behavior;
>> struct mmu_gather *tlb;
>> enum madvise_lock_mode lock_mode;
>> + struct anon_vma_name *anon_name;
>> };
>>
>> #ifdef CONFIG_ANON_VMA_NAME
>> +static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
>> + unsigned long end, struct madvise_behavior *madv_behavior);
>> +
>> struct anon_vma_name *anon_vma_name_alloc(const char *name)
>> {
>> struct anon_vma_name *anon_name;
>> @@ -112,6 +119,48 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
>>
>> return 0;
>> }
>> +
>> +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>> + unsigned long len_in, struct anon_vma_name *anon_name)
>> +{
>> + unsigned long end;
>> + unsigned long len;
>> + struct madvise_behavior madv_behavior = {
>> + .lock_mode = MADVISE_MMAP_WRITE_LOCK,
>> + .anon_name = anon_name,
>> + };
>> +
>> + if (start & ~PAGE_MASK)
>> + return -EINVAL;
>> + len = (len_in + ~PAGE_MASK) & PAGE_MASK;
>> +
>> + /* Check to see whether len was rounded up from small -ve to zero */
>> + if (len_in && !len)
>> + return -EINVAL;
>> +
>> + end = start + len;
>> + if (end < start)
>> + return -EINVAL;
>> +
>> + if (end == start)
>> + return 0;
>> +
>> + madv_behavior.behavior =
>> + anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
>
> How are __MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME used?
> It seems to me that madvise_vma_behavior() treats them the same.
> Why does anon_name is NULL or not make a difference?
>
>> +
>> + return madvise_walk_vmas(mm, start, end, &madv_behavior);
>> +}
>> +
>> +static bool is_anon_vma_name(int behavior)
>
> Maybe update_anon_vma_name()? Otherwise the function reads like
> the behavior can be anon_vma_name.
>
>> +{
>> + switch (behavior) {
>> + case __MADV_SET_ANON_VMA_NAME:
>> + case __MADV_CLEAR_ANON_VMA_NAME:
>> + return true;
>> + default:
>> + return false;
>> + }
>> +}
>> #else /* CONFIG_ANON_VMA_NAME */
>> static int replace_anon_vma_name(struct vm_area_struct *vma,
>> struct anon_vma_name *anon_name)
>> @@ -121,6 +170,11 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
>>
>> return 0;
>> }
>> +
>> +static bool is_anon_vma_name(int behavior)
>> +{
>> + return false;
>> +}
>> #endif /* CONFIG_ANON_VMA_NAME */
>> /*
>> * Update the vm_flags on region of a vma, splitting it or merging it as
>> @@ -1252,13 +1306,12 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
>> static int madvise_vma_behavior(struct vm_area_struct *vma,
>> struct vm_area_struct **prev,
>> unsigned long start, unsigned long end,
>> - void *behavior_arg)
>> + struct madvise_behavior *madv_behavior)
>> {
>> - struct madvise_behavior *arg = behavior_arg;
>> - int behavior = arg->behavior;
>> - int error;
>> - struct anon_vma_name *anon_name;
>> + int behavior = madv_behavior->behavior;
>> + struct anon_vma_name *anon_name = madv_behavior->anon_name;
>> vm_flags_t new_flags = vma->vm_flags;
>> + int error;
>>
>> if (unlikely(!can_modify_vma_madv(vma, behavior)))
>> return -EPERM;
>> @@ -1275,7 +1328,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>> case MADV_FREE:
>> case MADV_DONTNEED:
>> case MADV_DONTNEED_LOCKED:
>> - return madvise_dontneed_free(vma, prev, start, end, arg);
>> + return madvise_dontneed_free(vma, prev, start, end,
>> + madv_behavior);
>> + case MADV_COLLAPSE:
>> + return madvise_collapse(vma, prev, start, end);
>> + case MADV_GUARD_INSTALL:
>> + return madvise_guard_install(vma, prev, start, end);
>> + case MADV_GUARD_REMOVE:
>> + return madvise_guard_remove(vma, prev, start, end);
>> +
>> + /* The below behaviours update VMAs via madvise_update_vma(). */
>> +
>
> Great comment and code move!
>
>> case MADV_NORMAL:
>> new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
>> break;
>> @@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>> if (error)
>> goto out;
>> break;
>> - case MADV_COLLAPSE:
>> - return madvise_collapse(vma, prev, start, end);
>> - case MADV_GUARD_INSTALL:
>> - return madvise_guard_install(vma, prev, start, end);
>> - case MADV_GUARD_REMOVE:
>> - return madvise_guard_remove(vma, prev, start, end);
>> + case __MADV_SET_ANON_VMA_NAME:
>> + case __MADV_CLEAR_ANON_VMA_NAME:
Would this be clearer and more closely describe the code's logic?
/* Reject naming for regular file-backed mappings. */
>> + /* Only anonymous mappings can be named */
>> + if (vma->vm_file && !vma_is_anon_shmem(vma))
>> + return -EBADF;
>> + break;
>> }
Thanks,
Lance
>
> __MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME are
> used here and in the code below. I do not see the functional
> difference between them. I understand a NULL anon_name means clear
> the name, but that is also just setting the name to NULL.
>
>>
>> /* We cannot provide prev in this lock mode. */
>> - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
>> - anon_name = anon_vma_name(vma);
>> - anon_vma_name_get(anon_name);
>> + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
>> +
>> + if (!is_anon_vma_name(behavior)) {
>> + anon_name = anon_vma_name(vma);
>> + anon_vma_name_get(anon_name);
>> + }
>> error = madvise_update_vma(vma, prev, start, end, new_flags,
>> anon_name);
>> - anon_vma_name_put(anon_name);
>> + if (!is_anon_vma_name(behavior))
>> + anon_vma_name_put(anon_name);
>
> Otherwise, the rest looks good to me.
>
> --
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 1:32 ` Zi Yan
2025-06-20 2:43 ` Lance Yang
@ 2025-06-20 5:08 ` Lorenzo Stoakes
1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:08 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Thu, Jun 19, 2025 at 09:32:43PM -0400, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
> > Now we have the madvise_behavior helper struct we no longer need to mess
> > around with void* pointers in order to propagate anon_vma_name, and this
> > means we can get rid of the confusing and inconsistent visitor pattern
> > implementation in madvise_vma_anon_name().
> >
> > This means we now have a single state object that threads through most of
> > madvise()'s logic and a single code path which executes the majority of
> > madvise() behaviour (we maintain separate logic for failure injection and
> > memory population for the time being).
> >
> > Note that users cannot inadvertently cause this behaviour to occur, as
> > madvise_behavior_valid() would reject it.
> >
> > Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> > changes, however this will cause no issues as this operation is not
> > prohibited.
> >
> > We can also then reuse more code and drop the redundant
> > madvise_vma_anon_name() function altogether.
> >
> > Additionally separate out behaviours that update VMAs from those that do
> > not.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 167 +++++++++++++++++++++++++++------------------------
> > 1 file changed, 90 insertions(+), 77 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 070132f9842b..9dd935d64692 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -37,6 +37,9 @@
> > #include "internal.h"
> > #include "swap.h"
> >
> > +#define __MADV_SET_ANON_VMA_NAME (-1)
> > +#define __MADV_CLEAR_ANON_VMA_NAME (-2)
> > +
>
> These are stored in madvise_behavior.behavior field and used
> internally. At least you could add a comment in mman-common.h
> about this use, in case someone uses these values for MADV_*.
> Yes, these are large values that are very unlikely to be used,
> but who knows whether one will use them. :)
These are signed integers, so somebody would have to overflow past ~2.14bn. I
don't think this is really worth commenting on there without adding confusion;
at any rate, even if somebody did do that, madvise_behavior_valid() would catch
it :)
>
> > /*
> > * Maximum number of attempts we make to install guard pages before we give up
> > * and return -ERESTARTNOINTR to have userspace try again.
> > @@ -59,9 +62,13 @@ struct madvise_behavior {
> > int behavior;
> > struct mmu_gather *tlb;
> > enum madvise_lock_mode lock_mode;
> > + struct anon_vma_name *anon_name;
> > };
> >
> > #ifdef CONFIG_ANON_VMA_NAME
> > +static int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> > + unsigned long end, struct madvise_behavior *madv_behavior);
> > +
> > struct anon_vma_name *anon_vma_name_alloc(const char *name)
> > {
> > struct anon_vma_name *anon_name;
> > @@ -112,6 +119,48 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
> >
> > return 0;
> > }
> > +
> > +int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> > + unsigned long len_in, struct anon_vma_name *anon_name)
> > +{
> > + unsigned long end;
> > + unsigned long len;
> > + struct madvise_behavior madv_behavior = {
> > + .lock_mode = MADVISE_MMAP_WRITE_LOCK,
> > + .anon_name = anon_name,
> > + };
> > +
> > + if (start & ~PAGE_MASK)
> > + return -EINVAL;
> > + len = (len_in + ~PAGE_MASK) & PAGE_MASK;
> > +
> > + /* Check to see whether len was rounded up from small -ve to zero */
> > + if (len_in && !len)
> > + return -EINVAL;
> > +
> > + end = start + len;
> > + if (end < start)
> > + return -EINVAL;
> > +
> > + if (end == start)
> > + return 0;
> > +
> > + madv_behavior.behavior =
> > + anon_name ? __MADV_SET_ANON_VMA_NAME : __MADV_CLEAR_ANON_VMA_NAME;
>
> How are __MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME used?
> It seems to me that madvise_vma_behavior() treats them the same.
> Why does it make a difference whether anon_name is NULL or not?
Yeah you're right, I separated out because at one point I thought I needed to to
correctly handle the update vs. other updates but that's not actually the case,
will change.
>
> > +
> > + return madvise_walk_vmas(mm, start, end, &madv_behavior);
> > +}
> > +
> > +static bool is_anon_vma_name(int behavior)
>
> Maybe update_anon_vma_name()? Otherwise the function reads like
> the behavior can be anon_vma_name.
Ack will change to is_set_anon_vma_name() or something like this to make
clearer.
>
> > +{
> > + switch (behavior) {
> > + case __MADV_SET_ANON_VMA_NAME:
> > + case __MADV_CLEAR_ANON_VMA_NAME:
> > + return true;
> > + default:
> > + return false;
> > + }
> > +}
> > #else /* CONFIG_ANON_VMA_NAME */
> > static int replace_anon_vma_name(struct vm_area_struct *vma,
> > struct anon_vma_name *anon_name)
> > @@ -121,6 +170,11 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
> >
> > return 0;
> > }
> > +
> > +static bool is_anon_vma_name(int behavior)
> > +{
> > + return false;
> > +}
> > #endif /* CONFIG_ANON_VMA_NAME */
> > /*
> > * Update the vm_flags on region of a vma, splitting it or merging it as
> > @@ -1252,13 +1306,12 @@ static long madvise_guard_remove(struct vm_area_struct *vma,
> > static int madvise_vma_behavior(struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > unsigned long start, unsigned long end,
> > - void *behavior_arg)
> > + struct madvise_behavior *madv_behavior)
> > {
> > - struct madvise_behavior *arg = behavior_arg;
> > - int behavior = arg->behavior;
> > - int error;
> > - struct anon_vma_name *anon_name;
> > + int behavior = madv_behavior->behavior;
> > + struct anon_vma_name *anon_name = madv_behavior->anon_name;
> > vm_flags_t new_flags = vma->vm_flags;
> > + int error;
> >
> > if (unlikely(!can_modify_vma_madv(vma, behavior)))
> > return -EPERM;
> > @@ -1275,7 +1328,17 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > case MADV_FREE:
> > case MADV_DONTNEED:
> > case MADV_DONTNEED_LOCKED:
> > - return madvise_dontneed_free(vma, prev, start, end, arg);
> > + return madvise_dontneed_free(vma, prev, start, end,
> > + madv_behavior);
> > + case MADV_COLLAPSE:
> > + return madvise_collapse(vma, prev, start, end);
> > + case MADV_GUARD_INSTALL:
> > + return madvise_guard_install(vma, prev, start, end);
> > + case MADV_GUARD_REMOVE:
> > + return madvise_guard_remove(vma, prev, start, end);
> > +
> > + /* The below behaviours update VMAs via madvise_update_vma(). */
> > +
>
> Great comment and code move!
Thanks!
>
> > case MADV_NORMAL:
> > new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
> > break;
> > @@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > if (error)
> > goto out;
> > break;
> > - case MADV_COLLAPSE:
> > - return madvise_collapse(vma, prev, start, end);
> > - case MADV_GUARD_INSTALL:
> > - return madvise_guard_install(vma, prev, start, end);
> > - case MADV_GUARD_REMOVE:
> > - return madvise_guard_remove(vma, prev, start, end);
> > + case __MADV_SET_ANON_VMA_NAME:
> > + case __MADV_CLEAR_ANON_VMA_NAME:
> > + /* Only anonymous mappings can be named */
> > + if (vma->vm_file && !vma_is_anon_shmem(vma))
> > + return -EBADF;
> > + break;
> > }
>
> __MADV_SET_ANON_VMA_NAME and __MADV_CLEAR_ANON_VMA_NAME are
> used here and in the code below. I do not see the functional
> difference between them. I understand a NULL anon_name means clear
> the name, but that is also just setting the name to NULL.
Yeah you're right this was my mistake :) I had previously felt I should
differentiate as previously the code below was
	if (madv_behavior->anon_name) { ...
Which would have become:
	if (behavior != __MADV_CLEAR_ANON_VMA_NAME && madv_behavior->anon_name) { ...
But then realised we can just check the behavior flag, which also lets us
avoid the unnecessary get/put.
Will fixup.
>
> >
> > /* We cannot provide prev in this lock mode. */
> > - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> > - anon_name = anon_vma_name(vma);
> > - anon_vma_name_get(anon_name);
> > + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
> > +
> > + if (!is_anon_vma_name(behavior)) {
> > + anon_name = anon_vma_name(vma);
> > + anon_vma_name_get(anon_name);
> > + }
> > error = madvise_update_vma(vma, prev, start, end, new_flags,
> > anon_name);
> > - anon_vma_name_put(anon_name);
> > + if (!is_anon_vma_name(behavior))
> > + anon_vma_name_put(anon_name);
>
> Otherwise, the rest looks good to me.
Thanks!
>
> --
> Best Regards,
> Yan, Zi
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 2:43 ` Lance Yang
@ 2025-06-20 5:10 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:10 UTC (permalink / raw)
To: Lance Yang
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan, Zi Yan
On Fri, Jun 20, 2025 at 10:43:30AM +0800, Lance Yang wrote:
> Would this be clearer, and describe the code's logic more closely?
>
> /* Reject naming for regular file-backed mappings. */
Yeah here's the fun of 'anonymous' being super unclear :) it's like the 'no
true Scotsman' fallacy in mm form... ;)
'Anonymous, no really really anonymous, not even file-backed' :P
it's a good point but I'd rather keep this as close to the original as
possible so it's just a code _move_. Perhaps we can update later but I
think the context makes clear what is meant here at least.
>
>
> > > + /* Only anonymous mappings can be named */
> > > + if (vma->vm_file && !vma_is_anon_shmem(vma))
> > > + return -EBADF;
> > > + break;
> > > }
>
>
> Thanks,
> Lance
>
* Re: [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-20 1:40 ` Zi Yan
@ 2025-06-20 5:12 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:12 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Thu, Jun 19, 2025 at 09:40:34PM -0400, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
> > There's no need to thread a pointer to the mm_struct nor have different
> > functions signatures for each behaviour, instead store state in the struct
> > madvise_behavior object consistently and use it for all madvise() actions.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 105 ++++++++++++++++++++++++++-------------------------
> > 1 file changed, 54 insertions(+), 51 deletions(-)
> >
>
> <snip>
>
> > @@ -1422,15 +1425,14 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > /*
> > * Error injection support for memory error handling.
> > */
> > -static int madvise_inject_error(int behavior,
> > - unsigned long start, unsigned long end)
> > +static int madvise_inject_error(unsigned long start, unsigned long end,
> > + struct madvise_behavior *madv_behavior)
> > {
> > unsigned long size;
> >
> > if (!capable(CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > -
> > for (; start < end; start += size) {
> > unsigned long pfn;
> > struct page *page;
> > @@ -1448,7 +1450,7 @@ static int madvise_inject_error(int behavior,
> > */
> > size = page_size(compound_head(page));
> >
> > - if (behavior == MADV_SOFT_OFFLINE) {
> > + if (madv_behavior->behavior == MADV_SOFT_OFFLINE) {
> > pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> > pfn, start);
> > ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
> > @@ -1467,9 +1469,9 @@ static int madvise_inject_error(int behavior,
> > return 0;
> > }
> >
>
> Is this necessary? madvise_inject_error() only cares about behavior.
As you noticed in a subsequent patch, we go further. It's also useful for all
of these functions to have the exact same signature for consistency, something
sorely lacking in madvise.c for some time.
This is covered under the 'nor have different functions signatures for each
behaviour' bit in the commit message :)
>
> > -static bool is_memory_failure(int behavior)
> > +static bool is_memory_failure(struct madvise_behavior *madv_behavior)
> > {
> > - switch (behavior) {
> > + switch (madv_behavior->behavior) {
> > case MADV_HWPOISON:
> > case MADV_SOFT_OFFLINE:
> > return true;
> > @@ -1480,13 +1482,13 @@ static bool is_memory_failure(int behavior)
> >
> > #else
> >
> > -static int madvise_inject_error(int behavior,
> > - unsigned long start, unsigned long end)
> > +static int madvise_inject_error(unsigned long start, unsigned long end,
> > + struct madvise_behavior *madv_behavior)
> > {
> > return 0;
> > }
> >
> > -static bool is_memory_failure(int behavior)
> > +static bool is_memory_failure(struct madvise_behavior *madv_behavior)
> > {
> > return false;
> > }
>
> Same here. Your is_anon_vma_name() still takes int behavior, why
> would is_memory_failure() take struct madvise_behavior?
See above.
>
> <snip>
>
> > -static bool is_madvise_populate(int behavior)
> > +static bool is_madvise_populate(struct madvise_behavior *madv_behavior)
> > {
> > - switch (behavior) {
> > + switch (madv_behavior->behavior) {
> > case MADV_POPULATE_READ:
> > case MADV_POPULATE_WRITE:
> > return true;
>
> Ditto.
Doing it this way makes life easier (esp later) and is more consistent with
other invocations.
>
> The rest looks good to me.
Thanks!
>
> --
> Best Regards,
> Yan, Zi
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 1:54 ` Zi Yan
2025-06-20 2:13 ` Zi Yan
@ 2025-06-20 5:17 ` Lorenzo Stoakes
1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:17 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Thu, Jun 19, 2025 at 09:54:11PM -0400, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
> > Rather than updating start and a confusing local parameter 'tmp' in
> > madvise_walk_vmas(), instead store the current range being operated upon in
> > the struct madvise_behavior helper object in a range pair and use this
> > consistently in all operations.
> >
> > This makes it clearer what is going on and opens the door to further
> > cleanup now we store state regarding what is currently being operated upon
> > here.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
> > 1 file changed, 55 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 47485653c2a1..6faa38b92111 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -58,17 +58,26 @@ enum madvise_lock_mode {
> > MADVISE_VMA_READ_LOCK,
> > };
> >
> > +struct madvise_behavior_range {
> > + unsigned long start, end;
> > +};
> > +
>
> Declare members separately?
Can do, but this is one of those subject things where everyone has different
views, if I did it the other way no doubt somebody else would comment about
declaring together :P
I think as a range here it's not a big deal unless you feel strongly about it?
>
> <snip>
>
> > @@ -1425,10 +1437,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > /*
> > * Error injection support for memory error handling.
> > */
> > -static int madvise_inject_error(unsigned long start, unsigned long end,
> > - struct madvise_behavior *madv_behavior)
> > +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
> > {
> > unsigned long size;
> > + unsigned long start = madv_behavior->range.start;
> > + unsigned long end = madv_behavior->range.end;
> >
> > if (!capable(CAP_SYS_ADMIN))
> > return -EPERM;
> > @@ -1482,8 +1495,7 @@ static bool is_memory_failure(struct madvise_behavior *madv_behavior)
> >
> > #else
> >
> > -static int madvise_inject_error(unsigned long start, unsigned long end,
> > - struct madvise_behavior *madv_behavior)
> > +static int madvise_inject_error(struct madvise_behavior *madv_behavior)
> > {
> > return 0;
> > }
>
> OK, now I get why you pass struct madvise_behavior to madvise_inject_error()
> in Patch 2. The changes make sense to me now. Maybe delay that conversion
> until this patch.
I think it's valuable there because otherwise all the function invocations were
inconsistent, but after 2/5 become completely consistent. I mention this in the
commit message and I think it's valuable so you're not doing:
	if (foo)
		bar(x, y, z)
	if (blah)
		baz(y, x, z)
etc.
When you quickly read through it's easy to get confused/lost as to what's going
on, whereas if they all have the same signatures it's very clear you're
offloading the heavy lifting to each function.
>
>
>
> > @@ -1565,20 +1577,20 @@ static bool process_madvise_remote_valid(int behavior)
> > * If a VMA read lock could not be acquired, we return NULL and expect caller to
> > * fallback to mmap lock behaviour.
> > */
> > -static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> > - struct madvise_behavior *madv_behavior,
> > - unsigned long start, unsigned long end)
> > +static
> > +struct vm_area_struct *try_vma_read_lock(struct madvise_behavior *madv_behavior)
> > {
> > + struct mm_struct *mm = madv_behavior->mm;
>
> Is the struct mm_struct removal missed in Patch 2?
Yeah, I will go back and put it in on respin.
>
>
> <snip>
>
> > @@ -1846,22 +1854,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
> > struct madvise_behavior *madv_behavior)
> > {
> > struct blk_plug plug;
> > - unsigned long end;
> > int error;
> > + struct madvise_behavior_range *range = &madv_behavior->range;
> >
> > if (is_memory_failure(madv_behavior)) {
> > - end = start + len_in;
> > - return madvise_inject_error(start, end, madv_behavior);
> > + range->start = start;
> > + range->end = start + len_in;
> > + return madvise_inject_error(madv_behavior);
> > }
> >
> > - start = get_untagged_addr(madv_behavior->mm, start);
> > - end = start + PAGE_ALIGN(len_in);
> > + range->start = get_untagged_addr(madv_behavior->mm, start);
> > + range->end = range->start + PAGE_ALIGN(len_in);
> >
> > blk_start_plug(&plug);
> > if (is_madvise_populate(madv_behavior))
> > - error = madvise_populate(start, end, madv_behavior);
> > + error = madvise_populate(madv_behavior);
> > else
> > - error = madvise_walk_vmas(start, end, madv_behavior);
> > + error = madvise_walk_vmas(madv_behavior);
> > blk_finish_plug(&plug);
> > return error;
> > }
>
> We almost can pass just struct madvise_behavior to madvise_do_behavior().
> I wonder why memory_failure behaves differently.
There's complexity around the start, end stuff (Barry bumped into some of this)
and I don't want to mess with that in this series. This series is meant to have
no functional changes.
>
> --
> Best Regards,
> Yan, Zi
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 2:13 ` Zi Yan
@ 2025-06-20 5:21 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:21 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan, Kirill A . Shutemov
On Thu, Jun 19, 2025 at 10:13:19PM -0400, Zi Yan wrote:
> On 19 Jun 2025, at 21:54, Zi Yan wrote:
> >> @@ -1846,22 +1854,23 @@ static int madvise_do_behavior(unsigned long start, size_t len_in,
> >> struct madvise_behavior *madv_behavior)
> >> {
> >> struct blk_plug plug;
> >> - unsigned long end;
> >> int error;
> >> + struct madvise_behavior_range *range = &madv_behavior->range;
> >>
> >> if (is_memory_failure(madv_behavior)) {
> >> - end = start + len_in;
> >> - return madvise_inject_error(start, end, madv_behavior);
> >> + range->start = start;
> >> + range->end = start + len_in;
> >> + return madvise_inject_error(madv_behavior);
> >> }
> >>
> >> - start = get_untagged_addr(madv_behavior->mm, start);
> >> - end = start + PAGE_ALIGN(len_in);
> >> + range->start = get_untagged_addr(madv_behavior->mm, start);
> >> + range->end = range->start + PAGE_ALIGN(len_in);
> >>
> >> blk_start_plug(&plug);
> >> if (is_madvise_populate(madv_behavior))
> >> - error = madvise_populate(start, end, madv_behavior);
> >> + error = madvise_populate(madv_behavior);
> >> else
> >> - error = madvise_walk_vmas(start, end, madv_behavior);
> >> + error = madvise_walk_vmas(madv_behavior);
> >> blk_finish_plug(&plug);
> >> return error;
> >> }
> >
> > We almost can pass just struct madvise_behavior to madvise_do_behavior().
> > I wonder why memory_failure behaves differently.
>
> Based on git history, it seems that no one paid attention to
> madvise_inject_error() and the [start, start + len_in] range has never
> changed since it was added back in 2009.
>
> OK, it seems that Kirill (cc'd) moved start = untagged_addr(start); from
> before madvise_inject_error() to after it at commit 428e106ae1ad
> ("mm: Introduce untagged_addr_remote()"). It changed code behavior.
>
> So memory_failure should get the same range as others, meaning
> madvise_do_behavior() can just take struct madvise_behavior
> and the range can be set at the call sites.
Well it's also page aligned in other cases right?
Anyway overall I'm not touching that in this series, sorry. This is a clean
up, and this stuff is a functional change that needs to be carefully
thought out.
So this is legitimately I think one for a follow-up. But this series makes
it easier to fix up later :)
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 2:20 ` Zi Yan
@ 2025-06-20 5:21 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 5:21 UTC (permalink / raw)
To: Zi Yan
Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Thu, Jun 19, 2025 at 10:20:23PM -0400, Zi Yan wrote:
> On 19 Jun 2025, at 16:26, Lorenzo Stoakes wrote:
>
> > Doing so means we can get rid of all the weird struct vm_area_struct **prev
> > stuff, everything becomes consistent and in future if we want to make
> > change to behaviour there's a single place where all relevant state is
> > stored.
> >
> > This also allows us to update try_vma_read_lock() to be a little more
> > succinct and set up state for us, as well as cleaning up
> > madvise_update_vma().
> >
> > We also update the debug assertion prior to madvise_update_vma() to assert
> > that this is a write operation as correctly pointed out by Barry in the
> > relevant thread.
> >
> > We can't reasonably update the madvise functions that live outside of
> > mm/madvise.c so we leave those as-is.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > mm/madvise.c | 283 ++++++++++++++++++++++++++-------------------------
> > 1 file changed, 146 insertions(+), 137 deletions(-)
> >
>
> Acked-by: Zi Yan <ziy@nvidia.com>
Thanks!
>
> --
> Best Regards,
> Yan, Zi
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-19 20:26 ` [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state Lorenzo Stoakes
2025-06-20 1:32 ` Zi Yan
@ 2025-06-20 13:05 ` Vlastimil Babka
2025-06-20 13:17 ` Lorenzo Stoakes
1 sibling, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 13:05 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> Now we have the madvise_behavior helper struct we no longer need to mess
> around with void* pointers in order to propagate anon_vma_name, and this
> means we can get rid of the confusing and inconsistent visitor pattern
> implementation in madvise_vma_anon_name().
>
> This means we now have a single state object that threads through most of
> madvise()'s logic and a single code path which executes the majority of
> madvise() behaviour (we maintain separate logic for failure injection and
> memory population for the time being).
>
> Note that users cannot inadvertently cause this behaviour to occur, as
> madvise_behavior_valid() would reject it.
This paragraph is a bit confusing. I've inferred from the code you're
talking about the new internal negative values, but the preceding paragraphs
don't mention them. Could you explain in more detail what the patch does?
I.e. adding the new struct madvise_behavior field and the new behavior value(s).
> Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> changes, however this will cause no issues as this operation is not
> prohibited.
>
> We can also then reuse more code and drop the redundant
> madvise_vma_anon_name() function altogether.
>
> Additionally separate out behaviours that update VMAs from those that do
> not.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> @@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> if (error)
> goto out;
> break;
> - case MADV_COLLAPSE:
> - return madvise_collapse(vma, prev, start, end);
> - case MADV_GUARD_INSTALL:
> - return madvise_guard_install(vma, prev, start, end);
> - case MADV_GUARD_REMOVE:
> - return madvise_guard_remove(vma, prev, start, end);
> + case __MADV_SET_ANON_VMA_NAME:
> + case __MADV_CLEAR_ANON_VMA_NAME:
> + /* Only anonymous mappings can be named */
> + if (vma->vm_file && !vma_is_anon_shmem(vma))
> + return -EBADF;
> + break;
> }
>
> /* We cannot provide prev in this lock mode. */
> - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> - anon_name = anon_vma_name(vma);
> - anon_vma_name_get(anon_name);
> + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
> +
> + if (!is_anon_vma_name(behavior)) {
> + anon_name = anon_vma_name(vma);
> + anon_vma_name_get(anon_name);
> + }
> error = madvise_update_vma(vma, prev, start, end, new_flags,
> anon_name);
> - anon_vma_name_put(anon_name);
> + if (!is_anon_vma_name(behavior))
> + anon_vma_name_put(anon_name);
This is not new, but the refactoring made it very visible that we're doing
get/put on anon_name exactly in cases where we're not messing with anon_name
so it might look buggy. Some explanatory comment would be thus nice,
otherwise people need to git blame for commit 942341dcc5748.
Otherwise LGTM, will wait with tag for v2 as you replied elsewhere there
will be changes. Thanks!
* Re: [PATCH 1/5] mm/madvise: remove the visitor pattern and thread anon_vma state
2025-06-20 13:05 ` Vlastimil Babka
@ 2025-06-20 13:17 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 13:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 03:05:02PM +0200, Vlastimil Babka wrote:
> On 6/19/25 22:26, Lorenzo Stoakes wrote:
> > Now we have the madvise_behavior helper struct we no longer need to mess
> > around with void* pointers in order to propagate anon_vma_name, and this
> > means we can get rid of the confusing and inconsistent visitor pattern
> > implementation in madvise_vma_anon_name().
> >
> > This means we now have a single state object that threads through most of
> > madvise()'s logic and a single code path which executes the majority of
> > madvise() behaviour (we maintain separate logic for failure injection and
> > memory population for the time being).
> >
> > Note that users cannot inadvertently cause this behaviour to occur, as
> > madvise_behavior_valid() would reject it.
>
> This paragraph is a bit confusing. I've inferred from the code you're
> talking about the new internal negative values, but the preceding paragraphs
> don't mention them. Could you explain in more detail what the patch does?
> I.e. adding the new struct madvise_behavior field and the new behavior value(s).
Sure will update on respin.
>
> > Doing this results in a can_modify_vma_madv() check for anonymous VMA name
> > changes, however this will cause no issues as this operation is not
> > prohibited.
> >
> > We can also then reuse more code and drop the redundant
> > madvise_vma_anon_name() function altogether.
> >
> > Additionally separate out behaviours that update VMAs from those that do
> > not.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > @@ -1325,21 +1388,25 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> > if (error)
> > goto out;
> > break;
> > - case MADV_COLLAPSE:
> > - return madvise_collapse(vma, prev, start, end);
> > - case MADV_GUARD_INSTALL:
> > - return madvise_guard_install(vma, prev, start, end);
> > - case MADV_GUARD_REMOVE:
> > - return madvise_guard_remove(vma, prev, start, end);
> > + case __MADV_SET_ANON_VMA_NAME:
> > + case __MADV_CLEAR_ANON_VMA_NAME:
> > + /* Only anonymous mappings can be named */
> > + if (vma->vm_file && !vma_is_anon_shmem(vma))
> > + return -EBADF;
> > + break;
> > }
> >
> > /* We cannot provide prev in this lock mode. */
> > - VM_WARN_ON_ONCE(arg->lock_mode == MADVISE_VMA_READ_LOCK);
> > - anon_name = anon_vma_name(vma);
> > - anon_vma_name_get(anon_name);
> > + VM_WARN_ON_ONCE(madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK);
> > +
> > + if (!is_anon_vma_name(behavior)) {
> > + anon_name = anon_vma_name(vma);
> > + anon_vma_name_get(anon_name);
> > + }
> > error = madvise_update_vma(vma, prev, start, end, new_flags,
> > anon_name);
> > - anon_vma_name_put(anon_name);
> > + if (!is_anon_vma_name(behavior))
> > + anon_vma_name_put(anon_name);
>
> This is not new, but the refactoring made it very visible that we're doing
> get/put on anon_name exactly in cases where we're not messing with anon_name
> so it might look buggy. Some explanatory comment would thus be nice;
> otherwise people need to git blame for commit 942341dcc5748.
Yeah I was confused myself until you mentioned that commit and - of course -
it's because of merge :P which maybe I should have figured out right away but
there we are :>)
So for my own sake as well as others I will add on respin.
>
> Otherwise LGTM, will wait with tag for v2 as you replied elsewhere there
> will be changes. Thanks!
>
Thanks!
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior
2025-06-19 20:26 ` [PATCH 2/5] mm/madvise: thread mm_struct through madvise_behavior Lorenzo Stoakes
2025-06-20 1:40 ` Zi Yan
@ 2025-06-20 13:19 ` Vlastimil Babka
1 sibling, 0 replies; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 13:19 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> There's no need to thread a pointer to the mm_struct nor have different
> function signatures for each behaviour; instead, store state in the struct
> madvise_behavior object consistently and use it for all madvise() actions.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-19 20:26 ` [PATCH 3/5] mm/madvise: thread VMA range state " Lorenzo Stoakes
2025-06-20 1:54 ` Zi Yan
@ 2025-06-20 13:49 ` Vlastimil Babka
2025-06-20 13:51 ` Lorenzo Stoakes
2025-06-20 14:16 ` Lorenzo Stoakes
1 sibling, 2 replies; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 13:49 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> Rather than updating start and a confusing local parameter 'tmp' in
> madvise_walk_vmas(), instead store the current range being operated upon in
> the struct madvise_behavior helper object in a range pair and use this
> consistently in all operations.
Yeah much better but it still took me a bit to understand that "end" is now
the original range->end that doesn't change during the iterations. Maybe
make it const and comment?
> This makes it clearer what is going on and opens the door to further
> cleanup now we store state regarding what is currently being operated upon
> here.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
> 1 file changed, 55 insertions(+), 46 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 47485653c2a1..6faa38b92111 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -58,17 +58,26 @@ enum madvise_lock_mode {
> MADVISE_VMA_READ_LOCK,
> };
>
> +struct madvise_behavior_range {
> + unsigned long start, end;
I also thought multiple names on one line is only done for local variables,
but always separate declarations in structs. But I don't know if it's
documented as such or if there are pre-existing counter examples. Consider
it a non-binding agreement with Zi :)
Thanks!
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 13:49 ` Vlastimil Babka
@ 2025-06-20 13:51 ` Lorenzo Stoakes
2025-06-20 14:16 ` Lorenzo Stoakes
1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 13:51 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 03:49:31PM +0200, Vlastimil Babka wrote:
> On 6/19/25 22:26, Lorenzo Stoakes wrote:
> > Rather than updating start and a confusing local parameter 'tmp' in
> > madvise_walk_vmas(), instead store the current range being operated upon in
> > the struct madvise_behavior helper object in a range pair and use this
> > consistently in all operations.
>
> Yeah much better but it still took me a bit to understand that "end" is now
> the original range->end that doesn't change during the iterations. Maybe
> make it const and comment?
>
> > This makes it clearer what is going on and opens the door to further
> > cleanup now we store state regarding what is currently being operated upon
> > here.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
>
> > ---
> > mm/madvise.c | 101 ++++++++++++++++++++++++++++-----------------------
> > 1 file changed, 55 insertions(+), 46 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 47485653c2a1..6faa38b92111 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -58,17 +58,26 @@ enum madvise_lock_mode {
> > MADVISE_VMA_READ_LOCK,
> > };
> >
> > +struct madvise_behavior_range {
> > + unsigned long start, end;
>
> I also thought multiple names on one line is only done for local variables,
> but always separate declarations in structs. But I don't know if it's
> documented as such or if there are pre-existing counter examples. Consider
> it a non-binding agreement with Zi :)
Hm I didn't realise that was a thing at struct level, and since Zi also
highlights this I'll change it :)
Sorry Zi - my confusion, I was thinking comma-separated wasn't just for locals.
>
> Thanks!
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-19 20:26 ` [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
2025-06-20 2:20 ` Zi Yan
@ 2025-06-20 14:16 ` Vlastimil Babka
2025-06-20 14:17 ` Lorenzo Stoakes
2025-06-20 14:32 ` Vlastimil Babka
2 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 14:16 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> @@ -1662,13 +1668,16 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> range->end = min(vma->vm_end, end);
>
> /* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
> - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> + madv_behavior->vma = vma;
> + error = madvise_vma_behavior(madv_behavior);
> if (error)
> return error;
> + prev = madv_behavior->prev;
> +
> range->start = range->end;
> if (prev && range->start < prev->vm_end)
> range->start = prev->vm_end;
> - if (range->start >= range->end)
> + if (range->start >= end)
I believe this change is fixing a bug from patch 3/5 (which I didn't catch,
sigh).
> break;
> if (prev)
> vma = find_vma(mm, prev->vm_end);
* Re: [PATCH 3/5] mm/madvise: thread VMA range state through madvise_behavior
2025-06-20 13:49 ` Vlastimil Babka
2025-06-20 13:51 ` Lorenzo Stoakes
@ 2025-06-20 14:16 ` Lorenzo Stoakes
1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 14:16 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 03:49:31PM +0200, Vlastimil Babka wrote:
> On 6/19/25 22:26, Lorenzo Stoakes wrote:
> > Rather than updating start and a confusing local parameter 'tmp' in
> > madvise_walk_vmas(), instead store the current range being operated upon in
> > the struct madvise_behavior helper object in a range pair and use this
> > consistently in all operations.
>
> Yeah much better but it still took me a bit to understand that "end" is now
> the original range->end that doesn't change during the iterations. Maybe
> make it const and comment?
Sorry missed this, will adjust on respin.
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 14:16 ` Vlastimil Babka
@ 2025-06-20 14:17 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 14:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 04:16:07PM +0200, Vlastimil Babka wrote:
> On 6/19/25 22:26, Lorenzo Stoakes wrote:
> > @@ -1662,13 +1668,16 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> > range->end = min(vma->vm_end, end);
> >
> > /* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
> > - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> > + madv_behavior->vma = vma;
> > + error = madvise_vma_behavior(madv_behavior);
> > if (error)
> > return error;
> > + prev = madv_behavior->prev;
> > +
> > range->start = range->end;
> > if (prev && range->start < prev->vm_end)
> > range->start = prev->vm_end;
> > - if (range->start >= range->end)
> > + if (range->start >= end)
>
> I believe this change is fixing a bug from patch 3/5 (which I didn't catch,
> sigh).
Whoops, I fixed this during development but clearly didn't backport it to the
correct commit, will fix on respin.
>
> > break;
> > if (prev)
> > vma = find_vma(mm, prev->vm_end);
>
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-19 20:26 ` [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior Lorenzo Stoakes
2025-06-20 2:20 ` Zi Yan
2025-06-20 14:16 ` Vlastimil Babka
@ 2025-06-20 14:32 ` Vlastimil Babka
2025-06-20 14:58 ` Lorenzo Stoakes
2 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 14:32 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> Doing so means we can get rid of all the weird struct vm_area_struct **prev
> stuff; everything becomes consistent, and in future, if we want to make
> changes to behaviour, there's a single place where all relevant state is
> stored.
>
> This also allows us to update try_vma_read_lock() to be a little more
> succinct and set up state for us, as well as cleaning up
> madvise_update_vma().
>
> We also update the debug assertion prior to madvise_update_vma() to assert
> that this is a write operation as correctly pointed out by Barry in the
> relevant thread.
>
> We can't reasonably update the madvise functions that live outside of
> mm/madvise.c so we leave those as-is.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
The prev manipulation is indeed confusing, looking forward to the next patch...
Nits:
> ---
> mm/madvise.c | 283 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 146 insertions(+), 137 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6faa38b92111..86fe04aa7c88 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -74,6 +74,8 @@ struct madvise_behavior {
> * traversing multiple VMAs, this is updated for each.
> */
> struct madvise_behavior_range range;
> + /* The VMA and VMA preceding it (if applicable) currently targeted. */
> + struct vm_area_struct *prev, *vma;
Would also do separate lines here.
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start, unsigned long end,
> - struct madvise_behavior *madv_behavior)
> +static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
> {
> + struct mm_struct *mm = madv_behavior->mm;
> + struct madvise_behavior_range *range = &madv_behavior->range;
> int behavior = madv_behavior->behavior;
> - struct mm_struct *mm = vma->vm_mm;
>
> - *prev = vma;
> - if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior))
> + madv_behavior->prev = madv_behavior->vma;
> + if (!madvise_dontneed_free_valid_vma(madv_behavior))
> return -EINVAL;
>
> - if (start == end)
> + if (range->start == range->end)
> return 0;
>
> - if (!userfaultfd_remove(vma, start, end)) {
> - *prev = NULL; /* mmap_lock has been dropped, prev is stale */
> + if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
> + struct vm_area_struct *vma;
> +
> + madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
>
> mmap_read_lock(mm);
> - vma = vma_lookup(mm, start);
> + madv_behavior->vma = vma = vma_lookup(mm, range->start);
This replaces vma in madv_behavior...
> @@ -1617,23 +1625,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> struct mm_struct *mm = madv_behavior->mm;
> struct madvise_behavior_range *range = &madv_behavior->range;
> unsigned long end = range->end;
> - struct vm_area_struct *vma;
> - struct vm_area_struct *prev;
> int unmapped_error = 0;
> int error;
> + struct vm_area_struct *vma;
>
> /*
> * If VMA read lock is supported, apply madvise to a single VMA
> * tentatively, avoiding walking VMAs.
> */
> - if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> - vma = try_vma_read_lock(madv_behavior);
> - if (vma) {
> - prev = vma;
> - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> - vma_end_read(vma);
> - return error;
> - }
> + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
> + try_vma_read_lock(madv_behavior)) {
> + error = madvise_vma_behavior(madv_behavior);
> + vma_end_read(madv_behavior->vma);
> + return error;
And here we could potentially do vma_end_read() on the replaced vma. And
it's exactly cases using madvise_dontneed_free() that use
MADVISE_VMA_READ_LOCK mode. But it's not an issue as try_vma_read_lock()
will fail with uffd and that vma replacement scenario is tied to
userfaultfd_remove(). It's just quite tricky, hm...
> }
>
> /*
> @@ -1641,11 +1645,13 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> * ranges, just ignore them, but return -ENOMEM at the end.
> * - different from the way of handling in mlock etc.
> */
> - vma = find_vma_prev(mm, range->start, &prev);
> + vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
> if (vma && range->start > vma->vm_start)
> - prev = vma;
> + madv_behavior->prev = vma;
>
> for (;;) {
> + struct vm_area_struct *prev;
> +
> /* Still start < end. */
> if (!vma)
> return -ENOMEM;
> @@ -1662,13 +1668,16 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> range->end = min(vma->vm_end, end);
>
> /* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
> - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> + madv_behavior->vma = vma;
> + error = madvise_vma_behavior(madv_behavior);
> if (error)
> return error;
> + prev = madv_behavior->prev;
> +
> range->start = range->end;
> if (prev && range->start < prev->vm_end)
> range->start = prev->vm_end;
> - if (range->start >= range->end)
> + if (range->start >= end)
> break;
> if (prev)
> vma = find_vma(mm, prev->vm_end);
* Re: [PATCH 4/5] mm/madvise: thread all madvise state through madv_behavior
2025-06-20 14:32 ` Vlastimil Babka
@ 2025-06-20 14:58 ` Lorenzo Stoakes
0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-20 14:58 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On Fri, Jun 20, 2025 at 04:32:48PM +0200, Vlastimil Babka wrote:
> On 6/19/25 22:26, Lorenzo Stoakes wrote:
> > Doing so means we can get rid of all the weird struct vm_area_struct **prev
> > stuff; everything becomes consistent, and in future, if we want to make
> > changes to behaviour, there's a single place where all relevant state is
> > stored.
> >
> > This also allows us to update try_vma_read_lock() to be a little more
> > succinct and set up state for us, as well as cleaning up
> > madvise_update_vma().
> >
> > We also update the debug assertion prior to madvise_update_vma() to assert
> > that this is a write operation as correctly pointed out by Barry in the
> > relevant thread.
> >
> > We can't reasonably update the madvise functions that live outside of
> > mm/madvise.c so we leave those as-is.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
>
> The prev manipulation is indeed confusing, looking forward to the next patch...
Yeah this was the whole motivation for the series, ultimately...
>
> Nits:
>
> > ---
> > mm/madvise.c | 283 ++++++++++++++++++++++++++-------------------------
> > 1 file changed, 146 insertions(+), 137 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6faa38b92111..86fe04aa7c88 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -74,6 +74,8 @@ struct madvise_behavior {
> > * traversing multiple VMAs, this is updated for each.
> > */
> > struct madvise_behavior_range range;
> > + /* The VMA and VMA preceding it (if applicable) currently targeted. */
> > + struct vm_area_struct *prev, *vma;
>
> Would also do separate lines here.
Ack will do.
>
> > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > - struct vm_area_struct **prev,
> > - unsigned long start, unsigned long end,
> > - struct madvise_behavior *madv_behavior)
> > +static long madvise_dontneed_free(struct madvise_behavior *madv_behavior)
> > {
> > + struct mm_struct *mm = madv_behavior->mm;
> > + struct madvise_behavior_range *range = &madv_behavior->range;
> > int behavior = madv_behavior->behavior;
> > - struct mm_struct *mm = vma->vm_mm;
> >
> > - *prev = vma;
> > - if (!madvise_dontneed_free_valid_vma(vma, start, &end, behavior))
> > + madv_behavior->prev = madv_behavior->vma;
> > + if (!madvise_dontneed_free_valid_vma(madv_behavior))
> > return -EINVAL;
> >
> > - if (start == end)
> > + if (range->start == range->end)
> > return 0;
> >
> > - if (!userfaultfd_remove(vma, start, end)) {
> > - *prev = NULL; /* mmap_lock has been dropped, prev is stale */
> > + if (!userfaultfd_remove(madv_behavior->vma, range->start, range->end)) {
> > + struct vm_area_struct *vma;
> > +
> > + madv_behavior->prev = NULL; /* mmap_lock has been dropped, prev is stale */
> >
> > mmap_read_lock(mm);
> > - vma = vma_lookup(mm, start);
> > + madv_behavior->vma = vma = vma_lookup(mm, range->start);
>
> This replaces vma in madv_behavior...
Yeah, note that the lock is dropped here also :) so the previous VMA is
completely invalidated, which is why this is valid.
>
> > @@ -1617,23 +1625,19 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> > struct mm_struct *mm = madv_behavior->mm;
> > struct madvise_behavior_range *range = &madv_behavior->range;
> > unsigned long end = range->end;
> > - struct vm_area_struct *vma;
> > - struct vm_area_struct *prev;
> > int unmapped_error = 0;
> > int error;
> > + struct vm_area_struct *vma;
> >
> > /*
> > * If VMA read lock is supported, apply madvise to a single VMA
> > * tentatively, avoiding walking VMAs.
> > */
> > - if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> > - vma = try_vma_read_lock(madv_behavior);
> > - if (vma) {
> > - prev = vma;
> > - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> > - vma_end_read(vma);
> > - return error;
> > - }
> > + if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK &&
> > + try_vma_read_lock(madv_behavior)) {
> > + error = madvise_vma_behavior(madv_behavior);
> > + vma_end_read(madv_behavior->vma);
> > + return error;
>
> And here we could potentially do vma_end_read() on the replaced vma. And
> it's exactly cases using madvise_dontneed_free() that use
> MADVISE_VMA_READ_LOCK mode. But it's not an issue as try_vma_read_lock()
> will fail with uffd and that vma replacement scenario is tied to
> userfaultfd_remove(). It's just quite tricky, hm...
Yeah this kind of thing is explicitly why Barry excluded uffd replace from the
per-VMA lock I think.
So we absolutely can't use the per-VMA lock in conjunction with
uffd_remove(). I'll add a debug assert in madvise_dontneed_free() to make this
clearer.
Will add in madvise_remove() too as that does the same thing there, just to make
sure we have assert coverage.
>
> > }
> >
> > /*
> > @@ -1641,11 +1645,13 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> > * ranges, just ignore them, but return -ENOMEM at the end.
> > * - different from the way of handling in mlock etc.
> > */
> > - vma = find_vma_prev(mm, range->start, &prev);
> > + vma = find_vma_prev(mm, range->start, &madv_behavior->prev);
> > if (vma && range->start > vma->vm_start)
> > - prev = vma;
> > + madv_behavior->prev = vma;
> >
> > for (;;) {
> > + struct vm_area_struct *prev;
> > +
> > /* Still start < end. */
> > if (!vma)
> > return -ENOMEM;
> > @@ -1662,13 +1668,16 @@ int madvise_walk_vmas(struct madvise_behavior *madv_behavior)
> > range->end = min(vma->vm_end, end);
> >
> > /* Here vma->vm_start <= range->start < range->end <= (end|vma->vm_end). */
> > - error = madvise_vma_behavior(vma, &prev, madv_behavior);
> > + madv_behavior->vma = vma;
> > + error = madvise_vma_behavior(madv_behavior);
> > if (error)
> > return error;
> > + prev = madv_behavior->prev;
> > +
> > range->start = range->end;
> > if (prev && range->start < prev->vm_end)
> > range->start = prev->vm_end;
> > - if (range->start >= range->end)
> > + if (range->start >= end)
> > break;
> > if (prev)
> > vma = find_vma(mm, prev->vm_end);
>
* Re: [PATCH 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA
2025-06-19 20:26 ` [PATCH 5/5] mm/madvise: eliminate very confusing manipulation of prev VMA Lorenzo Stoakes
@ 2025-06-20 15:02 ` Vlastimil Babka
0 siblings, 0 replies; 29+ messages in thread
From: Vlastimil Babka @ 2025-06-20 15:02 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: David Hildenbrand, Zi Yan, Baolin Wang, Liam R . Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Jann Horn,
linux-mm, linux-kernel, Lance Yang, SeongJae Park,
Suren Baghdasaryan
On 6/19/25 22:26, Lorenzo Stoakes wrote:
> The madvise code has for the longest time had very confusing code around
> the 'prev' VMA pointer passed around various functions which, in all cases
> except madvise_update_vma(), is unused and instead simply updated as soon
> as the function is invoked.
>
> To compound the confusion, the prev pointer is also used to indicate to the
> caller that the mmap lock has been dropped and that we can therefore not
> safely access the end of the current VMA (which might have been updated by
> madvise_update_vma()).
>
> Clear up this confusion by not setting prev = vma anywhere except in
> madvise_walk_vmas(), update all references to prev which will always be
> equal to vma after madvise_vma_behavior() is invoked, and adding a flag to
> indicate that the lock has been dropped to make this explicit.
>
> Additionally, drop a redundant BUG_ON() from madvise_collapse(), which is
> simply reiterating the BUG_ON(mmap_locked) above it (note that BUG_ON() is
> not appropriate here, but we leave existing code as-is).
>
> We finally adjust the madvise_walk_vmas() logic to be a little clearer -
> delaying the assignment of the end of the range to the start of the new
> range until the last moment and handling the lock being dropped scenario
> immediately.
>
> Additionally add some explanatory comments.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Much nicer now, thanks!