linux-mm.kvack.org archive mirror
* [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
@ 2025-06-07  0:46 Barry Song
  2025-06-07  4:04 ` Matthew Wilcox
  2025-06-07  7:22 ` Lorenzo Stoakes
  0 siblings, 2 replies; 9+ messages in thread
From: Barry Song @ 2025-06-07  0:46 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, Barry Song, Liam R. Howlett, Lorenzo Stoakes,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng, Qi Zheng

From: Barry Song <v-songbaohua@oppo.com>

Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
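
As an illustrative userspace sketch (not part of this patch), the common
pattern is an allocator punching a hole inside one large anonymous mapping,
so both ends of the advised range fall within the same VMA:

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <assert.h>

  int main(void)
  {
          size_t len = 64 << 20;  /* one large anonymous VMA */
          char *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          assert(heap != MAP_FAILED);
          heap[0] = 1;            /* fault in the first page */

          /* Release a sub-range; start and end stay inside this VMA. */
          madvise(heap + (4 << 20), 8 << 20, MADV_DONTNEED);
          return 0;
  }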

Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().

Many thanks to Lorenzo's work[1] on:
"Refactor the madvise() code to retain state about the locking mode
utilised for traversing VMAs.

Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."

One important point, as pointed out by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.

Until untagged_addr_remote() becomes atomic—which seems unlikely in
the near future—we cannot support per-VMA locks for remote processes.
So for now, only local processes are supported.

Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/madvise.c | 196 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 148 insertions(+), 48 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 56d9ca2557b9..a94e6a7ee387 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -48,38 +48,19 @@ struct madvise_walk_private {
 	bool pageout;
 };
 
+enum madvise_lock_mode {
+	MADVISE_NO_LOCK,
+	MADVISE_MMAP_READ_LOCK,
+	MADVISE_MMAP_WRITE_LOCK,
+	MADVISE_VMA_READ_LOCK,
+};
+
 struct madvise_behavior {
 	int behavior;
 	struct mmu_gather *tlb;
+	enum madvise_lock_mode lock_mode;
 };
 
-/*
- * Any behaviour which results in changes to the vma->vm_flags needs to
- * take mmap_lock for writing. Others, which simply traverse vmas, need
- * to only take it for reading.
- */
-static int madvise_need_mmap_write(int behavior)
-{
-	switch (behavior) {
-	case MADV_REMOVE:
-	case MADV_WILLNEED:
-	case MADV_DONTNEED:
-	case MADV_DONTNEED_LOCKED:
-	case MADV_COLD:
-	case MADV_PAGEOUT:
-	case MADV_FREE:
-	case MADV_POPULATE_READ:
-	case MADV_POPULATE_WRITE:
-	case MADV_COLLAPSE:
-	case MADV_GUARD_INSTALL:
-	case MADV_GUARD_REMOVE:
-		return 0;
-	default:
-		/* be safe, default to 1. list exceptions explicitly */
-		return 1;
-	}
-}
-
 #ifdef CONFIG_ANON_VMA_NAME
 struct anon_vma_name *anon_vma_name_alloc(const char *name)
 {
@@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
 	}
 }
 
+/*
+ * Try to acquire a VMA read lock if possible.
+ *
+ * We only support this lock over a single VMA, which the input range must
+ * span either partially or fully.
+ *
+ * This function always returns with an appropriate lock held. If a VMA read
+ * lock could be acquired, we return the locked VMA.
+ *
+ * If a VMA read lock could not be acquired, we return NULL and expect caller to
+ * fallback to mmap lock behaviour.
+ */
+static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior,
+		unsigned long start, unsigned long end)
+{
+	struct vm_area_struct *vma;
+
+	vma = lock_vma_under_rcu(mm, start);
+	if (!vma)
+		goto take_mmap_read_lock;
+	/*
+	 * Must span only a single VMA; uffd and remote processes are
+	 * unsupported.
+	 */
+	if (end > vma->vm_end || current->mm != mm ||
+	    userfaultfd_armed(vma)) {
+		vma_end_read(vma);
+		goto take_mmap_read_lock;
+	}
+	return vma;
+
+take_mmap_read_lock:
+	mmap_read_lock(mm);
+	madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
+	return NULL;
+}
+
 /*
  * Walk the vmas in range [start,end), and call the visit function on each one.
  * The visit function will get start and end parameters that cover the overlap
@@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
  */
 static
 int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
-		      unsigned long end, void *arg,
+		      unsigned long end, struct madvise_behavior *madv_behavior,
+		      void *arg,
 		      int (*visit)(struct vm_area_struct *vma,
 				   struct vm_area_struct **prev, unsigned long start,
 				   unsigned long end, void *arg))
@@ -1505,6 +1525,21 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 	struct vm_area_struct *prev;
 	unsigned long tmp;
 	int unmapped_error = 0;
+	int error;
+
+	/*
+	 * If VMA read lock is supported, apply madvise to a single VMA
+	 * tentatively, avoiding walking VMAs.
+	 */
+	if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+		vma = try_vma_read_lock(mm, madv_behavior, start, end);
+		if (vma) {
+			error = madvise_vma_behavior(vma, &prev, start, end,
+				madv_behavior);
+			vma_end_read(vma);
+			return error;
+		}
+	}
 
 	/*
 	 * If the interval [start,end) covers some unmapped address
@@ -1516,8 +1551,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 		prev = vma;
 
 	for (;;) {
-		int error;
-
 		/* Still start < end. */
 		if (!vma)
 			return -ENOMEM;
@@ -1598,34 +1631,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 	if (end == start)
 		return 0;
 
-	return madvise_walk_vmas(mm, start, end, anon_name,
+	return madvise_walk_vmas(mm, start, end, NULL, anon_name,
 				 madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
-static int madvise_lock(struct mm_struct *mm, int behavior)
+
+/*
+ * Any behaviour which results in changes to the vma->vm_flags needs to
+ * take mmap_lock for writing. Others, which simply traverse vmas, need
+ * to only take it for reading.
+ */
+static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
 {
+	int behavior = madv_behavior->behavior;
+
 	if (is_memory_failure(behavior))
-		return 0;
+		return MADVISE_NO_LOCK;
 
-	if (madvise_need_mmap_write(behavior)) {
+	switch (behavior) {
+	case MADV_REMOVE:
+	case MADV_WILLNEED:
+	case MADV_COLD:
+	case MADV_PAGEOUT:
+	case MADV_FREE:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
+		return MADVISE_MMAP_READ_LOCK;
+	case MADV_DONTNEED:
+	case MADV_DONTNEED_LOCKED:
+		return MADVISE_VMA_READ_LOCK;
+	default:
+		return MADVISE_MMAP_WRITE_LOCK;
+	}
+}
+
+static int madvise_lock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
+{
+	enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
+
+	switch (lock_mode) {
+	case MADVISE_NO_LOCK:
+		break;
+	case MADVISE_MMAP_WRITE_LOCK:
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
-	} else {
+		break;
+	case MADVISE_MMAP_READ_LOCK:
 		mmap_read_lock(mm);
+		break;
+	case MADVISE_VMA_READ_LOCK:
+		/* We will acquire the lock per-VMA in madvise_walk_vmas(). */
+		break;
 	}
+
+	madv_behavior->lock_mode = lock_mode;
 	return 0;
 }
 
-static void madvise_unlock(struct mm_struct *mm, int behavior)
+static void madvise_unlock(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
 {
-	if (is_memory_failure(behavior))
+	switch (madv_behavior->lock_mode) {
+	case  MADVISE_NO_LOCK:
 		return;
-
-	if (madvise_need_mmap_write(behavior))
+	case MADVISE_MMAP_WRITE_LOCK:
 		mmap_write_unlock(mm);
-	else
+		break;
+	case MADVISE_MMAP_READ_LOCK:
 		mmap_read_unlock(mm);
+		break;
+	case MADVISE_VMA_READ_LOCK:
+		/* We will drop the lock per-VMA in madvise_walk_vmas(). */
+		break;
+	}
+
+	madv_behavior->lock_mode = MADVISE_NO_LOCK;
 }
 
 static bool madvise_batch_tlb_flush(int behavior)
@@ -1710,6 +1795,21 @@ static bool is_madvise_populate(int behavior)
 	}
 }
 
+/*
+ * untagged_addr_remote() assumes mmap_lock is already held. On
+ * architectures like x86 and RISC-V, tagging is tricky because each
+ * mm may have a different tagging mask. However, we might only hold
+ * the per-VMA lock (currently only local processes are supported),
+ * so untagged_addr is used to avoid the mmap_lock assertion for
+ * local processes.
+ */
+static inline unsigned long get_untagged_addr(struct mm_struct *mm,
+		unsigned long start)
+{
+	return current->mm == mm ? untagged_addr(start) :
+				   untagged_addr_remote(mm, start);
+}
+
 static int madvise_do_behavior(struct mm_struct *mm,
 		unsigned long start, size_t len_in,
 		struct madvise_behavior *madv_behavior)
@@ -1721,7 +1821,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 
 	if (is_memory_failure(behavior))
 		return madvise_inject_error(behavior, start, start + len_in);
-	start = untagged_addr_remote(mm, start);
+	start = get_untagged_addr(mm, start);
 	end = start + PAGE_ALIGN(len_in);
 
 	blk_start_plug(&plug);
@@ -1729,7 +1829,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
 		error = madvise_populate(mm, start, end, behavior);
 	else
 		error = madvise_walk_vmas(mm, start, end, madv_behavior,
-					  madvise_vma_behavior);
+				madv_behavior, madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	return error;
 }
@@ -1817,13 +1917,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
-	error = madvise_lock(mm, behavior);
+	error = madvise_lock(mm, &madv_behavior);
 	if (error)
 		return error;
 	madvise_init_tlb(&madv_behavior, mm);
 	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	madvise_unlock(mm, &madv_behavior);
 
 	return error;
 }
@@ -1847,7 +1947,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 	total_len = iov_iter_count(iter);
 
-	ret = madvise_lock(mm, behavior);
+	ret = madvise_lock(mm, &madv_behavior);
 	if (ret)
 		return ret;
 	madvise_init_tlb(&madv_behavior, mm);
@@ -1880,8 +1980,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 
 			/* Drop and reacquire lock to unwind race. */
 			madvise_finish_tlb(&madv_behavior);
-			madvise_unlock(mm, behavior);
-			ret = madvise_lock(mm, behavior);
+			madvise_unlock(mm, &madv_behavior);
+			ret = madvise_lock(mm, &madv_behavior);
 			if (ret)
 				goto out;
 			madvise_init_tlb(&madv_behavior, mm);
@@ -1892,7 +1992,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
 	madvise_finish_tlb(&madv_behavior);
-	madvise_unlock(mm, behavior);
+	madvise_unlock(mm, &madv_behavior);
 
 out:
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  0:46 [PATCH v3] mm: use per_vma lock for MADV_DONTNEED Barry Song
@ 2025-06-07  4:04 ` Matthew Wilcox
  2025-06-07  4:53   ` Barry Song
  2025-06-07  6:36   ` Lorenzo Stoakes
  2025-06-07  7:22 ` Lorenzo Stoakes
  1 sibling, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2025-06-07  4:04 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng

On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().

This feels too complex to me.  Why do we defer grabbing the vma lock
so late, instead of grabbing it at the start like the fault handler does?




* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  4:04 ` Matthew Wilcox
@ 2025-06-07  4:53   ` Barry Song
  2025-06-07  7:24     ` Lorenzo Stoakes
  2025-06-07  6:36   ` Lorenzo Stoakes
  1 sibling, 1 reply; 9+ messages in thread
From: Barry Song @ 2025-06-07  4:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	Lorenzo Stoakes, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng

On Sat, Jun 7, 2025 at 4:05 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > userfaultfd_remove().
>
> This feels too complex to me.  Why do we defer grabbing the vma lock
> so late, instead of grabbing it at the start like the fault handler does?

Hi Matthew,

It looks like you missed the spot where your comment should have gone:
https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/
So I believe Lorenzo is the best person to respond to your concern.

In both v1 and v2 [1][2], we did try to fall back as early as possible:

[1] https://lore.kernel.org/linux-mm/20250527044145.13153-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/all/20250530104439.64841-1-21cnbao@gmail.com/

But that approach had its own problems:
* It's not extensible to other madvise operations.
* It's not easy to adapt to vector_madvise.

I also initially found the new approach too complex and tried a few
alternatives, but each had its own problems. In the end, Lorenzo's
solution still seems to be the cleanest among them.

I even forgot to move the code below back to visit() from
madvise_vma_behavior(). I had changed it while exploring an
alternative and should have reverted it.

+       if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
+               vma = try_vma_read_lock(mm, madv_behavior, start, end);
+               if (vma) {
+                       error = madvise_vma_behavior(vma, &prev, start, end,
+                               madv_behavior);  /* better to be visit() */
+                       vma_end_read(vma);
+                       return error;
+               }
+       }

Thanks
Barry



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  4:04 ` Matthew Wilcox
  2025-06-07  4:53   ` Barry Song
@ 2025-06-07  6:36   ` Lorenzo Stoakes
  1 sibling, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  6:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Barry Song, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng

On Sat, Jun 07, 2025 at 05:04:58AM +0100, Matthew Wilcox wrote:
> On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > userfaultfd_remove().
>
> This feels too complex to me.  Why do we defer grabbing the vma lock
> so late, instead of grabbing it at the start like the fault handler does?
>

The VMA lock is at the granularity of VMAs, so we grab it at the point we look
at VMAs.

The other locks are at the granularity of the virtual address space, so we grab
them up front.

Doing anything else either results in horrible hacks where we have to grab a
VMA, stash it in some state somewhere and pick it up later (and hope we never
screw up with a dangling ptr etc.), or requires inserting a horrible if () { /*
duplicated code */ } block.

In future we probably want to look at 'grabbing a bunch of VMAs by RCU', but it
felt prudent to implement this for a single VMA at a time.

Also doing it this way allows us to neatly fall back to an mmap read lock if a
VMA lock cannot be acquired for whatever reason.
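
Condensed from try_vma_read_lock() / madvise_walk_vmas() in the patch (a rough
sketch, not verbatim), the shape of the flow is:

	vma = lock_vma_under_rcu(mm, start);
	if (vma && end <= vma->vm_end && current->mm == mm &&
	    !userfaultfd_armed(vma)) {
		/* A single, local, non-uffd VMA: stay on the per-VMA lock. */
		error = visit(vma, &prev, start, end, arg);
		vma_end_read(vma);
		return error;
	}
	if (vma)
		vma_end_read(vma);
	/* Otherwise fall back to the mmap lock and the normal VMA walk. */
	mmap_read_lock(mm);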



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  0:46 [PATCH v3] mm: use per_vma lock for MADV_DONTNEED Barry Song
  2025-06-07  4:04 ` Matthew Wilcox
@ 2025-06-07  7:22 ` Lorenzo Stoakes
  2025-06-07  7:26   ` Lorenzo Stoakes
  2025-06-07 10:30   ` Barry Song
  1 sibling, 2 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  7:22 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng, Qi Zheng

A quick build/selftest run reveals nothing concerning.

On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.
>
> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.
>
> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
>
> Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.
>
> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().
>
> Many thanks to Lorenzo's work[1] on:
> "Refactor the madvise() code to retain state about the locking mode
> utilised for traversing VMAs.
>
> Then use this mechanism to permit VMA locking to be done later in the
> madvise() logic and also to allow altering of the locking mode to permit
> falling back to an mmap read lock if required."
>
> One important point, as pointed out by Jann[2], is that
> untagged_addr_remote() requires holding mmap_lock. This is because
> address tagging on x86 and RISC-V is quite complex.
>
> Until untagged_addr_remote() becomes atomic—which seems unlikely in
> the near future—we cannot support per-VMA locks for remote processes.
> So for now, only local processes are supported.

Nice commit message!

>
> Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Since I proposed the approach I am somewhat biased (you should get review from
others too! :) and I would really like Jann to confirm the untagged addr thing is
fine, but LGTM other than a thought/suggestion below, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---

Nit, but it would be nice to have a change summary here from v1 -> v3 with links to lore
for convenience.


>  mm/madvise.c | 196 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 148 insertions(+), 48 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56d9ca2557b9..a94e6a7ee387 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -48,38 +48,19 @@ struct madvise_walk_private {
>  	bool pageout;
>  };
>
> +enum madvise_lock_mode {
> +	MADVISE_NO_LOCK,
> +	MADVISE_MMAP_READ_LOCK,
> +	MADVISE_MMAP_WRITE_LOCK,
> +	MADVISE_VMA_READ_LOCK,
> +};
> +
>  struct madvise_behavior {
>  	int behavior;
>  	struct mmu_gather *tlb;
> +	enum madvise_lock_mode lock_mode;
>  };
>
> -/*
> - * Any behaviour which results in changes to the vma->vm_flags needs to
> - * take mmap_lock for writing. Others, which simply traverse vmas, need
> - * to only take it for reading.
> - */
> -static int madvise_need_mmap_write(int behavior)
> -{
> -	switch (behavior) {
> -	case MADV_REMOVE:
> -	case MADV_WILLNEED:
> -	case MADV_DONTNEED:
> -	case MADV_DONTNEED_LOCKED:
> -	case MADV_COLD:
> -	case MADV_PAGEOUT:
> -	case MADV_FREE:
> -	case MADV_POPULATE_READ:
> -	case MADV_POPULATE_WRITE:
> -	case MADV_COLLAPSE:
> -	case MADV_GUARD_INSTALL:
> -	case MADV_GUARD_REMOVE:
> -		return 0;
> -	default:
> -		/* be safe, default to 1. list exceptions explicitly */
> -		return 1;
> -	}
> -}
> -
>  #ifdef CONFIG_ANON_VMA_NAME
>  struct anon_vma_name *anon_vma_name_alloc(const char *name)
>  {
> @@ -1486,6 +1467,44 @@ static bool process_madvise_remote_valid(int behavior)
>  	}
>  }
>
> +/*
> + * Try to acquire a VMA read lock if possible.
> + *
> + * We only support this lock over a single VMA, which the input range must
> + * span either partially or fully.
> + *
> + * This function always returns with an appropriate lock held. If a VMA read
> + * lock could be acquired, we return the locked VMA.
> + *
> + * If a VMA read lock could not be acquired, we return NULL and expect caller to
> + * fallback to mmap lock behaviour.
> + */
> +static struct vm_area_struct *try_vma_read_lock(struct mm_struct *mm,
> +		struct madvise_behavior *madv_behavior,
> +		unsigned long start, unsigned long end)
> +{
> +	struct vm_area_struct *vma;
> +
> +	vma = lock_vma_under_rcu(mm, start);
> +	if (!vma)
> +		goto take_mmap_read_lock;
> +	/*
> +	 * Must span only a single VMA; uffd and remote processes are
> +	 * unsupported.
> +	 */
> +	if (end > vma->vm_end || current->mm != mm ||
> +	    userfaultfd_armed(vma)) {
> +		vma_end_read(vma);
> +		goto take_mmap_read_lock;
> +	}
> +	return vma;
> +
> +take_mmap_read_lock:
> +	mmap_read_lock(mm);
> +	madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
> +	return NULL;
> +}
> +
>  /*
>   * Walk the vmas in range [start,end), and call the visit function on each one.
>   * The visit function will get start and end parameters that cover the overlap
> @@ -1496,7 +1515,8 @@ static bool process_madvise_remote_valid(int behavior)
>   */
>  static
>  int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
> -		      unsigned long end, void *arg,
> +		      unsigned long end, struct madvise_behavior *madv_behavior,
> +		      void *arg,
>  		      int (*visit)(struct vm_area_struct *vma,
>  				   struct vm_area_struct **prev, unsigned long start,
>  				   unsigned long end, void *arg))
> @@ -1505,6 +1525,21 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
>  	struct vm_area_struct *prev;
>  	unsigned long tmp;
>  	int unmapped_error = 0;
> +	int error;
> +
> +	/*
> +	 * If VMA read lock is supported, apply madvise to a single VMA
> +	 * tentatively, avoiding walking VMAs.
> +	 */
> +	if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
> +		vma = try_vma_read_lock(mm, madv_behavior, start, end);
> +		if (vma) {
> +			error = madvise_vma_behavior(vma, &prev, start, end,
> +				madv_behavior);
> +			vma_end_read(vma);
> +			return error;
> +		}
> +	}
>
>  	/*
>  	 * If the interval [start,end) covers some unmapped address
> @@ -1516,8 +1551,6 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
>  		prev = vma;
>
>  	for (;;) {
> -		int error;
> -
>  		/* Still start < end. */
>  		if (!vma)
>  			return -ENOMEM;
> @@ -1598,34 +1631,86 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>  	if (end == start)
>  		return 0;
>
> -	return madvise_walk_vmas(mm, start, end, anon_name,
> +	return madvise_walk_vmas(mm, start, end, NULL, anon_name,
>  				 madvise_vma_anon_name);
>  }
>  #endif /* CONFIG_ANON_VMA_NAME */
>
> -static int madvise_lock(struct mm_struct *mm, int behavior)
> +
> +/*
> + * Any behaviour which results in changes to the vma->vm_flags needs to
> + * take mmap_lock for writing. Others, which simply traverse vmas, need
> + * to only take it for reading.
> + */
> +static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)

Sort of a nice-to-have/thought but:

Actually, when I proposed the refactor I wondered whether we'd use more state in
madv_behavior here, but it turns out we don't, so we may as well just switch back
to using int behavior here?

If we do that then we can adjust process_madvise_remote_valid() with:

 static bool process_madvise_remote_valid(int behavior)
 {
+	/* Due to lack of address untag atomicity, we need mmap lock. */
+	VM_WARN_ON_ONCE(madvise_lock(behavior) != MADVISE_VMA_READ_LOCK);
+
	switch (behavior) {
	case MADV_COLD:
	case MADV_PAGEOUT:

Which then self-documents this requirement for any extension to permitted remote
operations that might otherwise use a VMA lock.

We will already catch these via the mmap assert in untagged_addr_remote()
anyway, so they won't get missed, and your comment there explains things; this
whole suggestion is therefore optional.

>  {
> +	int behavior = madv_behavior->behavior;
> +
>  	if (is_memory_failure(behavior))
> -		return 0;
> +		return MADVISE_NO_LOCK;
>
> -	if (madvise_need_mmap_write(behavior)) {
> +	switch (behavior) {
> +	case MADV_REMOVE:
> +	case MADV_WILLNEED:
> +	case MADV_COLD:
> +	case MADV_PAGEOUT:
> +	case MADV_FREE:
> +	case MADV_POPULATE_READ:
> +	case MADV_POPULATE_WRITE:
> +	case MADV_COLLAPSE:
> +	case MADV_GUARD_INSTALL:
> +	case MADV_GUARD_REMOVE:
> +		return MADVISE_MMAP_READ_LOCK;
> +	case MADV_DONTNEED:
> +	case MADV_DONTNEED_LOCKED:
> +		return MADVISE_VMA_READ_LOCK;
> +	default:
> +		return MADVISE_MMAP_WRITE_LOCK;
> +	}
> +}
> +
> +static int madvise_lock(struct mm_struct *mm,
> +		struct madvise_behavior *madv_behavior)
> +{
> +	enum madvise_lock_mode lock_mode = get_lock_mode(madv_behavior);
> +
> +	switch (lock_mode) {
> +	case MADVISE_NO_LOCK:
> +		break;
> +	case MADVISE_MMAP_WRITE_LOCK:
>  		if (mmap_write_lock_killable(mm))
>  			return -EINTR;
> -	} else {
> +		break;
> +	case MADVISE_MMAP_READ_LOCK:
>  		mmap_read_lock(mm);
> +		break;
> +	case MADVISE_VMA_READ_LOCK:
> +		/* We will acquire the lock per-VMA in madvise_walk_vmas(). */
> +		break;
>  	}
> +
> +	madv_behavior->lock_mode = lock_mode;
>  	return 0;
>  }
>
> -static void madvise_unlock(struct mm_struct *mm, int behavior)
> +static void madvise_unlock(struct mm_struct *mm,
> +		struct madvise_behavior *madv_behavior)
>  {
> -	if (is_memory_failure(behavior))
> +	switch (madv_behavior->lock_mode) {
> +	case  MADVISE_NO_LOCK:
>  		return;
> -
> -	if (madvise_need_mmap_write(behavior))
> +	case MADVISE_MMAP_WRITE_LOCK:
>  		mmap_write_unlock(mm);
> -	else
> +		break;
> +	case MADVISE_MMAP_READ_LOCK:
>  		mmap_read_unlock(mm);
> +		break;
> +	case MADVISE_VMA_READ_LOCK:
> +		/* We will drop the lock per-VMA in madvise_walk_vmas(). */
> +		break;
> +	}
> +
> +	madv_behavior->lock_mode = MADVISE_NO_LOCK;
>  }
>
>  static bool madvise_batch_tlb_flush(int behavior)
> @@ -1710,6 +1795,21 @@ static bool is_madvise_populate(int behavior)
>  	}
>  }
>
> +/*
> + * untagged_addr_remote() assumes mmap_lock is already held. On
> + * architectures like x86 and RISC-V, tagging is tricky because each
> + * mm may have a different tagging mask. However, we might only hold
> + * the per-VMA lock (currently only local processes are supported),
> + * so untagged_addr is used to avoid the mmap_lock assertion for
> + * local processes.
> + */
> +static inline unsigned long get_untagged_addr(struct mm_struct *mm,
> +		unsigned long start)
> +{
> +	return current->mm == mm ? untagged_addr(start) :
> +				   untagged_addr_remote(mm, start);

untagged_addr_remote() has an assert on the mmap lock, so this should neatly
catch any case where we VMA lock w/o the mmap lock, which is nice.

> +}
> +
>  static int madvise_do_behavior(struct mm_struct *mm,
>  		unsigned long start, size_t len_in,
>  		struct madvise_behavior *madv_behavior)
> @@ -1721,7 +1821,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>
>  	if (is_memory_failure(behavior))
>  		return madvise_inject_error(behavior, start, start + len_in);
> -	start = untagged_addr_remote(mm, start);
> +	start = get_untagged_addr(mm, start);
>  	end = start + PAGE_ALIGN(len_in);
>
>  	blk_start_plug(&plug);
> @@ -1729,7 +1829,7 @@ static int madvise_do_behavior(struct mm_struct *mm,
>  		error = madvise_populate(mm, start, end, behavior);
>  	else
>  		error = madvise_walk_vmas(mm, start, end, madv_behavior,
> -					  madvise_vma_behavior);
> +				madv_behavior, madvise_vma_behavior);
>  	blk_finish_plug(&plug);
>  	return error;
>  }
> @@ -1817,13 +1917,13 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
> -	error = madvise_lock(mm, behavior);
> +	error = madvise_lock(mm, &madv_behavior);
>  	if (error)
>  		return error;
>  	madvise_init_tlb(&madv_behavior, mm);
>  	error = madvise_do_behavior(mm, start, len_in, &madv_behavior);
>  	madvise_finish_tlb(&madv_behavior);
> -	madvise_unlock(mm, behavior);
> +	madvise_unlock(mm, &madv_behavior);
>
>  	return error;
>  }
> @@ -1847,7 +1947,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>  	total_len = iov_iter_count(iter);
>
> -	ret = madvise_lock(mm, behavior);
> +	ret = madvise_lock(mm, &madv_behavior);
>  	if (ret)
>  		return ret;
>  	madvise_init_tlb(&madv_behavior, mm);
> @@ -1880,8 +1980,8 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>
>  			/* Drop and reacquire lock to unwind race. */
>  			madvise_finish_tlb(&madv_behavior);
> -			madvise_unlock(mm, behavior);
> -			ret = madvise_lock(mm, behavior);
> +			madvise_unlock(mm, &madv_behavior);
> +			ret = madvise_lock(mm, &madv_behavior);
>  			if (ret)
>  				goto out;
>  			madvise_init_tlb(&madv_behavior, mm);
> @@ -1892,7 +1992,7 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
>  	madvise_finish_tlb(&madv_behavior);
> -	madvise_unlock(mm, behavior);
> +	madvise_unlock(mm, &madv_behavior);
>
>  out:
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> --
> 2.39.3 (Apple Git-146)
>



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  4:53   ` Barry Song
@ 2025-06-07  7:24     ` Lorenzo Stoakes
  0 siblings, 0 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  7:24 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Lokesh Gidra, Tangquan Zheng, Qi Zheng

On Sat, Jun 07, 2025 at 04:53:02PM +1200, Barry Song wrote:
> On Sat, Jun 7, 2025 at 4:05 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> > > To simplify handling, the implementation falls back to the standard
> > > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > > userfaultfd_remove().
> >
> > This feels too complex to me.  Why do we defer grabbing the vma lock
> > so late, instead of grabbing it at the start like the fault handler does?
>
> Hi Matthew,
>
> It looks like you missed the spot where your comment should have gone:
> https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/
> So I believe Lorenzo is the best person to respond to your concern.
>
> In both v1 and v2 [1][2], we did try to fall back as early as possible:
>
> [1] https://lore.kernel.org/linux-mm/20250527044145.13153-1-21cnbao@gmail.com/
> [2] https://lore.kernel.org/all/20250530104439.64841-1-21cnbao@gmail.com/
>
> But that approach had its own problems:
> * It's not extensible to other madvise operations.
> * It's not easy to adapt to vector_madvise.
>
> I also initially found the new approach too complex and tried a few
> alternatives, but each had its own problems. In the end, Lorenzo's
> solution still seems to be the cleanest among them.
>
> I even forgot to move the code below back to visit() from
> madvise_vma_behavior(). I had changed it while exploring an
> alternative and should have reverted it.
>
> +       if (madv_behavior && madv_behavior->lock_mode ==
> MADVISE_VMA_READ_LOCK) {
> +               vma = try_vma_read_lock(mm, madv_behavior, start, end);
> +               if (vma) {
> +                       error = madvise_vma_behavior(vma, &prev, start, end,
> +                               madv_behavior);  /* better to be visit() */
> +                       vma_end_read(vma);
> +                       return error;
> +               }
> +       }

Ah damn, missed that :) and I just added my tag, haha - yeah, that should be visit().

I hate this pattern, maybe I'll refactor it in future... it exists entirely for
this rather hacked-in anon_vma_name implementation :(

>
> Thanks
> Barry



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  7:22 ` Lorenzo Stoakes
@ 2025-06-07  7:26   ` Lorenzo Stoakes
  2025-06-07  9:33     ` Barry Song
  2025-06-07 10:30   ` Barry Song
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07  7:26 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng, Qi Zheng

On Sat, Jun 07, 2025 at 08:22:45AM +0100, Lorenzo Stoakes wrote:
> A quick build/selftest run reveals nothing concerning.
>
> On Sat, Jun 07, 2025 at 12:46:23PM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.
> >
> > Currently, the mmap_lock is always held during these operations, even when
> > unnecessary. This causes lock contention and can lead to severe priority
> > inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> > hold the lock and block higher-priority threads.
> >
> > This patch enables the use of per-VMA locks when the advised range lies
> > entirely within a single VMA, avoiding the need for full VMA traversal. In
> > practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.
> >
> > Tangquan’s testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.
> >
> > To simplify handling, the implementation falls back to the standard
> > mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> > userfaultfd_remove().
> >
> > Many thanks to Lorenzo's work[1] on:
> > "Refactor the madvise() code to retain state about the locking mode
> > utilised for traversing VMAs.
> >
> > Then use this mechanism to permit VMA locking to be done later in the
> > madvise() logic and also to allow altering of the locking mode to permit
> > falling back to an mmap read lock if required."
> >
> > One important point, as pointed out by Jann[2], is that
> > untagged_addr_remote() requires holding mmap_lock. This is because
> > address tagging on x86 and RISC-V is quite complex.
> >
> > Until untagged_addr_remote() becomes atomic—which seems unlikely in
> > the near future—we cannot support per-VMA locks for remote processes.
> > So for now, only local processes are supported.
>
> Nice commit message!
>
> >
> > Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1]
> > Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2]
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Cc: Qi Zheng <zhengqi.arch@bytedance.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> Since I proposed the approach I am somewhat biased (you should get review from
> others too! :) and I would really like Jann to confirm the untagged addr ting is
> fine, but LGTM other than a thought/suggestion below, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Whoops, I should add R-b only applies with the fix-patch proposed in [0].

[0]: https://lore.kernel.org/all/CAGsJ_4x1RbQ+GKKc1rrTaNA8Xd+W8K-Zu6-kwVYNKzB0OWiowQ@mail.gmail.com/

Barry - Probably worth respinning a v4 to pick that up to make things
clear, you can propagate my tag with that.

Cheers, Lorenzo

[snip]



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  7:26   ` Lorenzo Stoakes
@ 2025-06-07  9:33     ` Barry Song
  0 siblings, 0 replies; 9+ messages in thread
From: Barry Song @ 2025-06-07  9:33 UTC (permalink / raw)
  To: lorenzo.stoakes, akpm
  Cc: 21cnbao, Liam.Howlett, david, jannh, linux-kernel, linux-mm,
	lokeshgidra, surenb, v-songbaohua, vbabka, zhengqi.arch,
	zhengtangquan

>
> Whoops, I should add R-b only applies with the fix-patch proposed in [0].
>
> [0]: https://lore.kernel.org/all/CAGsJ_4x1RbQ+GKKc1rrTaNA8Xd+W8K-Zu6-kwVYNKzB0OWiowQ@mail.gmail.com/
>
> Barry - Probably worth respinning a v4 to pick that up to make things
> clear, you can propagate my tag with that.

Sure. It's a small, non-functional change:

diff --git a/mm/madvise.c b/mm/madvise.c
index a94e6a7ee387..8382614b71d1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1534,8 +1534,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 	if (madv_behavior && madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK) {
 		vma = try_vma_read_lock(mm, madv_behavior, start, end);
 		if (vma) {
-			error = madvise_vma_behavior(vma, &prev, start, end,
-				madv_behavior);
+			error = visit(vma, &prev, start, end, arg);
 			vma_end_read(vma);
 			return error;
 		}

>
> Cheers, Lorenzo

Thanks
Barry



* Re: [PATCH v3] mm: use per_vma lock for MADV_DONTNEED
  2025-06-07  7:22 ` Lorenzo Stoakes
  2025-06-07  7:26   ` Lorenzo Stoakes
@ 2025-06-07 10:30   ` Barry Song
  1 sibling, 0 replies; 9+ messages in thread
From: Barry Song @ 2025-06-07 10:30 UTC (permalink / raw)
  To: lorenzo.stoakes
  Cc: 21cnbao, Liam.Howlett, akpm, david, jannh, linux-kernel, linux-mm,
	lokeshgidra, surenb, v-songbaohua, vbabka, zhengqi.arch,
	zhengtangquan

> Sort of a nice-to-have/thought but:
>
> Actually, when I proposed the refactor I wondered whether we'd use more state in
> madv_behaviour here but turns out we don't so we may as well just switch back to
> using int behavior here?
>
> If we do that then we can adjust process_madvise_remote_valid() with:
>
>  static bool process_madvise_remote_valid(int behavior)
>  {
> +       /* Due to lack of address untag atomicity, we need mmap lock. */
> +       VM_WARN_ON_ONCE(madvise_lock(behavior) != MADVISE_VMA_READ_LOCK);


process_madvise_remote_valid() is called before vector_madvise(), so I'm not
sure what this code is supposed to do. Are you trying to do something like:

VM_WARN_ON_ONCE(get_lock_mode(behavior) == MADVISE_VMA_READ_LOCK);

If so, that seems problematic — the same madvise operation might be allowed
to use the per-VMA lock for local processes, but disallowed for remote ones.

I suppose this will only start to make sense after we support per-VMA locking
for remote madvise operations such as "case MADV_XXX":

diff --git a/mm/madvise.c b/mm/madvise.c
index 8382614b71d1..9815445284d5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1641,7 +1641,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  * take mmap_lock for writing. Others, which simply traverse vmas, need
  * to only take it for reading.
  */
-static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
+static enum madvise_lock_mode get_lock_mode(struct mm_struct *mm,
+		struct madvise_behavior *madv_behavior)
 {
 	int behavior = madv_behavior->behavior;
 
@@ -1659,6 +1660,9 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_COLLAPSE:
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
		...
+ 	case MADV_XXX:
+		return current->mm == mm ? MADVISE_VMA_READ_LOCK :
+				MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:

Thanks
Barry



