* [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
@ 2025-05-27  4:41 Barry Song
  2025-05-27  9:20 ` Lorenzo Stoakes
  0 siblings, 1 reply; 6+ messages in thread

From: Barry Song @ 2025-05-27  4:41 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, Barry Song, Liam R. Howlett, Lorenzo Stoakes,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

From: Barry Song <v-songbaohua@oppo.com>

Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.

Tangquan's testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().

Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..da016a1d0434 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
+
+	/*
+	 * MADV_DONTNEED is commonly used with userspace heaps and most often
+	 * affects a single VMA. In these cases, we can use per-VMA locks to
+	 * reduce contention on the mmap_lock.
+	 */
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
+		struct vm_area_struct *prev, *vma;
+		unsigned long untagged_start, end;
+
+		untagged_start = untagged_addr(start);
+		end = untagged_start + len_in;
+		vma = lock_vma_under_rcu(mm, untagged_start);
+		if (!vma)
+			goto lock;
+		if (end > vma->vm_end || userfaultfd_armed(vma)) {
+			vma_end_read(vma);
+			goto lock;
+		}
+		if (unlikely(!can_modify_vma_madv(vma, behavior))) {
+			error = -EPERM;
+			vma_end_read(vma);
+			goto out;
+		}
+		madvise_init_tlb(&madv_behavior, mm);
+		error = madvise_dontneed_free(vma, &prev, untagged_start,
+					      end, &madv_behavior);
+		madvise_finish_tlb(&madv_behavior);
+		vma_end_read(vma);
+		goto out;
+	}
+
+lock:
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
@@ -1825,6 +1858,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
+out:
 	return error;
 }

-- 
2.39.3 (Apple Git-146)

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  4:41 [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED Barry Song
@ 2025-05-27  9:20 ` Lorenzo Stoakes
  2025-05-27 20:40   ` Lokesh Gidra
  2025-05-28  9:36   ` Barry Song
  0 siblings, 2 replies; 6+ messages in thread

From: Lorenzo Stoakes @ 2025-05-27  9:20 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

Overall - thanks for this, and I'm not sure why we didn't think of doing
this sooner :P this seems like a super valid thing to try to use the vma
lock with.

I see you've cc'd Suren who has the most expertise in this and can
hopefully audit this and ensure all is good, but from the process address
doc (see below), I think we're good to just have the VMA stabilised for a
zap.

On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.

Ack yeah, I have gathered that this is the case previously.

> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.

That's very nasty... we definitely want to eliminate as much mmap_lock
contention as possible.

> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.

Yeah this single VMA requirement is obviously absolutely key.

As per my docs [0] actually, for zapping a single VMA, 'The VMA need only be
kept stable for this operation.' (I had to look this up to remind myself :P)

[0]: https://kernel.org/doc/html/latest/mm/process_addrs.html

So we actually... should be good here, locking-wise.

> Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.

Thanks, this sounds really promising!

I take it then you have as a result, heavily tested this change?

> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().

Oh GOD do I hate how we implement uffd. Have I ever mentioned that? Well,
let me mention it again...

[snip]

> @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
> +
> +	/*
> +	 * MADV_DONTNEED is commonly used with userspace heaps and most often
> +	 * affects a single VMA. In these cases, we can use per-VMA locks to
> +	 * reduce contention on the mmap_lock.
> +	 */
> +	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {

So firstly doing this here means process_madvise() doesn't get this benefit, and
we're inconsistent between the two which we really want to avoid.

But secondly - we definitely need to find a better way to do this :) this
basically follows the 'ignore the existing approach and throw in an if
(special case) { ... }' pattern that I feel we really need to do all we can
to avoid in the kernel.

This lies the way of uffd, hugetlb, and thus horrors beyond imagining.

I can see why you did this as this is kind of special-cased a bit, and we
already do this kind of thing all over the place but let's try to avoid
this here.

So I suggest:

- Remove any code for this from do_madvise() and thus make it available to
  process_madvise() also.

- Try to avoid the special casing here as much as humanly possible :)

- Update madvise_lock()/unlock() to get passed a pointer to struct
  madvise_behavior to which we can add a boolean or even better I think -
  an enum indicating which lock type was taken (this can simplify
  madvise_unlock() also).

- Update madvise_lock() to do all of the checks below, we already
  effectively do a switch (behavior) so it's not so crazy to do this. And
  you can also do the fallthrough logic there.

- Obviously madvise_unlock() can be updated to do vma_end_read().
[snip]

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  9:20 ` Lorenzo Stoakes
@ 2025-05-27 20:40   ` Lokesh Gidra
  2025-05-28  9:01     ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread

From: Lokesh Gidra @ 2025-05-27 20:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Tangquan Zheng

Thanks a lot, Barry. This was overdue.

On Tue, May 27, 2025 at 2:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Overall - thanks for this, and I'm not sure why we didn't think of doing
> this sooner :P this seems like a super valid thing to try to use the vma
> lock with.
>
[snip]
>
> On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.

Actually, it would be great if you can also optimize the MADV_FREE
case. There are quite a few performance critical situations where
MADV_FREE is performed and are done within a single VMA.

[snip]

> > Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.

I am quite confident that this pattern is not limited to
HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also
be calling MADV_DONTNEED within single VMAs.

[snip]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27 20:40   ` Lokesh Gidra
@ 2025-05-28  9:01     ` Barry Song
  0 siblings, 0 replies; 6+ messages in thread

From: Barry Song @ 2025-05-28  9:01 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Lorenzo Stoakes, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Tangquan Zheng

On Wed, May 28, 2025 at 4:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> Thanks a lot, Barry. This was overdue.
>
[snip]
>
> Actually, it would be great if you can also optimize the MADV_FREE
> case. There are quite a few performance critical situations where
> MADV_FREE is performed and are done within a single VMA.

Yes, that was definitely on the list. However, the reason MADV_FREE
wasn't included in this patch is because madvise_free_single_vma()
calls walk_page_range(), which requires the mmap_lock to locate the
VMA using find_vma()—which feels redundant and awkward, since we
already have the VMA. We should be able to use walk_page_range_vma()
instead, given that the VMA is already known.

But even walk_page_range_vma() asserts that the mmap_lock is held, so
the issue remains.

int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			void *private)
{
	...
	process_mm_walk_lock(walk.mm, ops->walk_lock);
	...
	return __walk_page_range(start, end, &walk);
}

MADV_DONTNEED doesn't use walk_page_range() at all. So I'd prefer to
settle MADV_DONTNEED first as the initial step. We need more time to
investigate the behavior and requirements of walk_page_range().

[snip]

> I am quite confident that this pattern is not limited to
> HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also
> be calling MADV_DONTNEED within single VMAs.

That's right. We've also optimized the native heaps.

[snip]

Thanks
Barry

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  9:20 ` Lorenzo Stoakes
  2025-05-27 20:40   ` Lokesh Gidra
@ 2025-05-28  9:36   ` Barry Song
  2025-05-28  9:43     ` Lorenzo Stoakes
  1 sibling, 1 reply; 6+ messages in thread

From: Barry Song @ 2025-05-28  9:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

On Tue, May 27, 2025 at 5:20 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
[snip]
>
> > Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.
>
> Thanks, this sounds really promising!
>
> I take it then you have as a result, heavily tested this change?

This was extensively tested on an older Android kernel with real devices.
As you know, running the latest mm-unstable on phones is challenging due
to hardware driver constraints. However, I believe the reported percentage
is accurate, since it seems pointless for userspace heaps to free memory
across two or more VMAs. How could it possibly manage a slab-like system
spanning multiple VMAs?

[snip]

> So I suggest:
>
> - Remove any code for this from do_madvise() and thus make it available to
>   process_madvise() also.
>
> - Try to avoid the special casing here as much as humanly possible :)
>
> - Update madvise_lock()/unlock() to get passed a pointer to struct
>   madvise_behavior to which we can add a boolean or even better I think -
>   an enum indicating which lock type was taken (this can simplify
>   madvise_unlock() also).
>
> - Update madvise_lock() to do all of the checks below, we already
>   effectively do a switch (behavior) so it's not so crazy to do this. And
>   you can also do the fallthrough logic there.
>
> - Obviously madvise_unlock() can be updated to do vma_end_read().

I've definitely considered refactoring madvise_lock, madvise_do_behavior,
and madvise_unlock to encapsulate the details of the per-VMA locking and
mmap_lock handling within those functions:

	madvise_lock(mm, behavior);
	madvise_do_behavior(mm, start, len_in, behavior);
	madvise_unlock(mm, behavior);

However, I'm a bit concerned that this approach might make the code messier
by introducing extra arguments to handle different code paths. For
instance, madvise_do_behavior might need an additional parameter to
determine whether VMA traversal via madvise_walk_vmas is necessary.

That said, I'll give it a try and see if it actually turns out to be as
ugly as I fear.

[snip]

Thanks
Barry

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-28  9:36 ` Barry Song
@ 2025-05-28  9:43   ` Lorenzo Stoakes
  0 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-05-28  9:43 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

On Wed, May 28, 2025 at 05:36:40PM +0800, Barry Song wrote:
> On Tue, May 27, 2025 at 5:20 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:

[snip]

> >
> > Thanks, this sounds really promising!
> >
> > I take it then you have as a result, heavily tested this change?
>
> This was extensively tested on an older Android kernel with real devices.
> As you know, running the latest mm-unstable on phones is challenging due
> to hardware driver constraints. However, I believe the reported percentage
> is accurate, since it seems pointless for userspace heaps to free memory
> across two or more VMAs. How could it possibly manage a slab-like system
> spanning multiple VMAs?

Right, yeah. mremap()'ing anon memory around might get you unexpected
splits, but generally speaking you'd expect them to be single VMAs.

[snip]

> > >  mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 8433ac9b27e0..da016a1d0434 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > >
> > >  	if (madvise_should_skip(start, len_in, behavior, &error))
> > >  		return error;
> > > +
> > > +	/*
> > > +	 * MADV_DONTNEED is commonly used with userspace heaps and most often
> > > +	 * affects a single VMA. In these cases, we can use per-VMA locks to
> > > +	 * reduce contention on the mmap_lock.
> > > +	 */
> > > +	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
> >
> > So firstly doing this here means process_madvise() doesn't get this benefit, and
> > we're inconsistent between the two which we really want to avoid.
> >
> > But secondly - we definitely need to find a better way to do this :) this
> > basically follows the 'ignore the existing approach and throw in an if
> > (special case) { ... }' pattern that I feel we really need to do all we can
> > to avoid in the kernel.
> >
> > This lies the way of uffd, hugetlb, and thus horrors beyond imagining.
> >
> > I can see why you did this as this is kind of special-cased a bit, and we
> > already do this kind of thing all over the place but let's try to avoid
> > this here.
> >
> > So I suggest:
> >
> > - Remove any code for this from do_madvise() and thus make it available to
> >   process_madvise() also.
> >
> > - Try to avoid the special casing here as much as humanly possible :)
> >
> > - Update madvise_lock()/unlock() to get passed a pointer to struct
> >   madvise_behavior to which we can add a boolean or even better I think -
> >   an enum indicating which lock type was taken (this can simplify
> >   madvise_unlock() also).
> >
> > - Update madvise_lock() to do all of the checks below, we already
> >   effectively do a switch (behavior) so it's not so crazy to do this. And
> >   you can also do the fallthrough logic there.
> >
> > - Obviously madvise_unlock() can be updated to do vma_end_read().
>
> I’ve definitely considered refactoring madvise_lock, madvise_do_behavior,
> and madvise_unlock to encapsulate the details of the per-VMA locking and
> mmap_lock handling within those functions:
>
>         madvise_lock(mm, behavior);
>         madvise_do_behavior(mm, start, len_in, behavior);
>         madvise_unlock(mm, behavior);
>
> However, I’m a bit concerned that this approach might make the code messier
> by introducing extra arguments to handle different code paths.
> For instance,
> madvise_do_behavior might need an additional parameter to determine whether
> VMA traversal via madvise_walk_vmas is necessary.
>
> That said, I’ll give it a try and see if it actually turns out to be as ugly
> as I fear.

Your reticence is understandable, this code can be pretty hairy, however
thanks to SJ's refactoring efforts you can now use the helper struct he
introduced - madvise_behavior - to store this state, so you don't have to
pass parameters really _much_ more than now.

Unfortunately the current version of the patch isn't mergeable as it
stands, so it really does have to be sensibly generalised, however it ends
up looking.

To be clear - I appreciate all your efforts, and sorry to add some extra
work here, but in the long run it'll be worthwhile I feel, and will lay
the foundations for future use of VMA locks :)

Cheers, Lorenzo

[snip]
end of thread, other threads:[~2025-05-28  9:44 UTC | newest]

Thread overview: 6+ messages
2025-05-27  4:41 [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED Barry Song
2025-05-27  9:20 ` Lorenzo Stoakes
2025-05-27 20:40   ` Lokesh Gidra
2025-05-28  9:01     ` Barry Song
2025-05-28  9:36     ` Barry Song
2025-05-28  9:43       ` Lorenzo Stoakes