* [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
@ 2025-05-27  4:41 Barry Song
  2025-05-27  9:20 ` Lorenzo Stoakes
  0 siblings, 1 reply; 6+ messages in thread

From: Barry Song @ 2025-05-27  4:41 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, Barry Song, Liam R. Howlett, Lorenzo Stoakes,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

From: Barry Song <v-songbaohua@oppo.com>

Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.

Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's HeapTaskDaemon—
hold the lock and block higher-priority threads.

This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal. In
practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.

Tangquan's testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.

To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().

Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..da016a1d0434 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 
 	if (madvise_should_skip(start, len_in, behavior, &error))
 		return error;
+
+	/*
+	 * MADV_DONTNEED is commonly used with userspace heaps and most often
+	 * affects a single VMA. In these cases, we can use per-VMA locks to
+	 * reduce contention on the mmap_lock.
+	 */
+	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
+		struct vm_area_struct *prev, *vma;
+		unsigned long untagged_start, end;
+
+		untagged_start = untagged_addr(start);
+		end = untagged_start + len_in;
+		vma = lock_vma_under_rcu(mm, untagged_start);
+		if (!vma)
+			goto lock;
+		if (end > vma->vm_end || userfaultfd_armed(vma)) {
+			vma_end_read(vma);
+			goto lock;
+		}
+		if (unlikely(!can_modify_vma_madv(vma, behavior))) {
+			error = -EPERM;
+			vma_end_read(vma);
+			goto out;
+		}
+		madvise_init_tlb(&madv_behavior, mm);
+		error = madvise_dontneed_free(vma, &prev, untagged_start,
+					      end, &madv_behavior);
+		madvise_finish_tlb(&madv_behavior);
+		vma_end_read(vma);
+		goto out;
+	}
+
+lock:
 	error = madvise_lock(mm, behavior);
 	if (error)
 		return error;
@@ -1825,6 +1858,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	madvise_finish_tlb(&madv_behavior);
 	madvise_unlock(mm, behavior);
 
+out:
 	return error;
 }

-- 
2.39.3 (Apple Git-146)

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  4:41 [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED Barry Song
@ 2025-05-27  9:20 ` Lorenzo Stoakes
  2025-05-27 20:40   ` Lokesh Gidra
  2025-05-28  9:36   ` Barry Song
  0 siblings, 2 replies; 6+ messages in thread

From: Lorenzo Stoakes @ 2025-05-27  9:20 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

Overall - thanks for this, and I'm not sure why we didn't think of doing
this sooner :P this seems like a super valid thing to try to use the vma
lock with.

I see you've cc'd Suren who has the most expertise in this and can
hopefully audit this and ensure all is good, but from the process address
doc (see below), I think we're good to just have the VMA stabilised for a
zap.

On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Certain madvise operations, especially MADV_DONTNEED, occur far more
> frequently than other madvise options, particularly in native and Java
> heaps for dynamic memory management.

Ack yeah, I have gathered that this is the case previously.

> Currently, the mmap_lock is always held during these operations, even when
> unnecessary. This causes lock contention and can lead to severe priority
> inversion, where low-priority threads—such as Android's HeapTaskDaemon—
> hold the lock and block higher-priority threads.

That's very nasty... we definitely want to eliminate as much mmap_lock
contention as possible.

> This patch enables the use of per-VMA locks when the advised range lies
> entirely within a single VMA, avoiding the need for full VMA traversal. In
> practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs.

Yeah this single VMA requirement is obviously absolutely key.

As per my docs [0] actually, for zapping a single VMA, 'The VMA need only be
kept stable for this operation.' (I had to look this up to remind myself :P)

[0]: https://kernel.org/doc/html/latest/mm/process_addrs.html

So we actually... should be good here, locking-wise.

> Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> benefits from this per-VMA lock optimization. After extended runtime,
> 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> only 1,231 fell back to mmap_lock.

Thanks, this sounds really promising!

I take it then you have as a result, heavily tested this change?

> To simplify handling, the implementation falls back to the standard
> mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
> userfaultfd_remove().

Oh GOD do I hate how we implement uffd. Have I ever mentioned that? Well,
let me mention it again...

[snip]

> @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>
>  	if (madvise_should_skip(start, len_in, behavior, &error))
>  		return error;
> +
> +	/*
> +	 * MADV_DONTNEED is commonly used with userspace heaps and most often
> +	 * affects a single VMA. In these cases, we can use per-VMA locks to
> +	 * reduce contention on the mmap_lock.
> +	 */
> +	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {

So firstly doing this here means process_madvise() doesn't get this benefit, and
we're inconsistent between the two which we really want to avoid.

But secondly - we definitely need to find a better way to do this :) this
basically follows the 'ignore the existing approach and throw in an if
(special case) { ... }' pattern that I feel we really need to do all we can
to avoid in the kernel.

This lies the way of uffd, hugetlb, and thus horrors beyond imagining.

I can see why you did this as this is kind of special-cased a bit, and we
already do this kind of thing all over the place but let's try to avoid
this here.

So I suggest:

- Remove any code for this from do_madvise() and thus make it available to
  process_madvise() also.

- Try to avoid the special casing here as much as humanly possible :)

- Update madvise_lock()/unlock() to get passed a pointer to struct
  madvise_behavior to which we can add a boolean or even better I think -
  an enum indicating which lock type was taken (this can simplify
  madvise_unlock() also).

- Update madvise_lock() to do all of the checks below, we already
  effectively do a switch (behavior) so it's not so crazy to do this. And
  you can also do the fallthrough logic there.

- Obviously madvise_unlock() can be updated to do vma_end_read().
[snip]

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  9:20 ` Lorenzo Stoakes
@ 2025-05-27 20:40   ` Lokesh Gidra
  2025-05-28  9:01     ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread

From: Lokesh Gidra @ 2025-05-27 20:40 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Tangquan Zheng

Thanks a lot, Barry. This was overdue.

On Tue, May 27, 2025 at 2:20 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Overall - thanks for this, and I'm not sure why we didn't think of doing
> this sooner :P this seems like a super valid thing to try to use the vma
> lock with.
>
[snip]
>
> On Tue, May 27, 2025 at 04:41:45PM +1200, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Certain madvise operations, especially MADV_DONTNEED, occur far more
> > frequently than other madvise options, particularly in native and Java
> > heaps for dynamic memory management.

Actually, it would be great if you can also optimize the MADV_FREE
case. There are quite a few performance critical situations where
MADV_FREE is performed and are done within a single VMA.

[snip]

> > Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.

I am quite confident that this pattern is not limited to
HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also
be calling MADV_DONTNEED within single VMAs.

[snip]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27 20:40   ` Lokesh Gidra
@ 2025-05-28  9:01     ` Barry Song
  0 siblings, 0 replies; 6+ messages in thread

From: Barry Song @ 2025-05-28  9:01 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Lorenzo Stoakes, akpm, linux-mm, linux-kernel, Barry Song,
	Liam R. Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Suren Baghdasaryan, Tangquan Zheng

On Wed, May 28, 2025 at 4:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> Thanks a lot, Barry. This was overdue.
>
[snip]
>
> Actually, it would be great if you can also optimize the MADV_FREE
> case. There are quite a few performance critical situations where
> MADV_FREE is performed and are done within a single VMA.

Yes, that was definitely on the list. However, the reason MADV_FREE
wasn't included in this patch is because madvise_free_single_vma()
calls walk_page_range(), which requires the mmap_lock to locate the
VMA using find_vma()—which feels redundant and awkward, since we
already have the VMA. We should be able to use walk_page_range_vma()
instead, given that the VMA is already known.

But even walk_page_range_vma() asserts that the mmap_lock is held, so
the issue remains.

int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			void *private)
{
	...
	process_mm_walk_lock(walk.mm, ops->walk_lock);
	...
	return __walk_page_range(start, end, &walk);
}

MADV_DONTNEED doesn't use walk_page_range() at all. So I'd prefer to
settle MADV_DONTNEED first as the initial step. We need more time to
investigate the behavior and requirements of walk_page_range().

[snip]

> I am quite confident that this pattern is not limited to
> HeapTaskDaemon. For instance, I'm sure the Scudo allocator must also
> be calling MADV_DONTNEED within single VMAs.

That's right. We've also optimized the native heaps.

[snip]

Thanks
Barry

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-27  9:20 ` Lorenzo Stoakes
  2025-05-27 20:40   ` Lokesh Gidra
@ 2025-05-28  9:36   ` Barry Song
  2025-05-28  9:43     ` Lorenzo Stoakes
  1 sibling, 1 reply; 6+ messages in thread

From: Barry Song @ 2025-05-28  9:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

On Tue, May 27, 2025 at 5:20 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
[snip]
>
> > Tangquan's testing shows that over 99.5% of memory reclaimed by Android
> > benefits from this per-VMA lock optimization. After extended runtime,
> > 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
> > only 1,231 fell back to mmap_lock.
>
> Thanks, this sounds really promising!
>
> I take it then you have as a result, heavily tested this change?

This was extensively tested on an older Android kernel with real devices.
As you know, running the latest mm-unstable on phones is challenging due
to hardware driver constraints. However, I believe the reported percentage
is accurate, since it seems pointless for userspace heaps to free memory
across two or more VMAs. How could it possibly manage a slab-like system
spanning multiple VMAs?

[snip]

> So I suggest:
>
> - Remove any code for this from do_madvise() and thus make it available to
>   process_madvise() also.
>
> - Try to avoid the special casing here as much as humanly possible :)
>
> - Update madvise_lock()/unlock() to get passed a pointer to struct
>   madvise_behavior to which we can add a boolean or even better I think -
>   an enum indicating which lock type was taken (this can simplify
>   madvise_unlock() also).
>
> - Update madvise_lock() to do all of the checks below, we already
>   effectively do a switch (behavior) so it's not so crazy to do this. And
>   you can also do the fallthrough logic there.
>
> - Obviously madvise_unlock() can be updated to do vma_end_read().

I've definitely considered refactoring madvise_lock, madvise_do_behavior,
and madvise_unlock to encapsulate the details of the per-VMA locking and
mmap_lock handling within those functions:

	madvise_lock(mm, behavior);
	madvise_do_behavior(mm, start, len_in, behavior);
	madvise_unlock(mm, behavior);

However, I'm a bit concerned that this approach might make the code messier
by introducing extra arguments to handle different code paths. For
instance, madvise_do_behavior might need an additional parameter to
determine whether VMA traversal via madvise_walk_vmas is necessary.

That said, I'll give it a try and see if it actually turns out to be as
ugly as I fear.

[snip]

Thanks
Barry

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED
  2025-05-28  9:36 ` Barry Song
@ 2025-05-28  9:43   ` Lorenzo Stoakes
  0 siblings, 0 replies; 6+ messages in thread
From: Lorenzo Stoakes @ 2025-05-28  9:43 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, Barry Song, Liam R. Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Suren Baghdasaryan,
	Lokesh Gidra, Tangquan Zheng

On Wed, May 28, 2025 at 05:36:40PM +0800, Barry Song wrote:
> On Tue, May 27, 2025 at 5:20 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:

[snip]

> >
> > Thanks, this sounds really promising!
> >
> > I take it then you have as a result, heavily tested this change?
>
> This was extensively tested on an older Android kernel with real devices.
> As you know, running the latest mm-unstable on phones is challenging due
> to hardware driver constraints. However, I believe the reported percentage
> is accurate, since it seems pointless for userspace heaps to free memory
> across two or more VMAs. How could it possibly manage a slab-like system
> spanning multiple VMAs?

Right, yeah. mremap()'ing anon memory around might get you unexpected
splits, but generally speaking you'd expect them to be single VMAs.

[snip]

> > >  mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 8433ac9b27e0..da016a1d0434 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1817,6 +1817,39 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> > >
> > >  	if (madvise_should_skip(start, len_in, behavior, &error))
> > >  		return error;
> > > +
> > > +	/*
> > > +	 * MADV_DONTNEED is commonly used with userspace heaps and most often
> > > +	 * affects a single VMA. In these cases, we can use per-VMA locks to
> > > +	 * reduce contention on the mmap_lock.
> > > +	 */
> > > +	if (behavior == MADV_DONTNEED || behavior == MADV_DONTNEED_LOCKED) {
> >
> > So firstly doing this here means process_madvise() doesn't get this benefit, and
> > we're inconsistent between the two which we really want to avoid.
> >
> > But secondly - we definitely need to find a better way to do this :) this
> > basically follows the 'ignore the existing approach and throw in an if
> > (special case) { ... }' pattern that I feel we really need to do all we can
> > to avoid in the kernel.
> >
> > This lies the way of uffd, hugetlb, and thus horrors beyond imagining.
> >
> > I can see why you did this as this is kind of special-cased a bit, and we
> > already do this kind of thing all over the place but let's try to avoid
> > this here.
> >
> > So I suggest:
> >
> > - Remove any code for this from do_madvise() and thus make it available to
> >   process_madvise() also.
> >
> > - Try to avoid the special casing here as much as humanly possible :)
> >
> > - Update madvise_lock()/unlock() to get passed a pointer to struct
> >   madvise_behavior to which we can add a boolean or even better I think -
> >   an enum indicating which lock type was taken (this can simplify
> >   madvise_unlock() also).
> >
> > - Update madvise_lock() to do all of the checks below, we already
> >   effectively do a switch (behavior) so it's not so crazy to do this. And
> >   you can also do the fallthrough logic there.
> >
> > - Obviously madvise_unlock() can be updated to do vma_end_read().
>
> I’ve definitely considered refactoring madvise_lock, madvise_do_behavior,
> and madvise_unlock to encapsulate the details of the per-VMA locking and
> mmap_lock handling within those functions:
>
>         madvise_lock(mm, behavior);
>         madvise_do_behavior(mm, start, len_in, behavior);
>         madvise_unlock(mm, behavior);
>
> However, I’m a bit concerned that this approach might make the code messier
> by introducing extra arguments to handle different code paths.
> For instance,
> madvise_do_behavior might need an additional parameter to determine whether
> VMA traversal via madvise_walk_vmas is necessary.
>
> That said, I’ll give it a try and see if it actually turns out to be as ugly
> as I fear.

Your reticence is understandable, this code can be pretty hairy, however
thanks to SJ's refactoring efforts you can now use the helper struct he
introduced - madvise_behavior - to store this state, so you don't have to
pass parameters really _much_ more than now.

Unfortunately the current version of the patch isn't mergeable as it
stands, so it really does have to be sensibly generalised, however it ends
up looking.

To be clear - I appreciate all your efforts, and sorry to add some extra
work here, but in the long run it'll be worthwhile I feel, and will lay
the foundations for future use of VMA locks :)

Cheers, Lorenzo

[snip]
end of thread, other threads:[~2025-05-28  9:44 UTC | newest]

Thread overview: 6+ messages
2025-05-27  4:41 [PATCH RFC] mm: use per_vma lock for MADV_DONTNEED Barry Song
2025-05-27  9:20 ` Lorenzo Stoakes
2025-05-27 20:40   ` Lokesh Gidra
2025-05-28  9:01     ` Barry Song
2025-05-28  9:36     ` Barry Song
2025-05-28  9:43       ` Lorenzo Stoakes