* [DISCUSSION] Unconditionally lock folios when calling rmap_walk() @ 2025-08-22 17:29 Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox ` (3 more replies) 0 siblings, 4 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-22 17:29 UTC (permalink / raw) To: David Hildenbrand, Lorenzo Stoakes, Andrew Morton Cc: Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett Hi all, Currently, some callers of rmap_walk() conditionally avoid try-locking non-ksm anon folios. This necessitates serialization through anon_vma write-lock elsewhere when folio->mapping and/or folio->index (fields involved in rmap_walk()) are to be updated. This hurts scalability due to coarse granularity of the lock. For instance, when multiple threads invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages from the same src VMA, they all contend for the corresponding anon_vma’s lock. Field traces for arm64 android devices reveal over 30ms of uninterruptible sleep in the main UI thread, leading to janky user interactions. Among all rmap_walk() callers that don’t lock anon folios, folio_referenced() is the most critical (others are page_idle_clear_pte_refs(), damon_folio_young(), and damon_folio_mkold()). The relevant code in folio_referenced() is: if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { we_locked = folio_trylock(folio); if (!we_locked) return 1; } It’s unclear why locking anon_vma exclusively (when updating folio->mapping, like in uffd MOVE) is beneficial over walking rmap with folio locked. It’s in the reclaim path, so should not be a critical path that necessitates some special treatment, unless I’m missing something. Therefore, I propose simplifying the locking mechanism by ensuring the folio is locked before calling rmap_walk(). This helps avoid locking anon_vma when updating folio->mapping, which, for instance, will help eliminate the uninterruptible sleep observed in the field traces mentioned earlier. Furthermore, it enables us to simplify the code in folio_lock_anon_vma_read() by removing the re-check to ensure that the field hasn’t changed under us. Thanks, Lokesh ^ permalink raw reply [flat|nested] 21+ messages in thread
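For readers joining from the uffd side: the write-side serialization being criticized above looks roughly like the sketch below, loosely based on move_pages_pte() in mm/userfaultfd.c (retry loops, PTE manipulation and error paths are elided, so treat it as an illustration rather than the exact upstream code):

	/*
	 * Sketch: how uffd MOVE currently retargets a folio's rmap fields.
	 * The src folio is assumed already isolated and locked here.
	 */
	struct anon_vma *anon_vma = folio_get_anon_vma(src_folio);

	/*
	 * Exclusive anon_vma lock: every concurrent MOVE whose folios hang
	 * off the same root anon_vma funnels through here, even when the
	 * threads are moving completely distinct pages.
	 */
	anon_vma_lock_write(anon_vma);

	folio_move_anon_rmap(src_folio, dst_vma);	/* rewrites folio->mapping */
	src_folio->index = linear_page_index(dst_vma, dst_addr);

	anon_vma_unlock_write(anon_vma);
	put_anon_vma(anon_vma);

Under the proposal, the anon_vma_lock_write()/anon_vma_unlock_write() pair would go away and the already-held folio lock would do the serializing against rmap walkers.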
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra @ 2025-08-22 17:44 ` Matthew Wilcox 2025-08-22 18:08 ` Lorenzo Stoakes 2025-08-24 4:18 ` Lokesh Gidra ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2025-08-22 17:44 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma ... this seems awfully familiar. Why did you send it again after a bunch of people had already said useful things on that thread? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:44 ` Matthew Wilcox @ 2025-08-22 18:08 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-22 18:08 UTC (permalink / raw) To: Matthew Wilcox Cc: Lokesh Gidra, David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 06:44:13PM +0100, Matthew Wilcox wrote: > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > ... this seems awfully familiar. Why did you send it again after > a bunch of people had already said useful things on that thread? This is on me - I asked him to resend because the original was mislabelled and didn't cc the right people, and lei blew up when I tried to grab it, so I asked various people to hold off on replying or to resend their replies on the resent thread, and asked Lokesh to resend. Apologies for the loss of context. rmap locking is important to me as it is deeply sensitive and has caused numerous painful issues in the past, so I want to make sure myself, David and the reviewers are all included. Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox @ 2025-08-24 4:18 ` Lokesh Gidra 2025-08-24 5:31 ` Harry Yoo 2025-08-25 15:19 ` David Hildenbrand 2025-08-26 15:52 ` Lorenzo Stoakes 3 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-24 4:18 UTC (permalink / raw) To: Harry Yoo Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. > > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). This helps avoid locking > anon_vma when updating folio->mapping, which, for instance, will help > eliminate the uninterruptible sleep observed in the field traces > mentioned earlier. Furthermore, it enables us to simplify the code in > folio_lock_anon_vma_read() by removing the re-check to ensure that the > field hasn’t changed under us. Hi Harry, Your comment [1] in the other thread was quite useful and also needed to be responded to. So bringing it here for continuing discussion. It seems from your comment that you misunderstood my proposal. I am not suggesting replacing anon_vma lock with folio lock during rmap walk. Clearly, it is essential for all the reasons that you enumerated. My proposal is to lock anon folios during rmap_walk(), like file and KSM folios. This helps in improving scalability (and also simplifying code in folio_lock_anon_vma_read()) as then we can serialize on folio lock instead of anon_vma lock when moving the folio to a different root anon_vma in folio_move_anon_rmap() [2]. [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo/ [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com/ > > Thanks, > Lokesh ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 4:18 ` Lokesh Gidra @ 2025-08-24 5:31 ` Harry Yoo 2025-08-24 6:45 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: Harry Yoo @ 2025-08-24 5:31 UTC (permalink / raw) To: Lokesh Gidra Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). This helps avoid locking > > anon_vma when updating folio->mapping, which, for instance, will help > > eliminate the uninterruptible sleep observed in the field traces > > mentioned earlier. Furthermore, it enables us to simplify the code in > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > field hasn’t changed under us. > Hi Harry, > > Your comment [1] in the other thread was quite useful and also needed > to be responded to. So bringing it here for continuing discussion. Hi Lokesh, Here I'm quoting my previous comment for discussion. I should have done it earlier but you know, it was Friday night in Korea :) My previous comment was: Simply acquiring the folio lock instead of anon_vma lock isn't enough 1) because the kernel can't stabilize anon_vma without anon_vma lock (an anon_vma cannot be freed while someone's holding anon_vma lock, see anon_vma_free()). 2) without anon_vma lock the kernel can't reliably unmap folios because they can be mapped to other processes (by fork()) while the kernel is iterating the list of VMAs that can possibly map the folio. fork() doesn't and shouldn't acquire folio lock. 3) Holding anon_vma lock also prevents anon_vma_chains from being freed while holding the lock. [Are there more things to worry about that I missed?
Please add them if so] Any idea to relax the locking while still addressing these requirements? If some users don't care about missing some PTE A bits due to a race against fork() (perhaps folio_referenced()?), a crazy idea might be to RCU-protect anon_vma_chains (but then we need to check if the VMA is still alive) and use a refcount to stabilize anon_vmas? > It seems from your comment that you misunderstood my proposal. I am > not suggesting replacing anon_vma lock with folio lock during rmap > walk. Clearly, it is essential for all the reasons that you > enumerated. My proposal is to lock anon folios during rmap_walk(), > like file and KSM folios. Still not sure if I follow your proposal. Let's clarify a little bit. As the anon_vma lock is a reader-writer semaphore, maybe you're saying 1) readers should acquire both the folio lock and the anon_vma lock, and > This helps in improving scalability (and also simplifying code in > folio_lock_anon_vma_read()) as then we can serialize on folio lock > instead of anon_vma lock when moving the folio to a different root > anon_vma in folio_move_anon_rmap() [2]. 2) some of the existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) simply update folio->index and folio->mapping, and they should be able to run in parallel if they're not updating the same folio, by taking the folio lock and avoiding the anon_vma lock? I see a comment in move_pages_pte(): /* * folio_referenced walks the anon_vma chain * without the folio lock. Serialize against it with * the anon_vma lock, the folio lock is not enough. */ > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > Thanks, > > Lokesh -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
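To make the two halves of the clarified proposal concrete, here is a minimal sketch (assuming every rmap_walk() caller gets converted; the reader side mirrors folio_referenced(), the writer side is a simplified, hypothetical uffd MOVE):

	/* Reader side: lock every folio before walking, anon included. */
	if (!is_locked) {
		we_locked = folio_trylock(folio);
		if (!we_locked)
			return 1;	/* report "referenced" rather than block */
	}
	rmap_walk(folio, &rwc);		/* still takes the anon_vma read lock inside */

	/* Writer side: the folio lock alone now pins ->mapping/->index. */
	folio_lock(folio);
	folio_move_anon_rmap(folio, dst_vma);
	folio->index = linear_page_index(dst_vma, dst_addr);
	folio_unlock(folio);

Note that the anon_vma read lock inside rmap_walk() is still required for all the reasons Harry enumerates above (stabilizing the anon_vma, the anon_vma_chains, and the VMA list); the folio lock only replaces the write-side serialization on the rmap fields.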
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 5:31 ` Harry Yoo @ 2025-08-24 6:45 ` Lokesh Gidra 2025-08-25 10:52 ` Harry Yoo 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-24 6:45 UTC (permalink / raw) To: Harry Yoo Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@oracle.com> wrote: > > On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > anon_vma when updating folio->mapping, which, for instance, will help > > > eliminate the uninterruptible sleep observed in the field traces > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > field hasn’t changed under us. > > Hi Harry, > > > > Your comment [1] in the other thread was quite useful and also needed > > to be responded to. So bringing it here for continuing discussion. > > Hi Lokesh, > > Here I'm quoting my previous comment for discussion. I should have done it > earlier but you know, it was Friday night in Korea :) No problem at all. :) > > My previous comment was: > Simply acquiring the folio lock instead of anon_vma lock isn't enough > 1) because the kernel can't stablize anon_vma without anon_vma lock > (an anon_vma cannot be freed while someone's holding anon_vma lock, > see anon_vma_free()). > > 2) without anon_vma lock the kernel can't reliably unmap folios because > they can be mapped to other processes (by fork()) while the kernel is > iterating list of VMAs that can possibly map the folio. fork() doens't > and shouldn't acquire folio lock. 
> > 3) Holding anon_vma lock also prevents anon_vma_chains from > being freed while holding the lock. > > [Are there more things to worry about that I missed? > Please add them if so] > > Any idea to relax locking requirements while addressing these > requirements? > > If some users don't care about missing some PTE A bits due to race > against fork() (perhaps folio_referenced()?), a crazy idea might be to > RCU-protect anon_vma_chains (but then we need to check if the VMA is > still alive) and use refcount to stablize anon_vmas? > > > It seems from your comment that you misunderstood my proposal. I am > > not suggesting replacing anon_vma lock with folio lock during rmap > > walk. Clearly, it is essential for all the reasons that you > > enumerated. My proposal is to lock anon folios during rmap_walk(), > > like file and KSM folios. > > Still not sure if I follow your proposal. Let's clarify a little bit. > > As anon_vma lock is reader-writer semaphore, maybe you're saying > 1) readers should acquire both folio lock and anon_vma lock, and > > > This helps in improving scalability (and also simplifying code in > > folio_lock_anon_vma_read()) as then we can serialize on folio lock > > instead of anon_vma lock when moving the folio to a different root > > anon_vma in folio_move_anon_rmap() [2]. > > 2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) > simply update folio->index and folio->mapping, and they should be able > to run in parallel if they're not updating the same folio, > by taking folio lock and avoiding anon_vma lock? Yes, that's exactly what I am hoping to achieve. > > I see a comment in move_pages_pte(): > /* > * folio_referenced walks the anon_vma chain > * without the folio lock. Serialize against it with > * the anon_vma lock, the folio lock is not enough. > */ > > > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > > Thanks, > > > Lokesh > > -- > Cheers, > Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-24 6:45 ` Lokesh Gidra @ 2025-08-25 10:52 ` Harry Yoo 2025-08-28 11:32 ` Lorenzo Stoakes 0 siblings, 1 reply; 21+ messages in thread From: Harry Yoo @ 2025-08-25 10:52 UTC (permalink / raw) To: Lokesh Gidra Cc: Lorenzo Stoakes, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Sat, Aug 23, 2025 at 11:45:03PM -0700, Lokesh Gidra wrote: > On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@oracle.com> wrote: > > > > On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote: > > > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > > > Hi all, > > > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > > to coarse granularity of the lock. For instance, when multiple threads > > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > > from the same src VMA, they all contend for the corresponding > > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > > user interactions. > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > > folio_referenced() is the most critical (others are > > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > > we_locked = folio_trylock(folio); > > > > if (!we_locked) > > > > return 1; > > > > } > > > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > > with folio locked. It’s in the reclaim path, so should not be a > > > > critical path that necessitates some special treatment, unless I’m > > > > missing something. > > > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > > anon_vma when updating folio->mapping, which, for instance, will help > > > > eliminate the uninterruptible sleep observed in the field traces > > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > > field hasn’t changed under us. > > > Hi Harry, > > > > > > Your comment [1] in the other thread was quite useful and also needed > > > to be responded to. So bringing it here for continuing discussion. > > > > Hi Lokesh, > > > > Here I'm quoting my previous comment for discussion. I should have done it > > earlier but you know, it was Friday night in Korea :) > > No problem at all. :) > > > > My previous comment was: > > Simply acquiring the folio lock instead of anon_vma lock isn't enough > > 1) because the kernel can't stablize anon_vma without anon_vma lock > > (an anon_vma cannot be freed while someone's holding anon_vma lock, > > see anon_vma_free()). 
> > > > 2) without anon_vma lock the kernel can't reliably unmap folios because > > they can be mapped to other processes (by fork()) while the kernel is > > iterating list of VMAs that can possibly map the folio. fork() doens't > > and shouldn't acquire folio lock. > > > > 3) Holding anon_vma lock also prevents anon_vma_chains from > > being freed while holding the lock. > > > > [Are there more things to worry about that I missed? > > Please add them if so] > > > > Any idea to relax locking requirements while addressing these > > requirements? > > > > If some users don't care about missing some PTE A bits due to race > > against fork() (perhaps folio_referenced()?), a crazy idea might be to > > RCU-protect anon_vma_chains (but then we need to check if the VMA is > > still alive) and use refcount to stablize anon_vmas? > > > > > It seems from your comment that you misunderstood my proposal. I am > > > not suggesting replacing anon_vma lock with folio lock during rmap > > > walk. Clearly, it is essential for all the reasons that you > > > enumerated. My proposal is to lock anon folios during rmap_walk(), > > > like file and KSM folios. > > > > Still not sure if I follow your proposal. Let's clarify a little bit. > > > > As anon_vma lock is reader-writer semaphore, maybe you're saying > > 1) readers should acquire both folio lock and anon_vma lock, and > > > > > This helps in improving scalability (and also simplifying code in > > > folio_lock_anon_vma_read()) as then we can serialize on folio lock > > > instead of anon_vma lock when moving the folio to a different root > > > anon_vma in folio_move_anon_rmap() [2]. > > > > 2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c) > > simply update folio->index and folio->mapping, and they should be able > > to run in parallel if they're not updating the same folio, > > by taking folio lock and avoiding anon_vma lock? > > Yes, that's exactly what I am hoping to achieve. I think technically that should work. I can't think of a way this could go wrong (Or maybe I just lack imagination :/ ). Adding folio_lock/unlock in folio_referenced() is likely to affect reclamation performance. How bad is it? I don't know, need data. And adding cost only for the uffd case might not be justifiable... > > I see a comment in move_pages_pte(): > > /* > > * folio_referenced walks the anon_vma chain > > * without the folio lock. Serialize against it with > > * the anon_vma lock, the folio lock is not enough. > > */ > > > > > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo > > > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com > > > > > > Thanks, > > > > Lokesh -- Cheers, Harry / Hyeonggon ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 10:52 ` Harry Yoo @ 2025-08-28 11:32 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 11:32 UTC (permalink / raw) To: Harry Yoo Cc: Lokesh Gidra, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett, David Hildenbrand, Andrew Morton On Mon, Aug 25, 2025 at 07:52:10PM +0900, Harry Yoo wrote: > Adding folio_lock/unlock in folio_referenced() is likely to affect > reclamation performance. How bad is it? I don't know, need data. > > And adding cost only for the uffd case might not be justifiable... Exactly. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox 2025-08-24 4:18 ` Lokesh Gidra @ 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-26 15:52 ` Lorenzo Stoakes 3 siblings, 2 replies; 21+ messages in thread From: David Hildenbrand @ 2025-08-25 15:19 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes, Andrew Morton Cc: Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On 22.08.25 19:29, Lokesh Gidra wrote: > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. > > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). Essentially, what you mean is roughly: diff --git a/mm/rmap.c b/mm/rmap.c index 34333ae3bd80f..0800e73c0796e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked) { we_locked = folio_trylock(folio); if (!we_locked) return 1; The downside of that change is that ordinary (!ksm) folios will observe being locked when we are actually only trying to assess if they were referenced. Does it matter? I can only speculate that it might have been very relevant before 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). Essentially any R/O fault would have resulted in us copying the page, simply because there is concurrent folio_referenced() happening. Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but it would have meant that the write fault would be stuck until folio_referenced() would be done, which is also suboptimal. So likely, avoiding grabbing the folio lock was beneficial. Today, this would only affect R/O pages after fork (PageAnonExclusive not set).
Staring at shrink_active_list()->folio_referenced(), we isolate the folio first (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() would already refuse to reuse immediately, because it would spot a raised reference. The folio lock does not make a difference anymore. Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, except maybe some folio splitting or khugepaged that maybe would have to wait. But khugepaged would already also fail to isolate these folios, so probably it's not that relevant anymore ... -- Cheers David / dhildenb ^ permalink raw reply related [flat|nested] 21+ messages in thread
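David's interleaving argument, condensed into a sketch (simplified from mm/vmscan.c and mm/memory.c; the exact reference-count checks inside wp_can_reuse_anon_folio() are elided):

	/* Reclaim side, shrink_active_list(): */
	isolate_lru_folios(...);	/* takes a folio reference, clears PG_lru */
	folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags);
					/* with the proposal: under folio_trylock() */

	/*
	 * Concurrent write fault on the same folio, do_wp_page():
	 * wp_can_reuse_anon_folio() already bails out because of the extra
	 * reference taken at isolation time and falls back to wp_page_copy(),
	 * so whether reclaim additionally holds the folio lock does not
	 * change the reuse decision.
	 */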
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 15:19 ` David Hildenbrand @ 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 1 sibling, 0 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-25 18:46 UTC (permalink / raw) To: David Hildenbrand Cc: Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Mon, Aug 25, 2025 at 8:19 AM David Hildenbrand <david@redhat.com> wrote: > > On 22.08.25 19:29, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). > > Essentially, what you mean is roughly: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 34333ae3bd80f..0800e73c0796e 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > when we are actually only trying to asses if they were referenced. > > Does it matter? > > I can only speculate that it might have been very relevant before > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > Essentially any R/O fault would have resulted in us copying the page, simply because > there is concurrent folio_referenced() happening. > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > it would have meant that the write fault would be stuck until folio_referenced() > would be done, which is also suboptimal. > > So likely, avoiding grabbing the folio lock was beneficial. > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). 
> > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > would already refuse to reuse immediately, because it would spot a raised reference. > The folio lock does not make a difference anymore. > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > except maybe some folio splitting or khugepaged that maybe would have to wait. > > But khugepaged would already also fail to isolate these folios, so probably it's not that > relevant anymore ... Thanks so much for your thorough analysis. Very useful! For folio splitting, it seems anon_vma lock is acquired exclusively, so it serializes against folio_referenced() anyways. > > -- > Cheers > > David / dhildenb > ^ permalink raw reply [flat|nested] 21+ messages in thread
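For reference, the anon side of the split path Lokesh refers to looks roughly like this in mm/huge_memory.c (error handling and the actual split machinery elided; a sketch, not the exact upstream code):

	anon_vma = folio_get_anon_vma(folio);
	if (!anon_vma)
		return -EBUSY;		/* folio was unmapped under us */

	anon_vma_lock_write(anon_vma);	/* excludes all rmap walkers */
	unmap_folio(folio);
	/* ... freeze refcount, split, remap ... */
	anon_vma_unlock_write(anon_vma);
	put_anon_vma(anon_vma);

So a folio_referenced() walker holding the anon_vma read lock already serializes against a concurrent split, independent of the folio lock.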
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra @ 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-29 0:23 ` Lokesh Gidra 1 sibling, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 12:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lokesh Gidra, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > On 22.08.25 19:29, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). > > Essentially, what you mean is roughly: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 34333ae3bd80f..0800e73c0796e 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > > The downside of that change is that ordinary (!ksm) folios will observe being locked Well anon folios, I guess this is what you meant :) > when we are actually only trying to asses if they were referenced. > > Does it matter? Also another downside is we try lock and abort if we fail, so we've made this conditional on that. But surely this is going to impact reclaim performance esp. under heavy memory pressure? It is at least a trylock. It's dangerous waters, and I'd really want some detailed data + analysis to prove the point here, I don't think theorising about it is enough. > > I can only speculate that it might have been very relevant before > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > Essentially any R/O fault would have resulted in us copying the page, simply because > there is concurrent folio_referenced() happening. Fun. 
Thing is, we now have to consider _every case_ where contention might cause an issue. One thing I _was_ concerned about was: - uffd move locks folios - now folio_referenced() 'fails' returning 1 But the second case is only in shrink_active_list(), which uses this as a boolean... OK so maybe fine for this one. I really do also hate that any future callers are going to possibly be confused about how this function works, but I guess it was already 'weird' for file-backed/KSM. So the issue remains really - folio lock contention as a result of this. It's one thing to theorise, but you may be forgetting something... and then we've changed an absolutely core thing to suit a niche UFFD use case. I do wonder if we can identify this case and handle things differently. Perhaps even saying 'try and get the rmap lock, but if there's "too much" contention, grab the folio lock'. > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > it would have meant that the write fault would be stuck until folio_referenced() > would be done, which is also suboptimal. > > So likely, avoiding grabbing the folio lock was beneficial. > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). Hm, probably less of a problem, that. > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > would already refuse to reuse immediately, because it would spot a raised reference. > The folio lock does not make a difference anymore. folio_check_references() we're good with anyway as the folio is already locked. So obviously shrink_active_list() is the only caller we really care about. That at least reduces this case, but we then have to deal with the fact we're contending this lock elsewhere. > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > except maybe some folio splitting or khugepaged that maybe would have to wait. > > But khugepaged would already also fail to isolate these folios, so probably it's not that > relevant anymore ... This is it... there are a lot of possibilities and we need to tread extremely carefully. If we could find a way to make uffd deal with this one way or another (or possibly - detecting heavy anon_vma lock contention) maybe that'd be better... but then adding more complexity obv. > > -- > Cheers > > David / dhildenb > I mean, having said all the above and also in the other thread - I am open to being convinced I'm wrong and this is ok. Obviously removing the complicated special case for anon would _in general_ be nice, it's just very sensitive stuff :) Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
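Worth noting for the "too much contention" idea: folio_referenced() already backs off from a contended anon_vma rwsem via rmap_walk_control's try_lock/contended fields; today the back-off just means "return 1 and pretend referenced". The variant Lorenzo describes would fall back to the folio lock instead, something like (pure sketch, no such fallback exists today):

	/* Existing back-off in rmap_walk_anon_lock(), roughly: */
	if (anon_vma_trylock_read(anon_vma))
		goto out;
	if (rwc->try_lock) {
		anon_vma = NULL;
		rwc->contended = true;	/* folio_referenced() then returns 1 */
		goto out;
	}
	anon_vma_lock_read(anon_vma);

	/*
	 * Hypothetical extension: instead of setting contended, take the
	 * folio lock (pinning ->mapping/->index) and retry the anon_vma
	 * read lock -- at the cost of the folio-lock contention discussed
	 * above.
	 */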
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-28 12:04 ` Lorenzo Stoakes @ 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:01 ` Lorenzo Stoakes 0 siblings, 2 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-08-29 0:23 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Thu, Aug 28, 2025 at 5:04 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > > On 22.08.25 19:29, Lokesh Gidra wrote: > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). > > > > Essentially, what you mean is roughly: > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 34333ae3bd80f..0800e73c0796e 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > > if (!folio_raw_mapping(folio)) > > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > + if (!is_locked) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > > > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > > Well anon folios, I guess this is what you meant :) > > > when we are actually only trying to asses if they were referenced. > > > > Does it matter? > > Also another downside is we try lock and abort if we fail, so we've made this > conditional on that. > > But surely this is going to impact reclaim performance esp. under heavy memory > pressure? It is at least a trylock. > > It's dangerous waters, and I'd really want some detailed data + analysis to > prove the point here, I don't think theorising about it is enough. 
> > > > I can only speculate that it might have been very relevant before > > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > > > Essentially any R/O fault would have resulted in us copying the page, simply because > > there is concurrent folio_referenced() happening. > > Fun. > > Thing is we now have to consider _every case_ where a contention might cause an > issue. > > One thing I _was_ concerned about was: > > - uffd move locks folios > - now folio_referenced() 'fails' returning 1 > > But case 2 is only in shrink_active_list() which uses this as a boolean... > > OK so maybe fine for this one. For shrink_active_list() it seems like it doesn't matter what folio_referenced() returns unless it's an executable file-backed folio. > > I really do also hate that any future callers are going to possibly be confused > about how this function works, but I guess it was already 'weird' for > file-backed/KSM. Actually, wouldn't the simplification remove the already existing confusion, rather than adding to it? :) We can then simply say that rmap_walk() expects the folio to be locked. > > So the issue remains really - folio lock contention as a result of this. > > It's one thing to theorise, but you may be forgetting something... and then > we've changed an absolutely core thing to suit a niche UFFD use case. I really wish there was a way to avoid this within the UFFD code :( In fact, the real pain point is multiple UFFD threads contending for the anon_vma write-lock, even when they don't need to serialize among themselves. > > I do wonder if we can identify this case and handle things differently. > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > contention, grab the folio lock. Can you please elaborate what you mean? Where do you mean we can possibly do something like this? UFFD move only works on PageAnonExclusive folios. So, would it help (in terms of avoiding contention) if we were to change the condition: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..1638e27347e7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1006,7 +1006,8 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) || + PageAnonExclusive(&folio->page))) { we_locked = folio_trylock(folio); if (!we_locked) return 1; Obviously, this is the opposite of simplification :) But since we know that shrink_active_list() uses this as a boolean, do we even need to walk rmap for PageAnonExclusive folios at all? Can't we simply do: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..a26523de321f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1006,10 +1006,14 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { - we_locked = folio_trylock(folio); - if (!we_locked) + if (!is_locked) { + if (!folio_test_anon(folio) || folio_test_ksm(folio)) { + we_locked = folio_trylock(folio); + if (!we_locked) + return 1; + } else if (PageAnonExclusive(&folio->page)) { return 1; + } } rmap_walk(folio, &rwc); I'm not at all an expert on this, so pardon my ignorance if this is wrong. > > > > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > > it would have meant that the write fault would be stuck until folio_referenced() > > would be done, which is also suboptimal.
> > > > So likely, avoiding grabbing the folio lock was beneficial. > > > > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). > > Hm probably less of a problem that. > > > > > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > > would already refuse to reuse immediately, because it would spot a raised reference. > > The folio lock does not make a difference anymore. > > folio_check_references() we're good with anyway as folio already locked. > > So obviously shrink_active_list() is the only caller we really care about. > > That at least reduces this case, but we then have to deal with the fact we're > contending this lock elsewhere. > > > > > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > > except maybe some folio splitting or khugepaged that maybe would have to wait. > > > > But khugepaged would already also fail to isolate these folios, so probably it's not that > > relevant anymore ... > > This is it... there's a lot of possibilities and we need to tread extremely > carefully. > > if we could find a way to make uffd deal with this one way or another (or > possibly - detecting heavy anon vma lock contention) maybe that'd be > better... but then adding more complexity obv. > > > > > -- > > Cheers > > > > David / dhildenb > > > > I mean having said all the above and also in other thread - I am open to being > convinced I'm wrong and this is ok. > > Obviously removing the complicated special case for anon would _in general_ be > nice, it's just very sensitive stuff :) > > Cheers, Lorenzo ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 0:23 ` Lokesh Gidra @ 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:04 ` Lorenzo Stoakes 1 sibling, 1 reply; 21+ messages in thread From: David Hildenbrand @ 2025-08-29 8:42 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes Cc: Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett >> >> I do wonder if we can identify this case and handle things differently. >> >> Perhaps even saying 'try and get the rmap lock, but if there's "too much" >> contention, grab the folio lock. > > Can you please elaborate what you mean? Where do you mean we can > possibly do something like this? > > UFFD move only works on PageAnonExclusive folios. So, would it help > (in terms of avoiding contention) if we were to change the condition: I think we shouldn't be using PAE here. One could consider using folio_maybe_mapped_shared(), and assume contention on the folio lock if it is maybe mapped shared. But the real question is with whom we would be contending for the folio lock. Is it really other processes mapping that folio? I'm not so sure. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 21+ messages in thread
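A sketch of what David's folio_maybe_mapped_shared() variant might look like in folio_referenced() (assuming the predicate is an acceptable proxy for "other processes might contend the folio lock"):

	if (!is_locked) {
		/*
		 * Lock file, KSM and exclusively-mapped anon folios; keep
		 * today's lockless walk for maybe-shared anon folios, where
		 * folio-lock contention from other mappers is plausible.
		 */
		if (!folio_test_anon(folio) || folio_test_ksm(folio) ||
		    !folio_maybe_mapped_shared(folio)) {
			we_locked = folio_trylock(folio);
			if (!we_locked)
				return 1;
		}
	}

This would still let the exclusively-mapped uffd MOVE case serialize on the folio lock, which is the case the thread cares about.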
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 8:42 ` David Hildenbrand @ 2025-08-29 9:04 ` Lorenzo Stoakes 2025-09-02 18:59 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-29 9:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lokesh Gidra, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > > > > > > I do wonder if we can identify this case and handle things differently. > > > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > > contention, grab the folio lock. > > > > Can you please elaborate what you mean? Where do you mean we can > > possibly do something like this? > > > > UFFD move only works on PageAnonExclusive folios. So, would it help > > (in terms of avoiding contention) if we were to change the condition: > > I think we shouldn't be using PAE here. Once could consider using > folio_maybe_mapped_shared(), and assume contention on the folio lock if it > is maybe mapped shared. Interesting! > > But the real question is with whom we would be contending for the folio > lock. > > Is it really other processes mapping that folio? I'm not so sure. Yeah, I might go off and do some research myself on this, actually. Nail down where this might actually happen. Generally I'm softening on this and maybe we're good with the proposed change. But I still want to be super careful here... :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 9:04 ` Lorenzo Stoakes @ 2025-09-02 18:59 ` Lokesh Gidra 2025-09-02 19:01 ` David Hildenbrand 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-09-02 18:59 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > > > > > > > > I do wonder if we can identify this case and handle things differently. > > > > > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > > > contention, grab the folio lock. > > > > > > Can you please elaborate what you mean? Where do you mean we can > > > possibly do something like this? > > > > > > UFFD move only works on PageAnonExclusive folios. So, would it help > > > (in terms of avoiding contention) if we were to change the condition: > > > > I think we shouldn't be using PAE here. Once could consider using > > folio_maybe_mapped_shared(), and assume contention on the folio lock if it > > is maybe mapped shared. > > Interesting! > > > > > But the real question is with whom we would be contending for the folio > > lock. > > > > Is it really other processes mapping that folio? I'm not so sure. > > Yeah, I might go off and do some research myself on this, actually. Nail down > wehre this might actually happen. > > Generally I'm softening on this and maybe we're good with the proposed change. > > But still want to be super careful here... :) > Anxiously waiting for your assessment. Fingers crossed :) > > > > -- > > Cheers > > > > David / dhildenb > > > > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-09-02 18:59 ` Lokesh Gidra @ 2025-09-02 19:01 ` David Hildenbrand 2025-09-02 19:04 ` Lokesh Gidra 0 siblings, 1 reply; 21+ messages in thread From: David Hildenbrand @ 2025-09-02 19:01 UTC (permalink / raw) To: Lokesh Gidra, Lorenzo Stoakes Cc: Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On 02.09.25 20:59, Lokesh Gidra wrote: > On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: >> >> On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: >>>>> >>>>> I do wonder if we can identify this case and handle things differently. >>>>> >>>>> Perhaps even saying 'try and get the rmap lock, but if there's "too much" >>>>> contention, grab the folio lock. >>>> >>>> Can you please elaborate what you mean? Where do you mean we can >>>> possibly do something like this? >>>> >>>> UFFD move only works on PageAnonExclusive folios. So, would it help >>>> (in terms of avoiding contention) if we were to change the condition: >>> >>> I think we shouldn't be using PAE here. Once could consider using >>> folio_maybe_mapped_shared(), and assume contention on the folio lock if it >>> is maybe mapped shared. >> >> Interesting! >> >>> >>> But the real question is with whom we would be contending for the folio >>> lock. >>> >>> Is it really other processes mapping that folio? I'm not so sure. >> >> Yeah, I might go off and do some research myself on this, actually. Nail down >> wehre this might actually happen. >> >> Generally I'm softening on this and maybe we're good with the proposed change. >> >> But still want to be super careful here... :) >> > Anxiously waiting for your assessment. Fingers crossed :) I'd suggest you prepare an RFC patch where you neatly summarize all we learned so far. :) -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-09-02 19:01 ` David Hildenbrand @ 2025-09-02 19:04 ` Lokesh Gidra 0 siblings, 0 replies; 21+ messages in thread From: Lokesh Gidra @ 2025-09-02 19:04 UTC (permalink / raw) To: David Hildenbrand Cc: Lorenzo Stoakes, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Sep 2, 2025 at 12:01 PM David Hildenbrand <david@redhat.com> wrote: > > On 02.09.25 20:59, Lokesh Gidra wrote: > > On Fri, Aug 29, 2025 at 2:04 AM Lorenzo Stoakes > > <lorenzo.stoakes@oracle.com> wrote: > >> > >> On Fri, Aug 29, 2025 at 10:42:45AM +0200, David Hildenbrand wrote: > >>>>> > >>>>> I do wonder if we can identify this case and handle things differently. > >>>>> > >>>>> Perhaps even saying 'try and get the rmap lock, but if there's "too much" > >>>>> contention, grab the folio lock. > >>>> > >>>> Can you please elaborate what you mean? Where do you mean we can > >>>> possibly do something like this? > >>>> > >>>> UFFD move only works on PageAnonExclusive folios. So, would it help > >>>> (in terms of avoiding contention) if we were to change the condition: > >>> > >>> I think we shouldn't be using PAE here. One could consider using > >>> folio_maybe_mapped_shared(), and assume contention on the folio lock if it > >>> is maybe mapped shared. > >> > >> Interesting! > >> > >>> > >>> But the real question is with whom we would be contending for the folio > >>> lock. > >>> > >>> Is it really other processes mapping that folio? I'm not so sure. > >> > >> Yeah, I might go off and do some research myself on this, actually. Nail down > >> where this might actually happen. > >> > >> Generally I'm softening on this and maybe we're good with the proposed change. > >> > >> But still want to be super careful here... :) > >> > > Anxiously waiting for your assessment. Fingers crossed :) > > I'd suggest you prepare an RFC patch where you neatly summarize all we > learned so far. :) > Sounds good. Will do. Thanks. > -- > Cheers > > David / dhildenb > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand @ 2025-08-29 9:01 ` Lorenzo Stoakes 1 sibling, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-29 9:01 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Thu, Aug 28, 2025 at 05:23:56PM -0700, Lokesh Gidra wrote: > On Thu, Aug 28, 2025 at 5:04 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Mon, Aug 25, 2025 at 05:19:05PM +0200, David Hildenbrand wrote: > > > On 22.08.25 19:29, Lokesh Gidra wrote: > > > > Hi all, > > > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > > to coarse granularity of the lock. For instance, when multiple threads > > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > > from the same src VMA, they all contend for the corresponding > > > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > > user interactions. > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > > folio_referenced() is the most critical (others are > > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > > we_locked = folio_trylock(folio); > > > > if (!we_locked) > > > > return 1; > > > > } > > > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > > with folio locked. It’s in the reclaim path, so should not be a > > > > critical path that necessitates some special treatment, unless I’m > > > > missing something. > > > > > > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > > folio is locked before calling rmap_walk(). > > > > > > Essentially, what you mean is roughly: > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index 34333ae3bd80f..0800e73c0796e 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -1005,7 +1005,7 @@ int folio_referenced(struct folio *folio, int is_locked, > > > if (!folio_raw_mapping(folio)) > > > return 0; > > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > + if (!is_locked) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > > > > > > > The downside of that change is that ordinary (!ksm) folios will observe being locked > > > > Well anon folios, I guess this is what you meant :) > > > > > when we are actually only trying to assess if they were referenced. > > > > > > Does it matter? > > > > Also another downside is we trylock and abort if we fail, so we've made this > > conditional on that. > > > > But surely this is going to impact reclaim performance esp. under heavy memory > > pressure? It is at least a trylock.
> > > > It's dangerous waters, and I'd really want some detailed data + analysis to > > prove the point here; I don't think theorising about it is enough. > > > > > > > > I can only speculate that it might have been very relevant before > > > 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive"). > > > > > > Essentially any R/O fault would have resulted in us copying the page, simply because > > > there is concurrent folio_referenced() happening. > > > > Fun. > > > > Thing is we now have to consider _every case_ where a contention might cause an > > issue. > > > > One thing I _was_ concerned about was: > > > > - uffd move locks folios > > - now folio_referenced() 'fails' returning 1 > > > > But case 2 is only in shrink_active_list() which uses this as a boolean... > > > > OK so maybe fine for this one. > > For shrink_active_list() it seems like it doesn't matter what > folio_referenced() returns unless it's an executable file-backed > folio. Yes agreed, was chatting with David I think yesterday (my review load atm makes remembering when I did stuff harder :P) and was going through the code as a result and had a closer look and you're right. So it returning 1 is fine. > > > > I really do also hate that any future callers are going to possibly be confused > > about how this function works, but I guess it was already 'weird' for > > file-backed/KSM. > > Actually, wouldn't the simplification remove the already existing > confusion, rather than adding to it? :) > We can then simply say, rmap_walk() expects folio to be locked. Yeah it does simplify in that sense; the real issue is - will we see contention in some workloads. I'm sort of gradually softening on this as we talk... but I feel like we really need to check this more thoroughly. > > > > > So the issue remains really - folio lock contention as a result of this. > > > > It's one thing to theorise, but you may be forgetting something... and then > > we've changed an absolutely core thing to suit a niche UFFD use case. > > I really wish there was a way to avoid this within the UFFD code :( In > fact, the real pain point is multiple UFFD threads contending for > write-lock on anon_vma, even when they don't need to serialize among > themselves. > > > > I do wonder if we can identify this case and handle things differently. > > > > Perhaps even saying 'try and get the rmap lock, but if there's "too much" > > contention, grab the folio lock. > > Can you please elaborate what you mean? Where do you mean we can > possibly do something like this? It's vague hand waving, but generally if we could detect contention then we could conditionally change how we handle this, perhaps... > > UFFD move only works on PageAnonExclusive folios. So, would it help > (in terms of avoiding contention) if we were to change the condition: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..1638e27347e7 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1006,7 +1006,8 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) || > + PageAnonExclusive(&folio->page))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > > Obviously, this is the opposite of simplification :) > > But as we know that shrink_active_list() uses this as a boolean, do > we even need to walk rmap for PageAnonExclusive folios?
Can't we > simply do: > > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..a26523de321f 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1006,10 +1006,14 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > - we_locked = folio_trylock(folio); > - if (!we_locked) > + if (!is_locked) { > + if (!folio_test_anon(folio) || folio_test_ksm(folio)) { > + we_locked = folio_trylock(folio); > + if (!we_locked) > + return 1; > + } else if (PageAnonExclusive(&folio->page)) { > return 1; > + } > } > > rmap_walk(folio, &rwc); > > I'm not at all an expert on this, so pardon my ignorance if this is wrong. I see David's replied and he _is_ the expert on PAE so we'll see :) > > > > > > > > Before 09854ba94c6a ("mm: do_wp_page() simplification") that wasn't an issue, but > > > it would have meant that the write fault would be stuck until folio_referenced() > > > would be done, which is also suboptimal. > > > > > > So likely, avoiding grabbing the folio lock was beneficial. > > > > > > > > > Today, this would only affect R/O pages after fork (PageAnonExclusive not set). > > > > Hm, probably less of a problem, that. > > > > > > > > > > > Staring at shrink_active_list()->folio_referenced(), we isolate the folio first > > > (grabbing reference+clearing LRU), so do_wp_page()->wp_can_reuse_anon_folio() > > > would already refuse to reuse immediately, because it would spot a raised reference. > > > The folio lock does not make a difference anymore. > > > > folio_check_references() we're good with anyway as the folio is already locked. > > > > So obviously shrink_active_list() is the only caller we really care about. > > > > That at least reduces this case, but we then have to deal with the fact we're > > contending this lock elsewhere. > > > > > > > > > > > Is there any other anon-specific (!ksm) folio locking? Nothing exciting comes to mind, > > > except maybe some folio splitting or khugepaged that maybe would have to wait. > > > > > > But khugepaged would already also fail to isolate these folios, so probably it's not that > > > relevant anymore ... > > > > This is it... there are a lot of possibilities and we need to tread extremely > > carefully. > > > > if we could find a way to make uffd deal with this one way or another (or > > possibly - detecting heavy anon vma lock contention) maybe that'd be > > better... but then adding more complexity obv. > > > > > > > > -- > > > Cheers > > > > > > David / dhildenb > > > > > > > I mean having said all the above and also in the other thread - I am open to being > > convinced I'm wrong and this is ok. > > > > Obviously removing the complicated special case for anon would _in general_ be > > nice, it's just very sensitive stuff :) > > > > Cheers, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
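For context on the "uses this as a boolean" point: the shrink_active_list() consumption of folio_referenced() has roughly the following shape (paraphrased from mm/vmscan.c, with the surrounding isolation and accounting elided), which is why a conservative "return 1" merely keeps the folio on the active list:

	if (folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags)) {
		/*
		 * Only referenced, executable file-backed folios get special
		 * treatment (another trip around the active list); for
		 * everything else only zero vs. non-zero matters, not the
		 * exact count of referencing PTEs.
		 */
		if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio)) {
			nr_rotated += folio_nr_pages(folio);
			list_add(&folio->lru, &l_active);
			continue;
		}
	}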
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra ` (2 preceding siblings ...) 2025-08-25 15:19 ` David Hildenbrand @ 2025-08-26 15:52 ` Lorenzo Stoakes 2025-08-26 22:23 ` Lokesh Gidra 3 siblings, 1 reply; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-26 15:52 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > Hi all, > > Currently, some callers of rmap_walk() conditionally avoid try-locking > non-ksm anon folios. This necessitates serialization through anon_vma > write-lock elsewhere when folio->mapping and/or folio->index (fields > involved in rmap_walk()) are to be updated. This hurts scalability due > to coarse granularity of the lock. For instance, when multiple threads > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > from the same src VMA, they all contend for the corresponding > anon_vma’s lock. Field traces for arm64 android devices reveal over > 30ms of uninterruptible sleep in the main UI thread, leading to janky > user interactions. Can we clarify whether this is simply an example, or rather the entire motivating reason for raising this issue? It's important, because it strikes me that this is a very specific use case, and you're now suggesting changing core locking to suit it. While this is a discussion, and I'm glad you raised it, I think it's important in these cases to really exhaustively examine all of the possible consequences. OK so to clarify: - You want to traverse the rmap entirely without any rmap locks whatsoever for anon, relying solely on the folio lock to serialise, because otherwise rmap read locks here block other rmap write lock calls. - You want to unconditionally folio lock all anon and KSM folios for at least folio_referenced(). In order to resolve a scalability issue specific to a uffd use case? Is this the case? Happy to be corrected if I've misinterpreted. I don't see how this could possibly work, unless I'm missing something here, because: 1. When we lock anon_vma's it's at the root which covers all anon_vma's covering parent/children of forked processes. 2. We do "top down" operations that acquire the rmap lock on the assumption that we have exclusive access to the rmapping; these operations have nothing to do with the folio, nor could we even know what the folio is at this point. 3. We manipulate higher level page tables on the basis that the rmap lock excludes other page table walkers. So this proposal seems to violate all of that? For instance, in many VMA operations we perform: anon_vma_interval_tree_pre_update_vma() and anon_vma_interval_tree_post_update_vma() Which removes _all_ R/B tree mappings. So you can now race with this (it of course doesn't care about folio lock) and then get completely incorrect results? This seems fairly disastrous? In free_pgtables() also we call unlink_anon_vmas() which iterates through the vma->anon_vma_chain and uses the anon lock to tear down higher order page tables which you now might race with and that seems even more disastrous...
> > Among all rmap_walk() callers that don’t lock anon folios, > folio_referenced() is the most critical (others are > page_idle_clear_pte_refs(), damon_folio_young(), and > damon_folio_mkold()). The relevant code in folio_referenced() is: > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; > } > > It’s unclear why locking anon_vma exclusively (when updating > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > with folio locked. It’s in the reclaim path, so should not be a > critical path that necessitates some special treatment, unless I’m > missing something. > Therefore, I propose simplifying the locking mechanism by ensuring the > folio is locked before calling rmap_walk(). This helps avoid locking > anon_vma when updating folio->mapping, which, for instance, will help > eliminate the uninterruptible sleep observed in the field traces > mentioned earlier. Furthermore, it enables us to simplify the code in > folio_lock_anon_vma_read() by removing the re-check to ensure that the > field hasn’t changed under us. I mean this is why I get confused here though, because you seem to go from saying 'don't take the rmap lock at all' to then referencing folio_lock_anon_vma_read()? Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but then I don't see how you impact contention on the anon_vma lock by making this change? I think in general - let's clarify what exactly you intend to do here, and then we need to delineate what we need to confirm and test to have any confidence in making such a change. anon_vma locks (and rmap locks) are very very sensitive in general and we've had actual security issues come up due to race windows emerging from inappropriate handling, not to mention that performance around this obviously matters a great deal. So we must tread carefully here. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 21+ messages in thread
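To make the hazard Lorenzo raises concrete, the pre/post update helpers are roughly the following (paraphrased from the kernel sources; mm/vma.c in recent trees). Their callers hold the anon_vma write lock around the pair, and nothing here takes or checks a folio lock, so it is the rmap lock alone that keeps walkers out of the window where the interval tree is empty:

	static void anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc;

		/* Drop every rmap interval-tree node for this VMA. */
		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
			anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root);
	}

	static void anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
	{
		struct anon_vma_chain *avc;

		/* Re-insert the nodes once the VMA has been updated. */
		list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
			anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
	}

As Lokesh clarifies in his reply below, the proposal keeps the rmap read lock in rmap_walk() and only adds the folio lock on top, so these writers would still exclude walkers.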
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-26 15:52 ` Lorenzo Stoakes @ 2025-08-26 22:23 ` Lokesh Gidra 2025-08-28 11:31 ` Lorenzo Stoakes 0 siblings, 1 reply; 21+ messages in thread From: Lokesh Gidra @ 2025-08-26 22:23 UTC (permalink / raw) To: Lorenzo Stoakes Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > Hi all, > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > non-ksm anon folios. This necessitates serialization through anon_vma > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > involved in rmap_walk()) are to be updated. This hurts scalability due > > to coarse granularity of the lock. For instance, when multiple threads > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > from the same src VMA, they all contend for the corresponding > > anon_vma’s lock. Field traces for arm64 android devices reveal over > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > user interactions. > > Can we clarify whether this is simply an example, or rather the entire > motivating reason for raising this issue? > When I started off I thought maybe there are other cases too, but it looks like as of now only uffd MOVE updates folio->mapping to a different root anon_vma. > It's important, because it strikes me that this is a very specific use > case, and you're now suggesting changing core locking to suit it. > > While this is a discussion, and I'm glad you raised it, I think it's > important in these cases to really exhaustively examine all of the possible > consequences. > > OK so to clarify: > > - You want to traverse the rmap entirely without any rmap locks whatsoever > for anon, relying solely on the folio lock to serialise, because > otherwise rmap read locks here block other rmap write lock calls. > There is a misunderstanding. I'm suggesting locking *both* the folio and the anon_vma during the rmap walk. To avoid any confusion, here are the simplifications in mm/rmap.c that I suggest: diff --git a/mm/rmap.c b/mm/rmap.c index 568198e9efc2..81c177b0cddf 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, struct anon_vma *root_anon_vma; unsigned long anon_mapping; -retry: rcu_read_lock(); anon_mapping = (unsigned long)READ_ONCE(folio->mapping); if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { - /* - * folio_move_anon_rmap() might have changed the anon_vma as we - * might not hold the folio lock here.
- */ - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != - anon_mapping)) { - up_read(&root_anon_vma->rwsem); - rcu_read_unlock(); - goto retry; - } - /* * If the folio is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const struct folio *folio, rcu_read_unlock(); anon_vma_lock_read(anon_vma); - /* - * folio_move_anon_rmap() might have changed the anon_vma as we might - * not hold the folio lock here. - */ - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != - anon_mapping)) { - anon_vma_unlock_read(anon_vma); - put_anon_vma(anon_vma); - anon_vma = NULL; - goto retry; - } - if (atomic_dec_and_test(&anon_vma->refcount)) { /* * Oops, we held the last refcount, release the lock @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked, if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + if (!is_locked) { we_locked = folio_trylock(folio); if (!we_locked) return 1; > - You want to unconditionally folio lock all anon and KSM folios for at > least folio_referenced(). > Actually file and KSM folios are always locked today. The anon folios are conditionally left out. So my proposal actually standardizes this locking, which is an overall simplification. > In order to resolve a scalability issue specific to a uffd use case? > With the requirement of locking anon_vma in write mode, uffd MOVE currently is unusable in practice due to poor scalability. The above change in mm/rmap.c allows us to make the following improvement to MOVE ioctl: diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 45e6290e2e8b..c4fc87d73ab7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dummy_pmdval; pmd_t dst_pmdval; struct folio *src_folio = NULL; - struct anon_vma *src_anon_vma = NULL; struct mmu_notifier_range range; int err = 0; @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, goto retry; } - if (!src_anon_vma) { - /* - * folio_referenced walks the anon_vma chain - * without the folio lock. Serialize against it with - * the anon_vma lock, the folio lock is not enough. - */ - src_anon_vma = folio_get_anon_vma(src_folio); - if (!src_anon_vma) { - /* page was unmapped from under us */ - err = -EAGAIN; - goto out; - } - if (!anon_vma_trylock_write(src_anon_vma)) { - pte_unmap(src_pte); - pte_unmap(dst_pte); - src_pte = dst_pte = NULL; - /* now we can block and wait */ - anon_vma_lock_write(src_anon_vma); - goto retry; - } - } - err = move_present_pte(mm, dst_vma, src_vma, dst_addr, src_addr, dst_pte, src_pte, orig_dst_pte, orig_src_pte, dst_pmd, @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, } out: - if (src_anon_vma) { - anon_vma_unlock_write(src_anon_vma); - put_anon_vma(src_anon_vma); - } if (src_folio) { folio_unlock(src_folio); folio_put(src_folio); > Is this the case? Happy to be corrected if I've misinterpreted. > > I don't see how this could possibly work, unless I'm missing something > here, because: > > 1. When we lock anon_vma's it's at the root which covers all anon_vma's > covering parent/children of forked processes. > > 2. 
We do "top down" operations that acquire the rmap lock on the assumption > that we have exclusive access to the rmapping; these operations have nothing to do with > the folio, nor could we even know what the folio is at this point. > > 3. We manipulate higher level page tables on the basis that the rmap lock > excludes other page table walkers. > > So this proposal seems to violate all of that? > > For instance, in many VMA operations we perform: > > anon_vma_interval_tree_pre_update_vma() > > and > > anon_vma_interval_tree_post_update_vma() > > Which removes _all_ R/B tree mappings. > > So you can now race with this (it of course doesn't care about folio lock) > and then get completely incorrect results? > > This seems fairly disastrous? > > In free_pgtables() also we call unlink_anon_vmas() which iterates through > the vma->anon_vma_chain and uses the anon lock to tear down higher order > page tables which you now might race with and that seems even more > disastrous... > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > folio_referenced() is the most critical (others are > > page_idle_clear_pte_refs(), damon_folio_young(), and > > damon_folio_mkold()). The relevant code in folio_referenced() is: > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > we_locked = folio_trylock(folio); > > if (!we_locked) > > return 1; > > } > > > > It’s unclear why locking anon_vma exclusively (when updating > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > with folio locked. It’s in the reclaim path, so should not be a > > critical path that necessitates some special treatment, unless I’m > > missing something. > > Therefore, I propose simplifying the locking mechanism by ensuring the > > folio is locked before calling rmap_walk(). This helps avoid locking > > anon_vma when updating folio->mapping, which, for instance, will help > > eliminate the uninterruptible sleep observed in the field traces > > mentioned earlier. Furthermore, it enables us to simplify the code in > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > field hasn’t changed under us. > > > I mean this is why I get confused here though, because you seem to go > from saying 'don't take the rmap lock at all' to then referencing > folio_lock_anon_vma_read()? > > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but > then I don't see how you impact contention on the anon_vma lock by making > this change? > > I think in general - let's clarify what exactly you intend to do here, and > then we need to delineate what we need to confirm and test to have any > confidence in making such a change. > > anon_vma locks (and rmap locks) are very very sensitive in general and > we've had actual security issues come up due to race windows emerging from > inappropriate handling, not to mention that performance around this > obviously matters a great deal. I couldn't agree more. My changes seemed to simplify things; otherwise I wouldn't have suggested this. And David's reply yesterday gives confidence that it wouldn't negatively affect performance either. Thanks, Lokesh > > So we must tread carefully here. > > Thanks, Lorenzo ^ permalink raw reply related [flat|nested] 21+ messages in thread
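For reference, folio_move_anon_rmap() (the writer that the deleted re-checks in folio_lock_anon_vma_read() were guarding against) already asserts the folio lock; it looks roughly like this (paraphrased from mm/rmap.c, comments reworded):

	void folio_move_anon_rmap(struct folio *folio, struct vm_area_struct *vma)
	{
		void *anon_vma = vma->anon_vma;

		/* Callers must hold the folio lock across the rebind. */
		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
		VM_BUG_ON_VMA(!anon_vma, vma);

		anon_vma = (void *) anon_vma + FOLIO_MAPPING_ANON;
		/*
		 * Publish the new anon_vma and the anon flag together so
		 * concurrent readers of folio->mapping never see a torn
		 * value.
		 */
		WRITE_ONCE(folio->mapping, anon_vma);
	}

This is the invariant the proposed simplification leans on: if every rmap_walk() caller holds the folio lock, folio->mapping cannot be rebound under the walk, so the retry logic above (and the anon_vma write lock in uffd MOVE) would no longer be needed.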
* Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk() 2025-08-26 22:23 ` Lokesh Gidra @ 2025-08-28 11:31 ` Lorenzo Stoakes 0 siblings, 0 replies; 21+ messages in thread From: Lorenzo Stoakes @ 2025-08-28 11:31 UTC (permalink / raw) To: Lokesh Gidra Cc: David Hildenbrand, Andrew Morton, Harry Yoo, Zi Yan, Barry Song, open list:MEMORY MANAGEMENT, Peter Xu, Suren Baghdasaryan, Kalesh Singh, android-mm, linux-kernel, Jann Horn, Rik van Riel, Vlastimil Babka, Liam R. Howlett On Tue, Aug 26, 2025 at 03:23:28PM -0700, Lokesh Gidra wrote: > On Tue, Aug 26, 2025 at 8:52 AM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > > > On Fri, Aug 22, 2025 at 10:29:52AM -0700, Lokesh Gidra wrote: > > > Hi all, > > > > > > Currently, some callers of rmap_walk() conditionally avoid try-locking > > > non-ksm anon folios. This necessitates serialization through anon_vma > > > write-lock elsewhere when folio->mapping and/or folio->index (fields > > > involved in rmap_walk()) are to be updated. This hurts scalability due > > > to coarse granularity of the lock. For instance, when multiple threads > > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages > > > from the same src VMA, they all contend for the corresponding > > > anon_vma’s lock. Field traces for arm64 android devices reveal over Hmm, I started by responding below but now have a vague thought of: What if we find a way to somehow detect this scenario, and mark the anon_vma in some way to indicate that a folio lock should be tried? That'd be a lot less egregious than changing things fundamentally for everyone. > > > 30ms of uninterruptible sleep in the main UI thread, leading to janky > > > user interactions. > > > > Can we clarify whether this is simply an example, or rather the entire > > motivating reason for raising this issue? > > > When I started off I thought maybe there are other cases too, but it > looks like as of now only uffd MOVE updates folio->mapping to a > different root anon_vma. Yup, I mean I looked into making mremap() do it, but it was insanely difficult to make it work (sadly!). But indeed. I think it's important to highlight that this is the use case. I wonder if we can't do something specific to uffd then that would be less potentially problematic for the rest of core. Because I just don't really see this as upstreamable otherwise. > > > It's important, because it strikes me that this is a very specific use > > case, and you're now suggesting changing core locking to suit it. > > > > While this is a discussion, and I'm glad you raised it, I think it's > > important in these cases to really exhaustively examine all of the possible > > consequences. > > > > OK so to clarify: > > > > - You want to traverse the rmap entirely without any rmap locks whatsoever > > for anon, relying solely on the folio lock to serialise, because > > otherwise rmap read locks here block other rmap write lock calls. > > > There is a misunderstanding. I'm suggesting locking *both* the folio > and the anon_vma during the rmap walk. To avoid any confusion, here are > the simplifications in mm/rmap.c that I suggest: OK. Well that's less extreme :) But then, if we're taking both locks, how does this prevent contention on the anon_vma lock? Even so, this is adding a bunch of overhead.
> > diff --git a/mm/rmap.c b/mm/rmap.c > index 568198e9efc2..81c177b0cddf 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -547,7 +547,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > struct anon_vma *root_anon_vma; > unsigned long anon_mapping; > > -retry: > rcu_read_lock(); > anon_mapping = (unsigned long)READ_ONCE(folio->mapping); > if ((anon_mapping & FOLIO_MAPPING_FLAGS) != FOLIO_MAPPING_ANON) > @@ -558,17 +557,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > anon_vma = (struct anon_vma *) (anon_mapping - FOLIO_MAPPING_ANON); > root_anon_vma = READ_ONCE(anon_vma->root); > if (down_read_trylock(&root_anon_vma->rwsem)) { > - /* > - * folio_move_anon_rmap() might have changed the anon_vma as we > - * might not hold the folio lock here. > - */ > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != > - anon_mapping)) { > - up_read(&root_anon_vma->rwsem); > - rcu_read_unlock(); > - goto retry; > - } > - > /* > * If the folio is still mapped, then this anon_vma is still > * its anon_vma, and holding the mutex ensures that it will > @@ -603,18 +591,6 @@ struct anon_vma *folio_lock_anon_vma_read(const > struct folio *folio, > rcu_read_unlock(); > anon_vma_lock_read(anon_vma); > > - /* > - * folio_move_anon_rmap() might have changed the anon_vma as we might > - * not hold the folio lock here. > - */ > - if (unlikely((unsigned long)READ_ONCE(folio->mapping) != > - anon_mapping)) { > - anon_vma_unlock_read(anon_vma); > - put_anon_vma(anon_vma); > - anon_vma = NULL; > - goto retry; > - } > - > if (atomic_dec_and_test(&anon_vma->refcount)) { > /* > * Oops, we held the last refcount, release the lock > @@ -1006,7 +982,7 @@ int folio_referenced(struct folio *folio, int is_locked, > if (!folio_raw_mapping(folio)) > return 0; > > - if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > + if (!is_locked) { > we_locked = folio_trylock(folio); > if (!we_locked) > return 1; This is still a really big change: we're going to be contending the folio lock potentially a LOT more, for the sake of a very specific and peculiar uffd use case. It's hard to justify. And any such justification would need _really serious_ testing on very many arches/workloads to even come close to being ok in my view. This is a pretty huge ask + it's for a specific use case. > > > - You want to unconditionally folio lock all anon and KSM folios for at > > least folio_referenced(). > > > Actually file and KSM folios are always locked today. The anon folios > are conditionally left out. So my proposal actually standardizes this > locking, which is an overall simplification. Right, yes, sorry, I misspoke; I meant to say anon. This is a HUGE exception though, because that covers the majority of a process's memory allocation. > > > In order to resolve a scalability issue specific to a uffd use case? > > > With the requirement of locking anon_vma in write mode, uffd MOVE > currently is unusable in practice due to poor scalability.
The above > change in mm/rmap.c allows us to make the following improvement to > MOVE ioctl: > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index 45e6290e2e8b..c4fc87d73ab7 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -1192,7 +1192,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > pmd_t dummy_pmdval; > pmd_t dst_pmdval; > struct folio *src_folio = NULL; > - struct anon_vma *src_anon_vma = NULL; > struct mmu_notifier_range range; > int err = 0; > > @@ -1353,28 +1352,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > goto retry; > } > > - if (!src_anon_vma) { > - /* > - * folio_referenced walks the anon_vma chain > - * without the folio lock. Serialize against it with > - * the anon_vma lock, the folio lock is not enough. > - */ > - src_anon_vma = folio_get_anon_vma(src_folio); > - if (!src_anon_vma) { > - /* page was unmapped from under us */ > - err = -EAGAIN; > - goto out; > - } > - if (!anon_vma_trylock_write(src_anon_vma)) { > - pte_unmap(src_pte); > - pte_unmap(dst_pte); > - src_pte = dst_pte = NULL; > - /* now we can block and wait */ > - anon_vma_lock_write(src_anon_vma); > - goto retry; > - } > - } > - > err = move_present_pte(mm, dst_vma, src_vma, > dst_addr, src_addr, dst_pte, src_pte, > orig_dst_pte, orig_src_pte, dst_pmd, > @@ -1445,10 +1422,6 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > } > > out: > - if (src_anon_vma) { > - anon_vma_unlock_write(src_anon_vma); > - put_anon_vma(src_anon_vma); > - } > if (src_folio) { > folio_unlock(src_folio); > folio_put(src_folio); Right, but again it's a niche use case, sorry. Changing how _the whole system_ does rmap to suit a very specific use case isn't really a viable approach. > > > > Is this the case? Happy to be corrected if I've misinterpreted. > > > > I don't see how this could possibly work, unless I'm missing something > > here, because: > > > > 1. When we lock anon_vma's it's at the root which covers all anon_vma's > > covering parent/children of forked processes. > > > > 2. We do "top down" operations that acquire the rmap lock on the assumption > > that we have exclusive access to the rmapping; these operations have nothing to do with > > the folio, nor could we even know what the folio is at this point. > > > > 3. We manipulate higher level page tables on the basis that the rmap lock > > excludes other page table walkers. > > > > So this proposal seems to violate all of that? > > > > For instance, in many VMA operations we perform: > > > > anon_vma_interval_tree_pre_update_vma() > > > > and > > > > anon_vma_interval_tree_post_update_vma() > > > > Which removes _all_ R/B tree mappings. > > > > So you can now race with this (it of course doesn't care about folio lock) > > and then get completely incorrect results? > > > > This seems fairly disastrous? > > > > In free_pgtables() also we call unlink_anon_vmas() which iterates through > > the vma->anon_vma_chain and uses the anon lock to tear down higher order > > page tables which you now might race with and that seems even more > > disastrous... > > > > > > > > > > Among all rmap_walk() callers that don’t lock anon folios, > > > folio_referenced() is the most critical (others are > > > page_idle_clear_pte_refs(), damon_folio_young(), and > > > damon_folio_mkold()). 
The relevant code in folio_referenced() is: > > > > > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { > > > we_locked = folio_trylock(folio); > > > if (!we_locked) > > > return 1; > > > } > > > > > > It’s unclear why locking anon_vma exclusively (when updating > > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap > > > with folio locked. It’s in the reclaim path, so should not be a > > > critical path that necessitates some special treatment, unless I’m > > > missing something. > > > Therefore, I propose simplifying the locking mechanism by ensuring the > > > folio is locked before calling rmap_walk(). This helps avoid locking > > > anon_vma when updating folio->mapping, which, for instance, will help > > > eliminate the uninterruptible sleep observed in the field traces > > > mentioned earlier. Furthermore, it enables us to simplify the code in > > > folio_lock_anon_vma_read() by removing the re-check to ensure that the > > > field hasn’t changed under us. > > > > > > I mean this is why I get confused here though, because you seem to go > > from saying 'don't take the rmap lock at all' to then referencing > > folio_lock_anon_vma_read()? > > > > Perhaps I misinterpreted (forgive me if so) and indeed you meant this, but > > then I don't see how you impact contention on the anon_vma lock by making > > this change? > > > > I think in general - let's clarify what exactly you intend to do here, and > > then we need to delineate what we need to confirm and test to have any > > confidence in making such a change. > > > > anon_vma locks (and rmap locks) are very very sensitive in general and > > we've had actual security issues come up due to race windows emerging from > > inappropriate handling, not to mention that performance around this > > obviously matters a great deal. > > I couldn't agree more. My changes seemed to simplify things; otherwise I > wouldn't have suggested this. And David's reply yesterday gives > confidence that it wouldn't negatively affect performance either. This isn't a simplification though; this is taking a new lock in a core mm code path _for everyone_ for a specific UFFD use case. Everyone _but_ people using this UFFD stuff will just pay an overhead. Another aspect is you're now making it much more likely that taking the lock will fail, since it's a trylock... It really needs very deep analysis and justification for me to be in any way convinced this is ok. It's hard to justify and I think any workable solution would need to know this case applies. A very simple but horrible thing could be to have a config flag to enable/disable this. An even more really truly horrible thing would be a prctl()... but let's not. > > Thanks, > Lokesh > > > > So we must tread carefully here. > > > > Thanks, Lorenzo I remain very very skeptical of this as-is. Of course always happy to be convinced otherwise... ^ permalink raw reply [flat|nested] 21+ messages in thread
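As a closing note on the "mark the anon_vma" idea floated earlier in the thread, one hand-wavy shape it could take is sketched below. Every identifier here is invented for illustration; no such field or helpers exist in the kernel, and safely finding the anon_vma at this point has its own subtleties, which is part of why this stayed hand-waving:

	/* Hypothetical: a per-anon_vma counter of active uffd MOVE callers. */
	struct anon_vma {
		/* ... existing fields ... */
		atomic_t uffd_movers;		/* invented name */
	};

	/* uffd MOVE would bump it around the folio->mapping rebind,
	 * with the folio locked: */
	atomic_inc(&src_anon_vma->uffd_movers);
	folio_move_anon_rmap(src_folio, dst_vma);
	atomic_dec(&src_anon_vma->uffd_movers);

	/* ... and folio_referenced() would pay the folio-lock cost only
	 * when a mover is active (anon_vma_has_movers() is an invented
	 * helper that would first have to look the anon_vma up, which is
	 * itself racy without the folio lock): */
	if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio) ||
			   anon_vma_has_movers(folio))) {
		we_locked = folio_trylock(folio);
		if (!we_locked)
			return 1;
	}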
Thread overview: 21+ messages -- 2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra 2025-08-22 17:44 ` Matthew Wilcox 2025-08-22 18:08 ` Lorenzo Stoakes 2025-08-24 4:18 ` Lokesh Gidra 2025-08-24 5:31 ` Harry Yoo 2025-08-24 6:45 ` Lokesh Gidra 2025-08-25 10:52 ` Harry Yoo 2025-08-28 11:32 ` Lorenzo Stoakes 2025-08-25 15:19 ` David Hildenbrand 2025-08-25 18:46 ` Lokesh Gidra 2025-08-28 12:04 ` Lorenzo Stoakes 2025-08-29 0:23 ` Lokesh Gidra 2025-08-29 8:42 ` David Hildenbrand 2025-08-29 9:04 ` Lorenzo Stoakes 2025-09-02 18:59 ` Lokesh Gidra 2025-09-02 19:01 ` David Hildenbrand 2025-09-02 19:04 ` Lokesh Gidra 2025-08-29 9:01 ` Lorenzo Stoakes 2025-08-26 15:52 ` Lorenzo Stoakes 2025-08-26 22:23 ` Lokesh Gidra 2025-08-28 11:31 ` Lorenzo Stoakes