linux-kernel.vger.kernel.org archive mirror
From: Lokesh Gidra <lokeshgidra@google.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>,  Barry Song <21cnbao@gmail.com>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	Peter Xu <peterx@redhat.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	 android-mm <android-mm@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	 Jann Horn <jannh@google.com>, Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	 Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()
Date: Sat, 23 Aug 2025 23:45:03 -0700
Message-ID: <CA+EESO6PFJ9A0kbRWd-ARBLmQ9pwNOF=GBuAzMCOyFvps4euGA@mail.gmail.com>
In-Reply-To: <aKqjWqn8lrITKI7P@hyeyoo>

On Sat, Aug 23, 2025 at 10:31 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote:
> > On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > Hi all,
> > >
> > > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > > non-ksm anon folios. This necessitates serialization through anon_vma
> > > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > > involved in rmap_walk()) are to be updated. This hurts scalability due
> > > to coarse granularity of the lock. For instance, when multiple threads
> > > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > > from the same src VMA, they all contend for the corresponding
> > > anon_vma’s lock. Field traces for arm64 android devices reveal over
> > > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > > user interactions.
> > >
> > > Among all rmap_walk() callers that don’t lock anon folios,
> > > folio_referenced() is the most critical (others are
> > > page_idle_clear_pte_refs(), damon_folio_young(), and
> > > damon_folio_mkold()). The relevant code in folio_referenced() is:
> > >
> > > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> > >         we_locked = folio_trylock(folio);
> > >         if (!we_locked)
> > >                 return 1;
> > > }
> > >
> > > It’s unclear why locking anon_vma exclusively (when updating
> > > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > > with folio locked. It’s in the reclaim path, so should not be a
> > > critical path that necessitates some special treatment, unless I’m
> > > missing something.
> > >
> > > Therefore, I propose simplifying the locking mechanism by ensuring the
> > > folio is locked before calling rmap_walk(). This helps avoid locking
> > > anon_vma when updating folio->mapping, which, for instance, will help
> > > eliminate the uninterruptible sleep observed in the field traces
> > > mentioned earlier. Furthermore, it enables us to simplify the code in
> > > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > > field hasn’t changed under us.
> > Hi Harry,
> >
> > Your comment [1] in the other thread was quite useful and also needed
> > to be responded to. So bringing it here for continuing discussion.
>
> Hi Lokesh,
>
> Here I'm quoting my previous comment for discussion. I should have done it
> earlier but you know, it was Friday night in Korea :)

No problem at all. :)
>
> My previous comment was:
>   Simply acquiring the folio lock instead of anon_vma lock isn't enough
>   1) because the kernel can't stabilize anon_vma without anon_vma lock
>   (an anon_vma cannot be freed while someone's holding anon_vma lock,
>   see anon_vma_free()).
>
>   2) without anon_vma lock the kernel can't reliably unmap folios because
>   they can be mapped to other processes (by fork()) while the kernel is
>   iterating list of VMAs that can possibly map the folio. fork() doesn't
>   and shouldn't acquire folio lock.
>
>   3) Holding anon_vma lock also prevents anon_vma_chains from
>      being freed while holding the lock.
>
>   [Are there more things to worry about that I missed?
>    Please add them if so]
>
>   Any idea to relax locking requirements while addressing these
>   requirements?
>
>   If some users don't care about missing some PTE A bits due to race
>   against fork() (perhaps folio_referenced()?), a crazy idea might be to
>   RCU-protect anon_vma_chains (but then we need to check if the VMA is
>   still alive) and use refcount to stabilize anon_vmas?
>
> > It seems from your comment that you misunderstood my proposal. I am
> > not suggesting replacing anon_vma lock with folio lock during rmap
> > walk. Clearly, it is essential for all the reasons that you
> > enumerated. My proposal is to lock anon folios during rmap_walk(),
> > like file and KSM folios.
>
> Still not sure if I follow your proposal. Let's clarify a little bit.
>
> As the anon_vma lock is a reader-writer semaphore, maybe you're saying
> 1) readers should acquire both folio lock and anon_vma lock, and
>
> > This helps in improving scalability (and also simplifying code in
> > folio_lock_anon_vma_read()) as then we can serialize on folio lock
> > instead of anon_vma lock when moving the folio to a different root
> > anon_vma in folio_move_anon_rmap() [2].
>
> 2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c)
>    simply update folio->index and folio->mapping, and they should be able
>    to run in parallel if they're not updating the same folio,
>    by taking folio lock and avoiding anon_vma lock?

Yes, that's exactly what I am hoping to achieve.
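
To make the intended ordering concrete, here is a rough userspace sketch.
It is a toy model under stated assumptions, not kernel code: every
model_* name is hypothetical, the "locks" are plain flags, and nothing
here reflects how rmap_walk() or move_pages_pte() are actually written.
It only shows walkers taking the folio lock before the anon_vma in
shared mode, and a mover updating mapping/index under the folio lock
alone.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace toy model only. Every name below is hypothetical; the
 * "locks" are plain flags so that the ordering is easy to follow.
 */
struct model_anon_vma {
	int readers;	/* walkers currently holding it in shared mode */
};

struct model_folio {
	bool locked;			/* stands in for the folio lock */
	struct model_anon_vma *mapping;	/* stands in for folio->mapping */
	unsigned long index;		/* stands in for folio->index */
};

static bool model_folio_trylock(struct model_folio *folio)
{
	if (folio->locked)
		return false;
	folio->locked = true;
	return true;
}

static void model_folio_unlock(struct model_folio *folio)
{
	assert(folio->locked);
	folio->locked = false;
}

/*
 * Walker side: take the folio lock first, then the anon_vma in shared
 * mode. Returns false if the folio was busy and the walk was skipped.
 */
static bool model_rmap_walk(struct model_folio *folio)
{
	struct model_anon_vma *av;

	if (!model_folio_trylock(folio))
		return false;
	av = folio->mapping;	/* cannot change while the folio is locked */
	av->readers++;
	/* ... visit the VMAs that may map the folio ... */
	assert(folio->mapping == av);
	av->readers--;
	model_folio_unlock(folio);
	return true;
}

/*
 * Mover side (think uffd MOVE): only the folio lock is taken before
 * mapping/index are updated; the source anon_vma is never write-locked.
 */
static bool model_move_folio(struct model_folio *folio,
			     struct model_anon_vma *dst, unsigned long index)
{
	if (!model_folio_trylock(folio))
		return false;
	folio->mapping = dst;
	folio->index = index;
	model_folio_unlock(folio);
	return true;
}
```

In this model, two movers working on distinct folios from the same
source anon_vma never touch a shared lock, while a mover and a walker
on the same folio exclude each other through the folio lock. Whether
the real code can rely on that alone is exactly what this thread is
trying to establish.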
>
> I see a comment in move_pages_pte():
> /*
>  * folio_referenced walks the anon_vma chain
>  * without the folio lock. Serialize against it with
>  * the anon_vma lock, the folio lock is not enough.
>  */
>
> > [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo
> > [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com
> > >
> > > Thanks,
> > > Lokesh
>
> --
> Cheers,
> Harry / Hyeonggon
