Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Harry Yoo <harry.yoo@oracle.com>
To: Lokesh Gidra <lokeshgidra@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>, Barry Song <21cnbao@gmail.com>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>,
	Peter Xu <peterx@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	android-mm <android-mm@google.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Jann Horn <jannh@google.com>, Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [DISCUSSION] Unconditionally lock folios when calling rmap_walk()
Date: Sun, 24 Aug 2025 14:31:10 +0900	[thread overview]
Message-ID: <aKqjWqn8lrITKI7P@hyeyoo> (raw)
In-Reply-To: <CA+EESO7j4dY3KjBWybTG6uQmXJ8kyhBrid3rTk5XAP7poZOhYQ@mail.gmail.com>

On Sat, Aug 23, 2025 at 09:18:11PM -0700, Lokesh Gidra wrote:
> On Fri, Aug 22, 2025 at 10:29 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > Hi all,
> >
> > Currently, some callers of rmap_walk() conditionally avoid try-locking
> > non-ksm anon folios. This necessitates serialization through anon_vma
> > write-lock elsewhere when folio->mapping and/or folio->index (fields
> > involved in rmap_walk()) are to be updated. This hurts scalability due
> > to coarse granularity of the lock. For instance, when multiple threads
> > invoke userfaultfd’s MOVE ioctl simultaneously to move distinct pages
> > from the same src VMA, they all contend for the corresponding
> > anon_vma’s lock. Field traces for arm64 android devices reveal over
> > 30ms of uninterruptible sleep in the main UI thread, leading to janky
> > user interactions.
> >
> > Among all rmap_walk() callers that don’t lock anon folios,
> > folio_referenced() is the most critical (others are
> > page_idle_clear_pte_refs(), damon_folio_young(), and
> > damon_folio_mkold()). The relevant code in folio_referenced() is:
> >
> > if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) {
> >         we_locked = folio_trylock(folio);
> >         if (!we_locked)
> >                 return 1;
> > }
> >
> > It’s unclear why locking anon_vma exclusively (when updating
> > folio->mapping, like in uffd MOVE) is beneficial over walking rmap
> > with folio locked. It’s in the reclaim path, so should not be a
> > critical path that necessitates some special treatment, unless I’m
> > missing something.
> >
> > Therefore, I propose simplifying the locking mechanism by ensuring the
> > folio is locked before calling rmap_walk(). This helps avoid locking
> > anon_vma when updating folio->mapping, which, for instance, will help
> > eliminate the uninterruptible sleep observed in the field traces
> > mentioned earlier. Furthermore, it enables us to simplify the code in
> > folio_lock_anon_vma_read() by removing the re-check to ensure that the
> > field hasn’t changed under us.
> Hi Harry,
>
> Your comment [1] in the other thread was quite useful and also needed
> to be responded to. So bringing it here for continuing discussion.

Hi Lokesh,

Here I'm quoting my previous comment for discussion. I should have done it
earlier but you know, it was Friday night in Korea :)

My previous comment was:
  Simply acquiring the folio lock instead of anon_vma lock isn't enough
  1) because the kernel can't stablize anon_vma without anon_vma lock
  (an anon_vma cannot be freed while someone's holding anon_vma lock,
  see anon_vma_free()).

  2) without anon_vma lock the kernel can't reliably unmap folios because
  they can be mapped to other processes (by fork()) while the kernel is
  iterating list of VMAs that can possibly map the folio. fork() doens't
  and shouldn't acquire folio lock.

  3) Holding anon_vma lock also prevents anon_vma_chains from
     being freed while holding the lock.

  [Are there more things to worry about that I missed?
   Please add them if so]

  Any idea to relax locking requirements while addressing these
  requirements?

  If some users don't care about missing some PTE A bits due to race
  against fork() (perhaps folio_referenced()?), a crazy idea might be to
  RCU-protect anon_vma_chains (but then we need to check if the VMA is
  still alive) and use refcount to stablize anon_vmas?

> It seems from your comment that you misunderstood my proposal. I am
> not suggesting replacing anon_vma lock with folio lock during rmap
> walk. Clearly, it is essential for all the reasons that you
> enumerated. My proposal is to lock anon folios during rmap_walk(),
> like file and KSM folios.

Still not sure if I follow your proposal. Let's clarify a little bit.

As anon_vma lock is reader-writer semaphore, maybe you're saying
1) readers should acquire both folio lock and anon_vma lock, and

> This helps in improving scalability (and also simplifying code in
> folio_lock_anon_vma_read()) as then we can serialize on folio lock
> instead of anon_vma lock when moving the folio to a different root
> anon_vma in folio_move_anon_rmap() [2].

2) some of existing writers (e.g., move_pages_pte() in mm/userfaultfd.c)
   simply update folio->index and folio->mapping, and they should be able
   to run in parallel if they're not updating the same folio,
   by taking folio lock and avoiding anon_vma lock?

I see a comment in move_pages_pte():
/*
 * folio_referenced walks the anon_vma chain
 * without the folio lock. Serialize against it with
 * the anon_vma lock, the folio lock is not enough.
 */

> [1] https://lore.kernel.org/all/aKhIL3OguViS9myH@hyeyoo
> [2] https://lore.kernel.org/all/e5d41fbe-a91b-9491-7b93-733f67e75a54@redhat.com
> > > > Thanks,
> > Lokesh

-- 
Cheers,
Harry / Hyeonggon

next prev parent reply	other threads:[~2025-08-24  5:31 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-22 17:29 [DISCUSSION] Unconditionally lock folios when calling rmap_walk() Lokesh Gidra
2025-08-22 17:44 ` Matthew Wilcox
2025-08-22 18:08   ` Lorenzo Stoakes
2025-08-24  4:18 ` Lokesh Gidra
2025-08-24  5:31   ` Harry Yoo [this message]
2025-08-24  6:45     ` Lokesh Gidra
2025-08-25 10:52       ` Harry Yoo
2025-08-28 11:32         ` Lorenzo Stoakes
2025-08-25 15:19 ` David Hildenbrand
2025-08-25 18:46   ` Lokesh Gidra
2025-08-28 12:04   ` Lorenzo Stoakes
2025-08-29  0:23     ` Lokesh Gidra
2025-08-29  8:42       ` David Hildenbrand
2025-08-29  9:04         ` Lorenzo Stoakes
2025-09-02 18:59           ` Lokesh Gidra
2025-09-02 19:01             ` David Hildenbrand
2025-09-02 19:04               ` Lokesh Gidra
2025-08-29  9:01       ` Lorenzo Stoakes
2025-08-26 15:52 ` Lorenzo Stoakes
2025-08-26 22:23   ` Lokesh Gidra
2025-08-28 11:31     ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aKqjWqn8lrITKI7P@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=21cnbao@gmail.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=android-mm@google.com \
    --cc=david@redhat.com \
    --cc=jannh@google.com \
    --cc=kaleshsingh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lokeshgidra@google.com \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).