From: Lorenzo Stoakes <ljs@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, linux-mm@kvack.org,
Thomas Gleixner <tglx@kernel.org>,
Ingo Molnar <mingo@redhat.com>, Dmitry Ilvokhin <d@ilvokhin.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
Vlastimil Babka <vbabka@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
kernel-team@meta.com
Subject: Re: [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
Date: Thu, 25 Jun 2026 08:34:42 +0100 [thread overview]
Message-ID: <ajzUCZSRU3h_UdjR@lucifer> (raw)
In-Reply-To: <20260625015053.2445008-3-riel@surriel.com>
Rik, it really would have helped if you'd replied to review :)
On Wed, Jun 24, 2026 at 09:50:52PM -0400, Rik van Riel wrote:
> folio_walk_start() asserts the mmap lock is held. For callers that only
> need to read a single, already-present page, the mmap lock is a heavy and
> often badly contended hammer. Such a caller can instead hold the per-VMA
> lock, which keeps the VMA itself stable.
<newline>
> The per-VMA lock does not, however, keep the page tables walked below that
> VMA from being freed. A concurrent munmap() or THP collapse of an
> adjacent region in the same mm can free a shared upper-level table, and
Yeah I need to update the documentation on this at
https://docs.kernel.org/mm/process_addrs.html it's more subtle than written
there.
Firstly you're wrong about munmap() - it acquires the VMA lock of the VMAs freed
in the range and will only remove an upper level table if the entire range is
spanned.
And that's the only way higher level tables can be removed.
PTE page tables can be removed via MADV_DONTNEED, but that a. acquires the VMA
lock and b. frees the PTE page table under RCU.
A THP collapse can happen concurrently, but PTEs are freed under RCU so you
don't need to do this GUP fast imitating stuff.
> THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
> tables of VMAs whose lock it does not hold. Page table freeing
retract_page_tables() -> pte_free_defer() -> RCU
try_collapse_pte_mapped_thp() -> pte_free_defer() -> RCU
> synchronizes against lockless walkers the way gup_fast relies on:
> tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
> interrupts, so a walker that keeps interrupts disabled across the walk
> cannot be observing a table that is about to be freed. rcu_read_lock() is
> not sufficient -- it does not block that IPI -- so the caller must keep
Yes it is?
I mean unless I'm missing something here.
> interrupts disabled, not merely hold an RCU read-side critical section.
>
> Add an FW_VMA_LOCKED flag. When passed, folio_walk_start() asserts the
> per-VMA lock and that interrupts are disabled, instead of asserting the
> mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
> hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
> cover). The caller must keep interrupts disabled until folio_walk_end().
>
> No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> include/linux/pagewalk.h | 7 +++++++
> mm/pagewalk.c | 29 +++++++++++++++++++++++++++--
> 2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index b41d7265c01b..d0387470d732 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
>
> /* Walk shared zeropages (small + huge) as well. */
> #define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(0))
> +/*
> + * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
> + * disabled across the walk (until folio_walk_end()) to serialize against page
> + * table freeing, the same way gup_fast does. Only valid with RCU-freed page
> + * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
> + */
> +#define FW_VMA_LOCKED ((__force folio_walk_flags_t)BIT(1))
>
> enum folio_walk_level {
> FW_LEVEL_PTE,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..ab1e81983cb8 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
> * not correspond to the first physical entry of a logical hugetlb entry.
> *
> - * The mmap lock must be held in read mode.
> + * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
> + * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
> + * across the walk and until folio_walk_end() (only supported with RCU-freed page
> + * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
> *
> * Return: folio pointer on success, otherwise NULL.
> */
> @@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> pgd_t *pgdp;
> p4d_t *p4dp;
>
> - mmap_assert_locked(vma->vm_mm);
> + if (flags & FW_VMA_LOCKED) {
> + /*
> + * Lockless walk under the per-VMA lock instead of the mmap
> + * lock. The VMA lock keeps the VMA stable, but the page tables
> + * walked below it can still be freed concurrently: a munmap() or
> + * THP collapse of an adjacent region in the same mm can free a
> + * shared upper-level table, and collapse_huge_page() ->
> + * retract_page_tables() frees page tables of VMAs whose lock it
> + * does not hold. Page table freeing serializes against lockless
> + * walkers via tlb_remove_table_sync_one(), which IPIs and waits
> + * for every CPU to enable interrupts; an RCU read-side critical
> + * section does not block that IPI, so the caller must keep
> + * interrupts disabled across the whole walk, like gup_fast.
> + * Hugetlb (PMD sharing) maps page tables not covered by this
> + * VMA's lock and is not supported.
> + */
This is an unreadable wall of text, if it's AI generated please edit before
sending.
> + VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
> + VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
> + lockdep_assert_irqs_disabled();
> + vma_assert_locked(vma);
> + } else {
> + mmap_assert_locked(vma->vm_mm);
> + }
> vma_pgtable_walk_begin(vma);
>
> if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
> --
> 2.53.0-Meta
>
Thanks, Lorenzo
next prev parent reply other threads:[~2026-06-25 7:34 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-25 1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
2026-06-25 1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
2026-06-25 1:50 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
2026-06-25 7:34 ` Lorenzo Stoakes [this message]
2026-06-25 1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-25 7:39 ` Lorenzo Stoakes
2026-06-25 6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
2026-06-25 7:47 ` Lorenzo Stoakes
-- strict thread matches above, loose matches on Subject: below --
2026-06-16 19:02 [PATCH " Rik van Riel
2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the " Rik van Riel
2026-06-19 12:34 ` Lorenzo Stoakes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ajzUCZSRU3h_UdjR@lucifer \
--to=ljs@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=bp@alien8.de \
--cc=d@ilvokhin.com \
--cc=dave.hansen@linux.intel.com \
--cc=david@kernel.org \
--cc=kernel-team@meta.com \
--cc=liam@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mingo@redhat.com \
--cc=riel@surriel.com \
--cc=surenb@google.com \
--cc=tglx@kernel.org \
--cc=vbabka@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox