All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: Rik van Riel <riel@surriel.com>,
	x86@kernel.org, linux-mm@kvack.org,
	"Thomas Gleixner" <tglx@kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Dmitry Ilvokhin" <d@ilvokhin.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"David Hildenbrand" <david@kernel.org>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	kernel-team@meta.com
Subject: [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
Date: Wed, 24 Jun 2026 21:50:52 -0400	[thread overview]
Message-ID: <20260625015053.2445008-3-riel@surriel.com> (raw)
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

folio_walk_start() asserts the mmap lock is held.  For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer.  Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.

The per-VMA lock does not, however, keep the page tables walked below that
VMA from being freed.  A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold.  Page table freeing
synchronizes against lockless walkers the way gup_fast relies on:
tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
interrupts, so a walker that keeps interrupts disabled across the walk
cannot be observing a table that is about to be freed.  rcu_read_lock() is
not sufficient -- it does not block that IPI -- so the caller must keep
interrupts disabled, not merely hold an RCU read-side critical section.

Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
per-VMA lock and that interrupts are disabled, instead of asserting the
mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
cover).  The caller must keep interrupts disabled until folio_walk_end().

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk under the per-VMA lock instead of the mmap
+		 * lock. The VMA lock keeps the VMA stable, but the page tables
+		 * walked below it can still be freed concurrently: a munmap() or
+		 * THP collapse of an adjacent region in the same mm can free a
+		 * shared upper-level table, and collapse_huge_page() ->
+		 * retract_page_tables() frees page tables of VMAs whose lock it
+		 * does not hold. Page table freeing serializes against lockless
+		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+		 * for every CPU to enable interrupts; an RCU read-side critical
+		 * section does not block that IPI, so the caller must keep
+		 * interrupts disabled across the whole walk, like gup_fast.
+		 * Hugetlb (PMD sharing) maps page tables not covered by this
+		 * VMA's lock and is not supported.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		lockdep_assert_irqs_disabled();
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta



  parent reply	other threads:[~2026-06-25  1:51 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
2026-06-25  1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
2026-06-25  1:50 ` Rik van Riel [this message]
2026-06-25  7:34   ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Lorenzo Stoakes
2026-06-25  1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-25  7:39   ` Lorenzo Stoakes
2026-06-25  6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
2026-06-25  7:47   ` Lorenzo Stoakes
  -- strict thread matches above, loose matches on Subject: below --
2026-06-16 19:02 [PATCH " Rik van Riel
2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the " Rik van Riel
2026-06-19 12:34   ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260625015053.2445008-3-riel@surriel.com \
    --to=riel@surriel.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=d@ilvokhin.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mingo@redhat.com \
    --cc=surenb@google.com \
    --cc=tglx@kernel.org \
    --cc=vbabka@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.