[PATCH 0/3] mm: __access_remote_vm with per-VMA lock

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
@ 2026-06-16 19:02 Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

Sometimes processes can get stuck with the mmap_lock held for
a long time. This slows down, and can even prevent system monitoring
tools from assessing and logging the situation, because they themselves
end up getting stuck on the mmap_lock.

However, with the introduction of per-VMA locks, we can improve the
reliability of system monitoring, and generally speed up __access_remote_vm
under mmap_loc contention, by adding a fast path that does not require
the process-wide mmap_lock.

This fast path is only compiled in and used when it is safe to do so,
meaning a kernel with per-VMA locks, RCU pgae table freeing, the VMA
is not hugetlbfs, iomap, pfnmap, etc...

The code seems to work, but could still use some more cleaning up
and benchmarking.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
@ 2026-06-16 19:02 ` Rik van Riel
  2026-06-18 16:40   ` Usama Arif
  2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

mm->context.untag_mask is written once, when LAM is enabled
(mm_enable_lam(), under mmap_write_lock and while the process is still
single-threaded), and is otherwise stable and never reverted.
untagged_addr_remote() reads it for a remote mm, and the new
untagged_addr_remote_unlocked() (used by the per-VMA-lock
access_remote_vm() fast path) reads it without the mmap lock.

The field is a single aligned word and cannot tear, but annotate the
reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
access explicit and keep the compiler from reloading or tearing it.

No functional change.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu_context.h | 6 +++---
 arch/x86/include/asm/uaccess_64.h  | 2 +-
 arch/x86/kernel/process_64.c       | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e..cee710f64658 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,18 +100,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
 static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
-	mm->context.untag_mask = oldmm->context.untag_mask;
+	WRITE_ONCE(mm->context.untag_mask, READ_ONCE(oldmm->context.untag_mask));
 }
 
 #define mm_untag_mask mm_untag_mask
 static inline unsigned long mm_untag_mask(struct mm_struct *mm)
 {
-	return mm->context.untag_mask;
+	return READ_ONCE(mm->context.untag_mask);
 }
 
 static inline void mm_reset_untag_mask(struct mm_struct *mm)
 {
-	mm->context.untag_mask = -1UL;
+	WRITE_ONCE(mm->context.untag_mask, -1UL);
 }
 
 #define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..4a52497ba6a1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -43,7 +43,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & (mm)->context.untag_mask;
+	return addr & READ_ONCE((mm)->context.untag_mask);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d44afbe005bb..55096136de53 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -814,7 +814,7 @@ static void enable_lam_func(void *__mm)
 static void mm_enable_lam(struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
-	mm->context.untag_mask =  ~GENMASK(62, 57);
+	WRITE_ONCE(mm->context.untag_mask, ~GENMASK(62, 57));
 
 	/*
 	 * Even though the process must still be single-threaded at this
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
@ 2026-06-16 19:02 ` Rik van Riel
  2026-06-19 12:34   ` Lorenzo Stoakes
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
  3 siblings, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

folio_walk_start() asserts that the mmap lock is held.  For callers that
only need to read a single, already-present page, the mmap lock is a
heavy and often badly contended hammer: the VMA can instead be
stabilized with the per-VMA lock, and the page table pages that are
walked are kept alive by RCU page-table freeing
(CONFIG_MMU_GATHER_RCU_TABLE_FREE).

Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
per-VMA lock instead of the mmap lock, requires RCU-freed page tables,
and refuses hugetlb VMAs (PMD sharing cannot be walked safely this way).
Everything else folio_walk_start() relies on -- the page table locks,
pmdp_get_lockless() and pte_offset_map_lock() -- is already safe without
the mmap lock, mirroring the per-VMA lock page fault path.

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/pagewalk.h |  5 +++++
 mm/pagewalk.c            | 18 ++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..84dd0d68f747 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,11 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock. Only valid with
+ * RCU-freed page tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..c85364b73e12 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held (only supported with RCU-freed
+ * page tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +910,19 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk: the per-VMA lock keeps the VMA stable, and
+		 * RCU-freed page tables keep the walked page table pages alive
+		 * across the lockless upper-level walk and pte_offset_map_lock().
+		 * Hugetlb (PMD sharing) is not supported on this path.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
  2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
@ 2026-06-16 19:03 ` Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
                     ` (2 more replies)
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
  3 siblings, 3 replies; 24+ messages in thread
From: Rik van Riel @ 2026-06-16 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common
case of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path that transfers what it can
without the mmap lock.  For each address it takes the per-VMA lock with
lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
a present page before copying it out.  Anything non-trivial -- a not-
present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
a race with a VMA writer -- falls back to the existing mmap_lock path
for the remainder.

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/uaccess_64.h |  12 +++
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
 3 files changed, 188 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..c6fac900a747 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
 })
 
+/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE((mm)->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 #endif
 
 #define valid_user_address(x) \
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..0b23b82eaa18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the heavily
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
+ * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Anything that would require faulting a page in, touching a hugetlb or
+ * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
+ * path in __access_remote_vm().  Only reads are handled here.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	while (len) {
+		struct vm_area_struct *vma;
+		vm_flags_t vm_flags;
+
+		vma = lock_vma_under_rcu(mm, addr);
+		if (!vma)
+			break;
+
+		/*
+		 * Mirror the read-side permission checks of check_vma_flags(),
+		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
+		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
+		 * per VMA; anything not positively allowed falls back to the
+		 * slow path, which re-validates everything.
+		 */
+		vm_flags = vma->vm_flags;
+		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
+		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
+		    (!(vm_flags & VM_READ) &&
+		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
+			vma_end_read(vma);
+			break;
+		}
+
+		/*
+		 * Copy as much of this VMA as we can without re-acquiring the
+		 * per-VMA lock; re-lock only when @addr leaves the VMA.
+		 */
+		while (len && addr < vma->vm_end) {
+			struct folio_walk fw;
+			struct folio *folio;
+			struct page *page;
+			unsigned long entry_size, entry_left, folio_left, span;
+			unsigned long copied, idx0;
+			int offset;
+
+			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+			if (!folio) {
+				vma_end_read(vma);
+				goto out;
+			}
+			page = fw.page;
+			if (!page) {
+				folio_walk_end(&fw, vma);
+				vma_end_read(vma);
+				goto out;
+			}
+			/* Pin the folio so it stays valid after the PTL is dropped. */
+			folio_get(folio);
+			folio_walk_end(&fw, vma);
+
+			/*
+			 * folio_walk_start() validated exactly one mapping entry,
+			 * which covers a contiguous, present run of this folio:
+			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
+			 * for a pud leaf.  Copy up to the end of that entry,
+			 * bounded by the folio, the VMA and len, so a huge mapping
+			 * is handled in one walk instead of per page.
+			 */
+			offset = offset_in_page(addr);
+			switch (fw.level) {
+			case FW_LEVEL_PUD:
+				entry_size = PUD_SIZE;
+				break;
+			case FW_LEVEL_PMD:
+				entry_size = PMD_SIZE;
+				break;
+			default:
+				entry_size = PAGE_SIZE;
+				break;
+			}
+			entry_left = entry_size - (addr & (entry_size - 1));
+			idx0 = folio_page_idx(folio, page);
+			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
+				     offset;
+			span = min3((unsigned long)len, entry_left, folio_left);
+			span = min(span, vma->vm_end - addr);
+
+			/*
+			 * Copy the span page-by-page: kmap_local_folio() maps one
+			 * page on HIGHMEM and copy_from_user_page() flushes per
+			 * page on aliasing caches, but the page tables are not
+			 * re-walked.  The span borrows the single folio reference
+			 * taken above, so each mapping is dropped with
+			 * kunmap_local() (not folio_release_kmap(), which would
+			 * also drop a folio reference per page).
+			 */
+			for (copied = 0; copied < span; ) {
+				unsigned long foff = offset + copied;
+				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
+				int poff = foff & ~PAGE_MASK;
+				int chunk = min_t(unsigned long, span - copied,
+						  PAGE_SIZE - poff);
+				void *maddr = kmap_local_folio(folio,
+						pidx << PAGE_SHIFT);
+
+				copy_from_user_page(vma, folio_page(folio, pidx),
+						    addr + copied, buf + copied,
+						    maddr + poff, chunk);
+				kunmap_local(maddr);
+				copied += chunk;
+			}
+
+			folio_put(folio);
+			len -= span;
+			buf += span;
+			addr += span;
+		}
+		vma_end_read(vma);
+	}
+out:
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
                   ` (2 preceding siblings ...)
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-17  1:10 ` Suren Baghdasaryan
  2026-06-17  9:42   ` David Hildenbrand (Arm)
  3 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17  1:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> Sometimes processes can get stuck with the mmap_lock held for
> a long time. This slows down, and can even prevent system monitoring
> tools from assessing and logging the situation, because they themselves
> end up getting stuck on the mmap_lock.
>
> However, with the introduction of per-VMA locks, we can improve the
> reliability of system monitoring, and generally speed up __access_remote_vm
> under mmap_loc contention, by adding a fast path that does not require
> the process-wide mmap_lock.
>
> This fast path is only compiled in and used when it is safe to do so,
> meaning a kernel with per-VMA locks, RCU pgae table freeing, the VMA
> is not hugetlbfs, iomap, pfnmap, etc...
>
> The code seems to work, but could still use some more cleaning up
> and benchmarking.

Thanks for the patchset Rik!
Previously when I looked into using per-VMA locks in
access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
which required mmap_lock. Your implementation avoids altogether and
keeps the code much simpler than what I expected. I very much support
this approach and will start reviewing in more details.
One question: CONFIG_MMU_GATHER_RCU_TABLE_FREE still has this comment
about "Semi RCU freeing of the page directories." [1]. Does that
mechanism being "semi RCU safe" pose an issue here? I'll need to
refresh my memory on why it's only semi-RCU safe but you might have
the answer ready.

[1] https://elixir.bootlin.com/linux/v7.1/source/mm/mmu_gather.c#L236

Thanks,
Suren.

>
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-19 12:24     ` Lorenzo Stoakes
  2026-06-19 13:46     ` Rik van Riel
  2026-06-18 17:01   ` Usama Arif
  2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 2 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.

I don't think we should be using per-VMA locks if the read spans
multiple VMAs. Doing that would risk a possibility of reading
inconsistent data since we are locking one VMA at a time. While we
load and read VMA, its neighboring VMA can be unmapped and another one
can be mapped in its place. So, our read spanning both VMAs will
return inconsistent data. access_remote_vm_fast() can check if the
entire read is contained within one VMA and if not, fall back to
mmap_lock.

>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>         (__force __typeof__(addr))__untagged_addr_remote(mm, __addr);   \
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +                                                           unsigned long addr)
> +{
> +       return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)        ({                      \
> +       unsigned long __addr = (__force unsigned long)(addr);           \
> +       (__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)        ({      \
> +       (void)(mm);                                     \
> +       untagged_addr(addr);                            \
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Just FYI, Dave Hansen is working on a patchset [1] that removes
CONFIG_PER_VMA_LOCK and makes per-VMA locks always available. This
would simplify this patchset too.

[1] https://lore.kernel.org/all/20260610230409.A44D29FA@davehans-spike.ostc.intel.com/

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       void *old_buf = buf;
> +
> +       addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +       while (len) {
> +               struct vm_area_struct *vma;
> +               vm_flags_t vm_flags;
> +
> +               vma = lock_vma_under_rcu(mm, addr);
> +               if (!vma)
> +                       break;
> +
> +               /*
> +                * Mirror the read-side permission checks of check_vma_flags(),
> +                * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +                * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +                * per VMA; anything not positively allowed falls back to the
> +                * slow path, which re-validates everything.
> +                */
> +               vm_flags = vma->vm_flags;
> +               if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +                   is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +                   (!(vm_flags & VM_READ) &&
> +                    (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +                       vma_end_read(vma);
> +                       break;
> +               }
> +
> +               /*
> +                * Copy as much of this VMA as we can without re-acquiring the
> +                * per-VMA lock; re-lock only when @addr leaves the VMA.
> +                */
> +               while (len && addr < vma->vm_end) {
> +                       struct folio_walk fw;
> +                       struct folio *folio;
> +                       struct page *page;
> +                       unsigned long entry_size, entry_left, folio_left, span;
> +                       unsigned long copied, idx0;
> +                       int offset;
> +
> +                       folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +                       if (!folio) {
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       page = fw.page;
> +                       if (!page) {
> +                               folio_walk_end(&fw, vma);
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       /* Pin the folio so it stays valid after the PTL is dropped. */
> +                       folio_get(folio);
> +                       folio_walk_end(&fw, vma);
> +
> +                       /*
> +                        * folio_walk_start() validated exactly one mapping entry,
> +                        * which covers a contiguous, present run of this folio:
> +                        * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +                        * for a pud leaf.  Copy up to the end of that entry,
> +                        * bounded by the folio, the VMA and len, so a huge mapping
> +                        * is handled in one walk instead of per page.
> +                        */
> +                       offset = offset_in_page(addr);
> +                       switch (fw.level) {
> +                       case FW_LEVEL_PUD:
> +                               entry_size = PUD_SIZE;
> +                               break;
> +                       case FW_LEVEL_PMD:
> +                               entry_size = PMD_SIZE;
> +                               break;
> +                       default:
> +                               entry_size = PAGE_SIZE;
> +                               break;
> +                       }
> +                       entry_left = entry_size - (addr & (entry_size - 1));
> +                       idx0 = folio_page_idx(folio, page);
> +                       folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +                                    offset;
> +                       span = min3((unsigned long)len, entry_left, folio_left);
> +                       span = min(span, vma->vm_end - addr);
> +
> +                       /*
> +                        * Copy the span page-by-page: kmap_local_folio() maps one
> +                        * page on HIGHMEM and copy_from_user_page() flushes per
> +                        * page on aliasing caches, but the page tables are not
> +                        * re-walked.  The span borrows the single folio reference
> +                        * taken above, so each mapping is dropped with
> +                        * kunmap_local() (not folio_release_kmap(), which would
> +                        * also drop a folio reference per page).
> +                        */
> +                       for (copied = 0; copied < span; ) {
> +                               unsigned long foff = offset + copied;
> +                               unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +                               int poff = foff & ~PAGE_MASK;
> +                               int chunk = min_t(unsigned long, span - copied,
> +                                                 PAGE_SIZE - poff);
> +                               void *maddr = kmap_local_folio(folio,
> +                                               pidx << PAGE_SHIFT);
> +
> +                               copy_from_user_page(vma, folio_page(folio, pidx),
> +                                                   addr + copied, buf + copied,
> +                                                   maddr + poff, chunk);
> +                               kunmap_local(maddr);
> +                               copied += chunk;
> +                       }
> +
> +                       folio_put(folio);
> +                       len -= span;
> +                       buf += span;
> +                       addr += span;
> +               }
> +               vma_end_read(vma);
> +       }
> +out:
> +       return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>         void *old_buf = buf;
>         int write = gup_flags & FOLL_WRITE;
>
> +       /*
> +        * Try the lockless fast path for reads first; it transfers what it can
> +        * from resident memory without taking mmap_lock, and leaves the
> +        * remainder (if any) to the slow path below.
> +        */
> +       if (!write) {
> +               int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +               addr += done;
> +               buf += done;
> +               len -= done;
> +               if (!len)
> +                       return buf - old_buf;
> +       }
> +
>         if (mmap_read_lock_killable(mm))
> -               return 0;
> +               return buf - old_buf;
>
>         /* Untag the address before looking up the VMA */
>         addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
@ 2026-06-17  9:42   ` David Hildenbrand (Arm)
  2026-06-17 13:33     ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-17  9:42 UTC (permalink / raw)
  To: Suren Baghdasaryan, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/17/26 03:10, Suren Baghdasaryan wrote:
> On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>>
>> Sometimes processes can get stuck with the mmap_lock held for
>> a long time. This slows down, and can even prevent system monitoring
>> tools from assessing and logging the situation, because they themselves
>> end up getting stuck on the mmap_lock.
>>
>> However, with the introduction of per-VMA locks, we can improve the
>> reliability of system monitoring, and generally speed up __access_remote_vm
>> under mmap_loc contention, by adding a fast path that does not require
>> the process-wide mmap_lock.
>>
>> This fast path is only compiled in and used when it is safe to do so,
>> meaning a kernel with per-VMA locks, RCU pgae table freeing, the VMA
>> is not hugetlbfs, iomap, pfnmap, etc...
>>
>> The code seems to work, but could still use some more cleaning up
>> and benchmarking.
> 
> Thanks for the patchset Rik!
> Previously when I looked into using per-VMA locks in
> access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
> which required mmap_lock. Your implementation avoids altogether and
> keeps the code much simpler than what I expected.

But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
locks?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17  9:42   ` David Hildenbrand (Arm)
@ 2026-06-17 13:33     ` Suren Baghdasaryan
  2026-06-18 20:37       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17 13:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Wed, Jun 17, 2026 at 2:42 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/17/26 03:10, Suren Baghdasaryan wrote:
> > On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
> >>
> >> Sometimes processes can get stuck with the mmap_lock held for
> >> a long time. This slows down, and can even prevent system monitoring
> >> tools from assessing and logging the situation, because they themselves
> >> end up getting stuck on the mmap_lock.
> >>
> >> However, with the introduction of per-VMA locks, we can improve the
> >> reliability of system monitoring, and generally speed up __access_remote_vm
> >> under mmap_loc contention, by adding a fast path that does not require
> >> the process-wide mmap_lock.
> >>
> >> This fast path is only compiled in and used when it is safe to do so,
> >> meaning a kernel with per-VMA locks, RCU pgae table freeing, the VMA
> >> is not hugetlbfs, iomap, pfnmap, etc...
> >>
> >> The code seems to work, but could still use some more cleaning up
> >> and benchmarking.
> >
> > Thanks for the patchset Rik!
> > Previously when I looked into using per-VMA locks in
> > access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
> > which required mmap_lock. Your implementation avoids altogether and
> > keeps the code much simpler than what I expected.
>
> But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
> locks?

Matthew suggested using gup_fast in access_remote_vm before, and I
looked into that. The biggest issue there is that gup_fast assumes it
always operates on current->mm, not on the remote one. Reworking that
is quite an undertaking.
Teaching GUP in general to work with per-VMA locks I think would also
be much harder than what this patchset does and would require a GUP
expert (which I am unfortunately not).

>
> --
> Cheers,
>
> David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
@ 2026-06-18 16:40   ` Usama Arif
  0 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-06-18 16:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan

On Tue, 16 Jun 2026 15:02:58 -0400 Rik van Riel <riel@surriel.com> wrote:

> mm->context.untag_mask is written once, when LAM is enabled
> (mm_enable_lam(), under mmap_write_lock and while the process is still
> single-threaded), and is otherwise stable and never reverted.
> untagged_addr_remote() reads it for a remote mm, and the new
> untagged_addr_remote_unlocked() (used by the per-VMA-lock
> access_remote_vm() fast path) reads it without the mmap lock.
> 
> The field is a single aligned word and cannot tear, but annotate the
> reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
> access explicit and keep the compiler from reloading or tearing it.
> 
> No functional change.
> 
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 6 +++---
>  arch/x86/include/asm/uaccess_64.h  | 2 +-
>  arch/x86/kernel/process_64.c       | 2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 

Acked-by: Usama Arif <usama.arif@linux.dev>
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
@ 2026-06-18 17:01   ` Usama Arif
  2026-06-18 17:07     ` David Hildenbrand (Arm)
  2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 1 reply; 24+ messages in thread
From: Usama Arif @ 2026-06-18 17:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan

On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:

> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
> 
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
> 
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
> 
> Only reads are handled here; writes keep using the slow path.
> 
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>  
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>  
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>  
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\
> +	untagged_addr(addr);				\
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>  
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */
> +		vm_flags = vma->vm_flags;
> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&
> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +			vma_end_read(vma);
> +			break;
> +		}

This should also do the FOLL_ANON check from check_vma_flags().

check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:

	if ((gup_flags & FOLL_ANON) && !vma_anon)
		return -EFAULT;

That flag is used by fs/proc/base.c for /proc/PID/cmdline and
/proc/PID/environ.  It was added by commit 7f7ccc2ccc2e ("proc: do not
access cmdline nor environ from file-backed areas"), which fixed
CVE-2018-1120 by making those proc files refuse file-backed argv/env
areas.

> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */
> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;
> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;
> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {
> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));
> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;
> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);
> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */
> +			for (copied = 0; copied < span; ) {
> +				unsigned long foff = offset + copied;
> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}

__access_remote_vm() slow path calls get_user_page_vma_remote() which calls
get_user_pages_remote(). get_user_pages_remote() adds FOLL_TOUCH and then the
page-table walk eventually reaches follow_page_pte().

The new resident-page fast path copies the same data without doing an
equivalent folio_mark_accessed().

That changes reclaim behaviour for pages repeatedly read through
access_remote_vm(), such as /proc/PID/cmdline polling. I think you
should mark the folio as accessed.


> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);
> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>  
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */
> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;
> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;
>  
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> -- 
> 2.53.0-Meta
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:01   ` Usama Arif
@ 2026-06-18 17:07     ` David Hildenbrand (Arm)
  2026-06-18 17:22       ` Usama Arif
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 17:07 UTC (permalink / raw)
  To: Usama Arif, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On 6/18/26 19:01, Usama Arif wrote:
> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>> uses get_user_pages_remote(), which faults pages in.  For the common
>> case of reading memory that is already resident -- /proc/PID/cmdline,
>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>> unnecessary and is badly contended on large machines.
>>
>> Add an opportunistic, read-only fast path that transfers what it can
>> without the mmap lock.  For each address it takes the per-VMA lock with
>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>> a present page before copying it out.  Anything non-trivial -- a not-
>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>> a race with a VMA writer -- falls back to the existing mmap_lock path
>> for the remainder.
>>
>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>> for the fast path; the untag mask is a stable per-mm value.
>>
>> Only reads are handled here; writes keep using the slow path.
>>
>> Assisted-by: Claude:claude-opus-4-8
>> Signed-off-by: Rik van Riel <riel@surriel.com>
>> ---
>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>  include/linux/uaccess.h           |  11 ++
>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index 4a52497ba6a1..c6fac900a747 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>  })
>>  
>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>> +							    unsigned long addr)
>> +{
>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>> +}
>> +
>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>> +})
>> +
>>  #endif
>>  
>>  #define valid_user_address(x) \
>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>> index 8a264662b242..c8c83372c9d8 100644
>> --- a/include/linux/uaccess.h
>> +++ b/include/linux/uaccess.h
>> @@ -34,6 +34,17 @@
>>  })
>>  #endif
>>  
>> +/*
>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>> + */
>> +#ifndef untagged_addr_remote_unlocked
>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>> +	(void)(mm);					\
>> +	untagged_addr(addr);				\
>> +})
>> +#endif
>> +
>>  #ifdef masked_user_access_begin
>>   #define can_do_masked_user_access() 1
>>  # ifndef masked_user_write_access_begin
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 86a973119bd4..0b23b82eaa18 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -42,6 +42,8 @@
>>  #include <linux/kernel_stat.h>
>>  #include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>> +#include <linux/secretmem.h>
>> +#include <linux/pagewalk.h>
>>  #include <linux/sched/mm.h>
>>  #include <linux/sched/numa_balancing.h>
>>  #include <linux/sched/task.h>
>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>  #endif
>>  
>> +/*
>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>> + */
>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>> +/*
>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>> + *
>> + * Memory already resident in @mm can be read without taking the heavily
>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>> + *
>> + * Anything that would require faulting a page in, touching a hugetlb or
>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>> + * path in __access_remote_vm().  Only reads are handled here.
>> + *
>> + * Returns the number of bytes transferred via the fast path.
>> + */
>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>> +				 void *buf, int len, unsigned int gup_flags)
>> +{
>> +	void *old_buf = buf;
>> +
>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>> +
>> +	while (len) {
>> +		struct vm_area_struct *vma;
>> +		vm_flags_t vm_flags;
>> +
>> +		vma = lock_vma_under_rcu(mm, addr);
>> +		if (!vma)
>> +			break;
>> +
>> +		/*
>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>> +		 * per VMA; anything not positively allowed falls back to the
>> +		 * slow path, which re-validates everything.
>> +		 */
>> +		vm_flags = vma->vm_flags;
>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>> +		    (!(vm_flags & VM_READ) &&
>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>> +			vma_end_read(vma);
>> +			break;
>> +		}
> 
> This should also do the FOLL_ANON check from check_vma_flags().
> 
> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
> 
> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
> 		return -EFAULT;

Duplicating GUP logic in a non-GUP file. Splendid. :)

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:07     ` David Hildenbrand (Arm)
@ 2026-06-18 17:22       ` Usama Arif
  0 siblings, 0 replies; 24+ messages in thread
From: Usama Arif @ 2026-06-18 17:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan



On 18/06/2026 18:07, David Hildenbrand (Arm) wrote:
> On 6/18/26 19:01, Usama Arif wrote:
>> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
>>
>>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>>> uses get_user_pages_remote(), which faults pages in.  For the common
>>> case of reading memory that is already resident -- /proc/PID/cmdline,
>>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>>> unnecessary and is badly contended on large machines.
>>>
>>> Add an opportunistic, read-only fast path that transfers what it can
>>> without the mmap lock.  For each address it takes the per-VMA lock with
>>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>>> a present page before copying it out.  Anything non-trivial -- a not-
>>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>>> a race with a VMA writer -- falls back to the existing mmap_lock path
>>> for the remainder.
>>>
>>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>>> for the fast path; the untag mask is a stable per-mm value.
>>>
>>> Only reads are handled here; writes keep using the slow path.
>>>
>>> Assisted-by: Claude:claude-opus-4-8
>>> Signed-off-by: Rik van Riel <riel@surriel.com>
>>> ---
>>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>>  include/linux/uaccess.h           |  11 ++
>>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index 4a52497ba6a1..c6fac900a747 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>>  })
>>>  
>>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>>> +							    unsigned long addr)
>>> +{
>>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>>> +}
>>> +
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>>> +})
>>> +
>>>  #endif
>>>  
>>>  #define valid_user_address(x) \
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index 8a264662b242..c8c83372c9d8 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -34,6 +34,17 @@
>>>  })
>>>  #endif
>>>  
>>> +/*
>>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>>> + */
>>> +#ifndef untagged_addr_remote_unlocked
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>>> +	(void)(mm);					\
>>> +	untagged_addr(addr);				\
>>> +})
>>> +#endif
>>> +
>>>  #ifdef masked_user_access_begin
>>>   #define can_do_masked_user_access() 1
>>>  # ifndef masked_user_write_access_begin
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 86a973119bd4..0b23b82eaa18 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -42,6 +42,8 @@
>>>  #include <linux/kernel_stat.h>
>>>  #include <linux/mm.h>
>>>  #include <linux/mm_inline.h>
>>> +#include <linux/secretmem.h>
>>> +#include <linux/pagewalk.h>
>>>  #include <linux/sched/mm.h>
>>>  #include <linux/sched/numa_balancing.h>
>>>  #include <linux/sched/task.h>
>>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>>  #endif
>>>  
>>> +/*
>>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>>> + */
>>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>>> +/*
>>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>>> + *
>>> + * Memory already resident in @mm can be read without taking the heavily
>>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>>> + *
>>> + * Anything that would require faulting a page in, touching a hugetlb or
>>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>>> + * path in __access_remote_vm().  Only reads are handled here.
>>> + *
>>> + * Returns the number of bytes transferred via the fast path.
>>> + */
>>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>>> +				 void *buf, int len, unsigned int gup_flags)
>>> +{
>>> +	void *old_buf = buf;
>>> +
>>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>>> +
>>> +	while (len) {
>>> +		struct vm_area_struct *vma;
>>> +		vm_flags_t vm_flags;
>>> +
>>> +		vma = lock_vma_under_rcu(mm, addr);
>>> +		if (!vma)
>>> +			break;
>>> +
>>> +		/*
>>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>>> +		 * per VMA; anything not positively allowed falls back to the
>>> +		 * slow path, which re-validates everything.
>>> +		 */
>>> +		vm_flags = vma->vm_flags;
>>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>>> +		    (!(vm_flags & VM_READ) &&
>>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>>> +			vma_end_read(vma);
>>> +			break;
>>> +		}
>>
>> This should also do the FOLL_ANON check from check_vma_flags().
>>
>> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
>>
>> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
>> 		return -EFAULT;
> 
> Duplicating GUP logic in a non-GUP file. Splendid. :)
> 

Haha probably just need a common helper.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17 13:33     ` Suren Baghdasaryan
@ 2026-06-18 20:37       ` David Hildenbrand (Arm)
  2026-06-19 12:43         ` Lorenzo Stoakes
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 20:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/17/26 15:33, Suren Baghdasaryan wrote:
> On Wed, Jun 17, 2026 at 2:42 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 6/17/26 03:10, Suren Baghdasaryan wrote:
>>>
>>> Thanks for the patchset Rik!
>>> Previously when I looked into using per-VMA locks in
>>> access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
>>> which required mmap_lock. Your implementation avoids altogether and
>>> keeps the code much simpler than what I expected.
>>
>> But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
>> locks?
> 
> Matthew suggested using gup_fast in access_remote_vm before, and I
> looked into that. The biggest issue there is that gup_fast assumes it
> always operates on current->mm, not on the remote one. Reworking that
> is quite an undertaking.

Right, that's more tricky, IIRC the CPU from a remote MM might not get an IPI
sent to sync. (but my memory is fuzzy on that)

> Teaching GUP in general to work with per-VMA locks I think would also
> be much harder than what this patchset does and would require a GUP
> expert (which I am unfortunately not).

Well, "harder" is not really an excuse ;)

Where a folio_walk really shines at is that you can just walk a PMD entry and
process it all at once, instead of returning 512. Where it doesn't shine is that
you have to walk the complete page table again for each individual PTE.

... which is also what we do right now through get_user_page_vma_remote(), which
is rather suboptimal.

Ideally, you'd obtain multiple page ranges (with upper limit on the ranges) in
one shot, whereby each page range belongs to the same compound page, and there
is exactly one page/folio ref on a range. [we discussed that in other context
recently]

Then, you can just let GUP do the GUP work, without re-implementing it for some
special cases elsewhere. And others can benefit from the work.

So I'd really like us to find out what it would take to teach ordinary GUP (or
better, some new GUP interface) to run under the VMA lock. We can start with the
existing interface to GUP a single page to KIS.

Maybe having a new GUP interface that consumes a VMA instead of an MM could be
the first start to enable per-VMA locks?

All GUP does is walk page tables and call fault handlers. userfaultfd is nasty,
but existing page faults must also deal with that having to fallback to the MM
lock, so it sounds like a solvable problem with some churn?

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-18 17:01   ` Usama Arif
@ 2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On Tue, Jun 16, 2026 at 03:03:00PM -0400, Rik van Riel wrote:
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8

This feels as if there was a little too much left to AI :)

> Signed-off-by: Rik van Riel <riel@surriel.com>

This needs to be separated into more patches, functions, and thoroughly reworked
to be upstreamable, unfortunately.

It's additionally quite hard to review in this form.

> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */

How? This is pretty vague.

> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})

I'm confused why you're implementing this and not just calling untagged_addr()
from untagged_addr_remote_unlocked()?

You don't comment or explain this in the commit msg afaict.

> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */

It's odd you'll comment this like this but not explain the confusing bit as to
why you can't just call untagged_addr()

> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\

I'm not sure this is required?

> +	untagged_addr(addr);				\

Weird again that x86 needs special treatment but not other arches?

> +})
> +#endif
> +

You should really make untagged_addr_remote() call
untagged_addr_remote_unlocked() after its assert, otherwise it's a really odd
inconsistency.

>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Shall we wait for, or rely on Dave's series to remove CONFIG_PER_VMA_LOCK + make
it permanently on here?

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.

I think referencing the confusing mess that is the special VMA flags is best
avoided (and anyway I think and you should just say

I think we could be clearer here like:

This is the read fast patch, writes are handled by the slow path in
__access_remote_vm() - faulting in, touching hugetlb or a remap.


> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)

In general this function is... ugly. Very ugly :)

You nest while -> while -> for and it's all open-coded.

It needs major refactoring - separate out smaller functions, improve the
comments, separate out logic that can be shared with gup etc.

> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */

This feels very overwrought. You're compressing far to omuch into one lump of
text.

> +		vm_flags = vma->vm_flags;

Please don't use the old VMA flags API for new code. And definitely don't do
some weird vm_flags we keep separate from vma->vm_flags thing here.

	vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT)

> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&

But what does !VM_READ mean exactly? Are you really checking for PROT_NONE?

So vma_is_accessible() is right here no?

> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {

This conditional is abhorrent...

I think a good rule of thumb for this kind of thing is to read it out loud in
English - 'if IO or PFN map or hugetlb or secret mem or either not VM_READ etc.'
- if you're confused by it in English then don't put it in code.

Anyway it should clearly be a separate function like:

	static bool vma_can_use_fast_path(const struct vm_area_struct *vma)
	{
		/* We cannot GUP PFN maps or I/O memory. */
		if (vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT))
			return false;
		/* Hugetlb is a special snowflake. */
		if (is_vm_hugetlb_page(vma))
			return false;
		... etc. etc. ...
		return true;
	}

Which is vastly clearer.


> +			vma_end_read(vma);
> +			break;
> +		}
> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */

Strange phrasing. I'm not even sure it's a useful comment?

> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;

Be good to avoid mystery meat varible names. 'walk'?

> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;

idx0 is a terrible name :)

All these variables tells you the function is too long.

> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {

under what circumstances would !fw.page when folio is non-NULL?

> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));

Surely we have a better way of doing this? At least needs abstracting, a random
switch in the middle of this code is horrid.

> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;

Couldn't we just keep track of this without this horrid expression?

> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);

You add massive comments for some bits, then do extremely confusing open coded
stuff here?

This needs a lot of breaking up.

> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */

This is a really confusing mass of text that is really dense and hard to
parse. Clarity is king.

In any case this should be separted out.

> +			for (copied = 0; copied < span; ) {

Very odd for loop.

	copied = 0;
	while (copied < span) {
		...
	}

Would be better. But I think reworking it so a normal for (init; cond; incr)
loop would work would be better?

> +				unsigned long foff = offset + copied;

foff :)) now I won't be childish :P

I really dislike overly compressed variable names. It's vague. File offset? I
guess you mean folio offset right?

and why did you call it plain 'offset' before but now specify folio but as 'f'?

> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);

'pidx'?

Equally unnecessarily and confusingly compressed variable name. We can live with
page_index if that's what you mean?

Also it's unceratin what the units are. It's ok to say 'bytes' and 'nr_pages' or
'page_nr' or something, and far clearer.

This should obviously be in another function anyway given indentation levels here.

> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}
> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);

Really hard to keep track of what's what here.

> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */

This is a weird comment, you should describe what access_remote_vm_fast() does
in access_remote_vm_fast(). You also don't mention the !write here which is the
thing people might wonder about.

I think the code is self-documenting anyway - try fast path - pretty clear.


> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);

Can be const.

Can't errors arise in access_remote_vm_fast()? And in general seems it'd make
more sense to return an error/bool rather and have done as output param rather
than infer stuff from done.

> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;

So usual case will be it does everything right? So you do some useless
arithmetic and then return buf - old_buf.

Should probably instead have a return value.

But in general __access_remote_vm() is horrible. I think if you add new features
it's only right you spend some commits cleaning up first.

Otherwise we heap more stuff on top of broken stuff on and on and things get
messier + messier.


> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;

Err there's other cases where you return 0 here, e.g.:

	/* Avoid triggering the temporary warning in __get_user_pages */
	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
		return 0;

So you probably need to fix those up to?

Probably better to just have a return value declared in the function.

>
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-17  6:19   ` Suren Baghdasaryan
@ 2026-06-19 12:24     ` Lorenzo Stoakes
  2026-06-19 13:46     ` Rik van Riel
  1 sibling, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 11:19:12PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
> >
> > __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> > uses get_user_pages_remote(), which faults pages in.  For the common
> > case of reading memory that is already resident -- /proc/PID/cmdline,
> > /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> > unnecessary and is badly contended on large machines.
> >
> > Add an opportunistic, read-only fast path that transfers what it can
> > without the mmap lock.  For each address it takes the per-VMA lock with
> > lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> > folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> > a present page before copying it out.  Anything non-trivial -- a not-
> > present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> > a race with a VMA writer -- falls back to the existing mmap_lock path
> > for the remainder.
>
> I don't think we should be using per-VMA locks if the read spans
> multiple VMAs. Doing that would risk a possibility of reading
> inconsistent data since we are locking one VMA at a time. While we

Yeah, very true.

Suren has expounded on the possible cases that can occur elsewhere but you can
observe strange states like that.

You can see tools/testing/selftests/proc/proc-maps-race.c for a sense of it and
https://lore.kernel.org/all/20260426062718.1238437-1-surenb@google.com/

Note that for e.g. madvise() this is exactly what we do.

> load and read VMA, its neighboring VMA can be unmapped and another one
> can be mapped in its place. So, our read spanning both VMAs will
> return inconsistent data. access_remote_vm_fast() can check if the
> entire read is contained within one VMA and if not, fall back to
> mmap_lock.

This would also vastly simplify the code. I expect most real-world cases are
like this anyway?

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
  2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
@ 2026-06-19 12:34   ` Lorenzo Stoakes
  0 siblings, 0 replies; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On Tue, Jun 16, 2026 at 03:02:59PM -0400, Rik van Riel wrote:
> folio_walk_start() asserts that the mmap lock is held.  For callers that
> only need to read a single, already-present page, the mmap lock is a
> heavy and often badly contended hammer: the VMA can instead be
> stabilized with the per-VMA lock, and the page table pages that are
> walked are kept alive by RCU page-table freeing
> (CONFIG_MMU_GATHER_RCU_TABLE_FREE).

See below, I don't think this is correct?

>
> Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
> per-VMA lock instead of the mmap lock, requires RCU-freed page tables,
> and refuses hugetlb VMAs (PMD sharing cannot be walked safely this way).

This is mostly superfluous. You can just say you added the flag to use a VMA
flag. You put in parens the key thing about hugetlb, I think you should break
that out.

> Everything else folio_walk_start() relies on -- the page table locks,
> pmdp_get_lockless() and pte_offset_map_lock() -- is already safe without

is -> are.

> the mmap lock, mirroring the per-VMA lock page fault path.

I'm not sure I understand why you have to have RCU freed page tables but then
say that you didn't need it here? Strange to reference arbitrary functions from
folio_walk_start() too.

>
> No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/pagewalk.h |  5 +++++
>  mm/pagewalk.c            | 18 ++++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index b41d7265c01b..84dd0d68f747 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -150,6 +150,11 @@ typedef int __bitwise folio_walk_flags_t;
>
>  /* Walk shared zeropages (small + huge) as well. */
>  #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
> +/*
> + * The caller holds the per-VMA lock instead of the mmap lock. Only valid with
> + * RCU-freed page tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
> + */

Hang on how could we be freeing higher level page tables of a VMA that's still
locked?

A VMA lock stabilises page tables for traversal, so why do you require
CONFIG_MMU_GATHER_RCU_TABLE_FREE here?

What will free the higher-level page tables?

Ref: https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-table


> +#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
>
>  enum folio_walk_level {
>  	FW_LEVEL_PTE,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..c85364b73e12 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -890,7 +890,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>   * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>   * not correspond to the first physical entry of a logical hugetlb entry.
>   *
> - * The mmap lock must be held in read mode.
> + * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
> + * passed, the VMA's per-VMA lock must be held (only supported with RCU-freed
> + * page tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
>   *
>   * Return: folio pointer on success, otherwise NULL.
>   */
> @@ -908,7 +910,19 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  	pgd_t *pgdp;
>  	p4d_t *p4dp;
>
> -	mmap_assert_locked(vma->vm_mm);
> +	if (flags & FW_VMA_LOCKED) {
> +		/*
> +		 * Lockless walk: the per-VMA lock keeps the VMA stable, and
> +		 * RCU-freed page tables keep the walked page table pages alive
> +		 * across the lockless upper-level walk and pte_offset_map_lock().

Err, but we take locks as normal on the walk?

> +		 * Hugetlb (PMD sharing) is not supported on this path.

I don't get the explanation above and then you just write a line that says what
your assert is doing with zero explanation here.

You should explain why hugetlb isn't supported.

> +		 */
> +		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
> +		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
> +		vma_assert_locked(vma);
> +	} else {
> +		mmap_assert_locked(vma->vm_mm);
> +	}
>  	vma_pgtable_walk_begin(vma);
>
>  	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
> --
> 2.53.0-Meta
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-18 20:37       ` David Hildenbrand (Arm)
@ 2026-06-19 12:43         ` Lorenzo Stoakes
  2026-06-19 13:04           ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 24+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Suren Baghdasaryan, Rik van Riel, linux-kernel, x86, linux-mm,
	Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov,
	Dave Hansen, Andrew Morton, Liam R. Howlett, Vlastimil Babka

On Thu, Jun 18, 2026 at 10:37:03PM +0200, David Hildenbrand (Arm) wrote:
> On 6/17/26 15:33, Suren Baghdasaryan wrote:
> > On Wed, Jun 17, 2026 at 2:42 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >> On 6/17/26 03:10, Suren Baghdasaryan wrote:
> >>>
> >>> Thanks for the patchset Rik!
> >>> Previously when I looked into using per-VMA locks in
> >>> access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
> >>> which required mmap_lock. Your implementation avoids altogether and
> >>> keeps the code much simpler than what I expected.
> >>
> >> But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
> >> locks?
> >
> > Matthew suggested using gup_fast in access_remote_vm before, and I
> > looked into that. The biggest issue there is that gup_fast assumes it
> > always operates on current->mm, not on the remote one. Reworking that
> > is quite an undertaking.
>
> Right, that's more tricky, IIRC the CPU from a remote MM might not get an IPI
> sent to sync. (but my memory is fuzzy on that)
>
> > Teaching GUP in general to work with per-VMA locks I think would also
> > be much harder than what this patchset does and would require a GUP
> > expert (which I am unfortunately not).
>
> Well, "harder" is not really an excuse ;)

Agreed.

We shouldn't just tack on new features like this without improving what already
exists.

Really a series that adds a new feature should have patches that first clean
things up to lay the foundations.

I think we've been a bit too permissive of 'just add more code' series in mm in
general when we know the ground around it is already shakey.

We all need to do our part in improving the codebase.

>
> Where a folio_walk really shines at is that you can just walk a PMD entry and
> process it all at once, instead of returning 512. Where it doesn't shine is that
> you have to walk the complete page table again for each individual PTE.
>
> ... which is also what we do right now through get_user_page_vma_remote(), which
> is rather suboptimal.
>
> Ideally, you'd obtain multiple page ranges (with upper limit on the ranges) in
> one shot, whereby each page range belongs to the same compound page, and there
> is exactly one page/folio ref on a range. [we discussed that in other context
> recently]
>
> Then, you can just let GUP do the GUP work, without re-implementing it for some
> special cases elsewhere. And others can benefit from the work.
>
>
> So I'd really like us to find out what it would take to teach ordinary GUP (or
> better, some new GUP interface) to run under the VMA lock. We can start with the
> existing interface to GUP a single page to KIS.
>
> Maybe having a new GUP interface that consumes a VMA instead of an MM could be
> the first start to enable per-VMA locks?
>
> All GUP does is walk page tables and call fault handlers. userfaultfd is nasty,
> but existing page faults must also deal with that having to fallback to the MM
> lock, so it sounds like a solvable problem with some churn?

Well I think a critical problem here, as pointed out by Suren, is that holding a
VMA lock means that the VMAs around you can change and in ways that are quite
problematic.

E.g. The moment you drop the VMA lock that VMA might get freed and then merged
with something else, and the next VMA you consume is the same one you just
partially walked, for instance.

Now perhaps you could reason your way around this, but I'm pretty sure there are
cases where you might actually miss VMAs due to races (Suren knows best).

And also without an mmap lock people can unmap and map new VMAs in the range as
you go through which might cause weirdness as well.

Really, unless you are dealing with a single VMA in the range, I suspect GUP
needs to stabilise that whole range.

If we could find a way to have GUP fast-path the single VMA case sensibly, then
that's probably workable?

And I agree special-casing only one place but not others sucks.

Perhaps we could find a way to get this improvement without it being quite so
'tacked on' but without needing significant rework of GUP, but in either case I
broadly agree we need to improve the codebase as part of the changes.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-19 12:43         ` Lorenzo Stoakes
@ 2026-06-19 13:04           ` David Hildenbrand (Arm)
  2026-06-19 14:19             ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-19 13:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Rik van Riel, linux-kernel, x86, linux-mm,
	Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov,
	Dave Hansen, Andrew Morton, Liam R. Howlett, Vlastimil Babka

>> All GUP does is walk page tables and call fault handlers. userfaultfd is nasty,
>> but existing page faults must also deal with that having to fallback to the MM
>> lock, so it sounds like a solvable problem with some churn?
> 
> Well I think a critical problem here, as pointed out by Suren, is that holding a
> VMA lock means that the VMAs around you can change and in ways that are quite
> problematic.
> 
> E.g. The moment you drop the VMA lock that VMA might get freed and then merged
> with something else, and the next VMA you consume is the same one you just
> partially walked, for instance.
> 
> Now perhaps you could reason your way around this, but I'm pretty sure there are
> cases where you might actually miss VMAs due to races (Suren knows best).
> 
> And also without an mmap lock people can unmap and map new VMAs in the range as
> you go through which might cause weirdness as well.
> 
> Really, unless you are dealing with a single VMA in the range, I suspect GUP
> needs to stabilise that whole range.

Well, depends, really. It's not like a all GUP operation that target many pages
runs exclusively under the mmap lock that would prevent any VMA changes.

With userfaultfd, for example, we drop the lock in between, to lookup the VMA
again later. There are various paths where __get_user_pages_locked() is
instructed to grab the mmap lock itself, to even temporarily drop it if the mmap
lock was dropped.

gup_fast_fallback() grabs some pages to then take the mmap lock. And continue
from the next address.

So it really depends on the use case. I would actually be surprised if there a
lot of use cases that strictly must block concurrent mremap operations etc.

The important part is that you process each virtual page address requested
exactly once. If the VMA was merged in the meantime, you continue from that
address in the previously-processed VMA.

Some use cases might indeed want to stabilize the whole range. But I wouldn't
expect them to opt-in to using per-VMA locks.

Just like with any other page table walker, we cannot just convert all in one
shot to use per-VMA locks.

> 
> If we could find a way to have GUP fast-path the single VMA case sensibly, then
> that's probably workable?

Right, that's what I said: start with a single-VMA interface that supports
getting called with the per-vma lock or the mmap lock.

If we have to fallback to the mmap lock (userfaultd? indicated back by the
caller), handle it in the caller of that interface for now.

> 
> And I agree special-casing only one place but not others sucks.

Yeah, we're not doing that unless inevitable.

> 
> Perhaps we could find a way to get this improvement without it being quite so
> 'tacked on' but without needing significant rework of GUP, but in either case I
> broadly agree we need to improve the codebase as part of the changes.

We shouldn't fear extending GUP in a reasonable way that makes everyone out
there profit ion the long run :)

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-19 12:24     ` Lorenzo Stoakes
@ 2026-06-19 13:46     ` Rik van Riel
  2026-06-19 14:03       ` Suren Baghdasaryan
  1 sibling, 1 reply; 24+ messages in thread
From: Rik van Riel @ 2026-06-19 13:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
> 
> I don't think we should be using per-VMA locks if the read spans
> multiple VMAs. Doing that would risk a possibility of reading
> inconsistent data since we are locking one VMA at a time. While we
> load and read VMA, its neighboring VMA can be unmapped and another
> one
> can be mapped in its place. 

How is that different from userspace overwriting
data while we are reading it, and us reading half
new, and half old contents?

We already have nothing synchronizing the contents
today.

All our locking does is protect the metadata we
use to find the memory.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-19 13:46     ` Rik van Riel
@ 2026-06-19 14:03       ` Suren Baghdasaryan
  2026-06-19 14:33         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-19 14:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Fri, Jun 19, 2026 at 6:47 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
> >
> > I don't think we should be using per-VMA locks if the read spans
> > multiple VMAs. Doing that would risk a possibility of reading
> > inconsistent data since we are locking one VMA at a time. While we
> > load and read VMA, its neighboring VMA can be unmapped and another
> > one
> > can be mapped in its place.
>
> How is that different from userspace overwriting
> data while we are reading it, and us reading half
> new, and half old contents?
>
> We already have nothing synchronizing the contents
> today.
>
> All our locking does is protect the metadata we
> use to find the memory.

Current locking ensures that when reading across VMAs, the structure
of the VMA tree stays consistent during that read. I think that
*structural consistency* of the area we are reading is important to
preserve. Someone can be overriding the data while we are reading it,
but that's *content consistency* and yes, we don't have protection
against that both before and after your change.

>
> --
> All Rights Reversed.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-19 13:04           ` David Hildenbrand (Arm)
@ 2026-06-19 14:19             ` Suren Baghdasaryan
  2026-06-19 14:27               ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-19 14:19 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Rik van Riel, linux-kernel, x86, linux-mm,
	Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov,
	Dave Hansen, Andrew Morton, Liam R. Howlett, Vlastimil Babka

On Fri, Jun 19, 2026 at 6:05 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> >> All GUP does is walk page tables and call fault handlers. userfaultfd is nasty,
> >> but existing page faults must also deal with that having to fallback to the MM
> >> lock, so it sounds like a solvable problem with some churn?
> >
> > Well I think a critical problem here, as pointed out by Suren, is that holding a
> > VMA lock means that the VMAs around you can change and in ways that are quite
> > problematic.
> >
> > E.g. The moment you drop the VMA lock that VMA might get freed and then merged
> > with something else, and the next VMA you consume is the same one you just
> > partially walked, for instance.
> >
> > Now perhaps you could reason your way around this, but I'm pretty sure there are
> > cases where you might actually miss VMAs due to races (Suren knows best).
> >
> > And also without an mmap lock people can unmap and map new VMAs in the range as
> > you go through which might cause weirdness as well.
> >
> > Really, unless you are dealing with a single VMA in the range, I suspect GUP
> > needs to stabilise that whole range.
>
> Well, depends, really. It's not like a all GUP operation that target many pages
> runs exclusively under the mmap lock that would prevent any VMA changes.
>
> With userfaultfd, for example, we drop the lock in between, to lookup the VMA
> again later. There are various paths where __get_user_pages_locked() is
> instructed to grab the mmap lock itself, to even temporarily drop it if the mmap
> lock was dropped.
>
> gup_fast_fallback() grabs some pages to then take the mmap lock. And continue
> from the next address.
>
>
> So it really depends on the use case. I would actually be surprised if there a
> lot of use cases that strictly must block concurrent mremap operations etc.
>
> The important part is that you process each virtual page address requested
> exactly once. If the VMA was merged in the meantime, you continue from that
> address in the previously-processed VMA.
>
>
> Some use cases might indeed want to stabilize the whole range. But I wouldn't
> expect them to opt-in to using per-VMA locks.
>
> Just like with any other page table walker, we cannot just convert all in one
> shot to use per-VMA locks.
>
> >
> > If we could find a way to have GUP fast-path the single VMA case sensibly, then
> > that's probably workable?
>
> Right, that's what I said: start with a single-VMA interface that supports
> getting called with the per-vma lock or the mmap lock.
>
> If we have to fallback to the mmap lock (userfaultd? indicated back by the
> caller), handle it in the caller of that interface for now.
>
> >
> > And I agree special-casing only one place but not others sucks.
>
> Yeah, we're not doing that unless inevitable.
>
> >
> > Perhaps we could find a way to get this improvement without it being quite so
> > 'tacked on' but without needing significant rework of GUP, but in either case I
> > broadly agree we need to improve the codebase as part of the changes.
>
> We shouldn't fear extending GUP in a reasonable way that makes everyone out
> there profit ion the long run :)

I do not disagree with the general premise of making existing
mechanisms work better rather than implementing parallel ones. I'm
just pointing out my findings so far when I moved in that direction
and I'm happy Rik posted an alternative simple way around large
refactoring and started this discussion. We should definitely try
reworking GUP to cause less contention. I just don't have enough time
ATM to drive that, but would be happy to help with the VMA-locking
parts.

>
> --
> Cheers,
>
> David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-19 14:19             ` Suren Baghdasaryan
@ 2026-06-19 14:27               ` David Hildenbrand (Arm)
  2026-06-19 15:07                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-19 14:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, Rik van Riel, linux-kernel, x86, linux-mm,
	Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov,
	Dave Hansen, Andrew Morton, Liam R. Howlett, Vlastimil Babka

On 6/19/26 16:19, Suren Baghdasaryan wrote:
> On Fri, Jun 19, 2026 at 6:05 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>>>
>>> Well I think a critical problem here, as pointed out by Suren, is that holding a
>>> VMA lock means that the VMAs around you can change and in ways that are quite
>>> problematic.
>>>
>>> E.g. The moment you drop the VMA lock that VMA might get freed and then merged
>>> with something else, and the next VMA you consume is the same one you just
>>> partially walked, for instance.
>>>
>>> Now perhaps you could reason your way around this, but I'm pretty sure there are
>>> cases where you might actually miss VMAs due to races (Suren knows best).
>>>
>>> And also without an mmap lock people can unmap and map new VMAs in the range as
>>> you go through which might cause weirdness as well.
>>>
>>> Really, unless you are dealing with a single VMA in the range, I suspect GUP
>>> needs to stabilise that whole range.
>>
>> Well, depends, really. It's not like a all GUP operation that target many pages
>> runs exclusively under the mmap lock that would prevent any VMA changes.
>>
>> With userfaultfd, for example, we drop the lock in between, to lookup the VMA
>> again later. There are various paths where __get_user_pages_locked() is
>> instructed to grab the mmap lock itself, to even temporarily drop it if the mmap
>> lock was dropped.
>>
>> gup_fast_fallback() grabs some pages to then take the mmap lock. And continue
>> from the next address.
>>
>>
>> So it really depends on the use case. I would actually be surprised if there a
>> lot of use cases that strictly must block concurrent mremap operations etc.
>>
>> The important part is that you process each virtual page address requested
>> exactly once. If the VMA was merged in the meantime, you continue from that
>> address in the previously-processed VMA.
>>
>>
>> Some use cases might indeed want to stabilize the whole range. But I wouldn't
>> expect them to opt-in to using per-VMA locks.
>>
>> Just like with any other page table walker, we cannot just convert all in one
>> shot to use per-VMA locks.
>>
>>>
>>> If we could find a way to have GUP fast-path the single VMA case sensibly, then
>>> that's probably workable?
>>
>> Right, that's what I said: start with a single-VMA interface that supports
>> getting called with the per-vma lock or the mmap lock.
>>
>> If we have to fallback to the mmap lock (userfaultd? indicated back by the
>> caller), handle it in the caller of that interface for now.
>>
>>>
>>> And I agree special-casing only one place but not others sucks.
>>
>> Yeah, we're not doing that unless inevitable.
>>
>>>
>>> Perhaps we could find a way to get this improvement without it being quite so
>>> 'tacked on' but without needing significant rework of GUP, but in either case I
>>> broadly agree we need to improve the codebase as part of the changes.
>>
>> We shouldn't fear extending GUP in a reasonable way that makes everyone out
>> there profit ion the long run :)
> 
> I do not disagree with the general premise of making existing
> mechanisms work better rather than implementing parallel ones. I'm
> just pointing out my findings so far when I moved in that direction
> and I'm happy Rik posted an alternative simple way around large
> refactoring and started this discussion. We should definitely try
> reworking GUP to cause less contention. I just don't have enough time
> ATM to drive that, but would be happy to help with the VMA-locking
> parts.

If we have a VMA-lock GUP variant, I guess supporting the write part would also
be fairly easy? Not sure how performance-relevant that is, though.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-19 14:03       ` Suren Baghdasaryan
@ 2026-06-19 14:33         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-19 14:33 UTC (permalink / raw)
  To: Suren Baghdasaryan, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/19/26 16:03, Suren Baghdasaryan wrote:
> On Fri, Jun 19, 2026 at 6:47 AM Rik van Riel <riel@surriel.com> wrote:
>>
>> On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
>>>
>>> I don't think we should be using per-VMA locks if the read spans
>>> multiple VMAs. Doing that would risk a possibility of reading
>>> inconsistent data since we are locking one VMA at a time. While we
>>> load and read VMA, its neighboring VMA can be unmapped and another
>>> one
>>> can be mapped in its place.
>>
>> How is that different from userspace overwriting
>> data while we are reading it, and us reading half
>> new, and half old contents?
>>
>> We already have nothing synchronizing the contents
>> today.
>>
>> All our locking does is protect the metadata we
>> use to find the memory.
> 
> Current locking ensures that when reading across VMAs, the structure
> of the VMA tree stays consistent during that read. I think that
> *structural consistency* of the area we are reading is important to
> preserve.

I tend to agree: if more than one VMA is involved, it's best to fallback to the
mmap lock for now.

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-19 14:27               ` David Hildenbrand (Arm)
@ 2026-06-19 15:07                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 24+ messages in thread
From: Suren Baghdasaryan @ 2026-06-19 15:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Rik van Riel, linux-kernel, x86, linux-mm,
	Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov,
	Dave Hansen, Andrew Morton, Liam R. Howlett, Vlastimil Babka

On Fri, Jun 19, 2026 at 7:27 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/19/26 16:19, Suren Baghdasaryan wrote:
> > On Fri, Jun 19, 2026 at 6:05 AM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >>>
> >>> Well I think a critical problem here, as pointed out by Suren, is that holding a
> >>> VMA lock means that the VMAs around you can change and in ways that are quite
> >>> problematic.
> >>>
> >>> E.g. The moment you drop the VMA lock that VMA might get freed and then merged
> >>> with something else, and the next VMA you consume is the same one you just
> >>> partially walked, for instance.
> >>>
> >>> Now perhaps you could reason your way around this, but I'm pretty sure there are
> >>> cases where you might actually miss VMAs due to races (Suren knows best).
> >>>
> >>> And also without an mmap lock people can unmap and map new VMAs in the range as
> >>> you go through which might cause weirdness as well.
> >>>
> >>> Really, unless you are dealing with a single VMA in the range, I suspect GUP
> >>> needs to stabilise that whole range.
> >>
> >> Well, depends, really. It's not like a all GUP operation that target many pages
> >> runs exclusively under the mmap lock that would prevent any VMA changes.
> >>
> >> With userfaultfd, for example, we drop the lock in between, to lookup the VMA
> >> again later. There are various paths where __get_user_pages_locked() is
> >> instructed to grab the mmap lock itself, to even temporarily drop it if the mmap
> >> lock was dropped.
> >>
> >> gup_fast_fallback() grabs some pages to then take the mmap lock. And continue
> >> from the next address.
> >>
> >>
> >> So it really depends on the use case. I would actually be surprised if there a
> >> lot of use cases that strictly must block concurrent mremap operations etc.
> >>
> >> The important part is that you process each virtual page address requested
> >> exactly once. If the VMA was merged in the meantime, you continue from that
> >> address in the previously-processed VMA.
> >>
> >>
> >> Some use cases might indeed want to stabilize the whole range. But I wouldn't
> >> expect them to opt-in to using per-VMA locks.
> >>
> >> Just like with any other page table walker, we cannot just convert all in one
> >> shot to use per-VMA locks.
> >>
> >>>
> >>> If we could find a way to have GUP fast-path the single VMA case sensibly, then
> >>> that's probably workable?
> >>
> >> Right, that's what I said: start with a single-VMA interface that supports
> >> getting called with the per-vma lock or the mmap lock.
> >>
> >> If we have to fallback to the mmap lock (userfaultd? indicated back by the
> >> caller), handle it in the caller of that interface for now.
> >>
> >>>
> >>> And I agree special-casing only one place but not others sucks.
> >>
> >> Yeah, we're not doing that unless inevitable.
> >>
> >>>
> >>> Perhaps we could find a way to get this improvement without it being quite so
> >>> 'tacked on' but without needing significant rework of GUP, but in either case I
> >>> broadly agree we need to improve the codebase as part of the changes.
> >>
> >> We shouldn't fear extending GUP in a reasonable way that makes everyone out
> >> there profit ion the long run :)
> >
> > I do not disagree with the general premise of making existing
> > mechanisms work better rather than implementing parallel ones. I'm
> > just pointing out my findings so far when I moved in that direction
> > and I'm happy Rik posted an alternative simple way around large
> > refactoring and started this discussion. We should definitely try
> > reworking GUP to cause less contention. I just don't have enough time
> > ATM to drive that, but would be happy to help with the VMA-locking
> > parts.
>
> If we have a VMA-lock GUP variant, I guess supporting the write part would also
> be fairly easy?

I think so.

> Not sure how performance-relevant that is, though.

All the contention cases I saw or heard of involved reading.

>
> --
> Cheers,
>
> David


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2026-06-19 15:08 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
2026-06-18 16:40   ` Usama Arif
2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
2026-06-19 12:34   ` Lorenzo Stoakes
2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-17  6:19   ` Suren Baghdasaryan
2026-06-19 12:24     ` Lorenzo Stoakes
2026-06-19 13:46     ` Rik van Riel
2026-06-19 14:03       ` Suren Baghdasaryan
2026-06-19 14:33         ` David Hildenbrand (Arm)
2026-06-18 17:01   ` Usama Arif
2026-06-18 17:07     ` David Hildenbrand (Arm)
2026-06-18 17:22       ` Usama Arif
2026-06-19 12:20   ` Lorenzo Stoakes
2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
2026-06-17  9:42   ` David Hildenbrand (Arm)
2026-06-17 13:33     ` Suren Baghdasaryan
2026-06-18 20:37       ` David Hildenbrand (Arm)
2026-06-19 12:43         ` Lorenzo Stoakes
2026-06-19 13:04           ` David Hildenbrand (Arm)
2026-06-19 14:19             ` Suren Baghdasaryan
2026-06-19 14:27               ` David Hildenbrand (Arm)
2026-06-19 15:07                 ` Suren Baghdasaryan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.