* [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
@ 2026-06-16 19:02 Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

Sometimes processes can get stuck with the mmap_lock held for
a long time. This slows down, and can even prevent system monitoring
tools from assessing and logging the situation, because they themselves
end up getting stuck on the mmap_lock.

However, with the introduction of per-VMA locks, we can improve the
reliability of system monitoring, and generally speed up __access_remote_vm
under mmap_lock contention, by adding a fast path that does not require
the process-wide mmap_lock.

This fast path is only compiled in and used when it is safe to do so,
meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
that is not hugetlbfs, iomap, pfnmap, etc.

The code seems to work, but could still use some more cleaning up
and benchmarking.



* [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
@ 2026-06-16 19:02 ` Rik van Riel
  2026-06-18 16:40   ` Usama Arif
  2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

mm->context.untag_mask is written once, when LAM is enabled
(mm_enable_lam(), under mmap_write_lock and while the process is still
single-threaded), and is otherwise stable and never reverted.
untagged_addr_remote() reads it for a remote mm, and the new
untagged_addr_remote_unlocked() (used by the per-VMA-lock
access_remote_vm() fast path) reads it without the mmap lock.

The field is a single aligned word and cannot tear, but annotate the
reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
access explicit and keep the compiler from reloading or tearing it.

No functional change.
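
As an illustration of the annotation pattern described above, here is a minimal userspace sketch. The READ_ONCE()/WRITE_ONCE() macros below are simplified approximations for scalar types, not the kernel's definitions, and struct demo_context, demo_enable_lam() and demo_untag() are hypothetical names used only for this example. It assumes a 64-bit unsigned long and a compiler that supports __typeof__ (GCC or Clang).

```c
#include <assert.h>

/*
 * Simplified userspace approximations of READ_ONCE()/WRITE_ONCE() for
 * scalar types: the volatile access asks the compiler for exactly one
 * load or store. The kernel's real macros cover more cases.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the untag_mask field in mm->context. */
struct demo_context {
	unsigned long untag_mask;
};

/*
 * Writer side: mirrors what mm_enable_lam() stores, i.e. a mask with
 * bits 62..57 cleared (assumes a 64-bit unsigned long).
 */
static void demo_enable_lam(struct demo_context *ctx)
{
	assert(ctx);
	WRITE_ONCE(ctx->untag_mask, ~(0x3fUL << 57));
}

/* Reader side: one marked load of the mask, then strip the tag bits. */
static unsigned long demo_untag(struct demo_context *ctx, unsigned long addr)
{
	assert(ctx);
	return addr & READ_ONCE(ctx->untag_mask);
}
```

With the mask at its reset value of all ones, demo_untag() returns the address unchanged; after demo_enable_lam(), any tag in bits 62..57 is cleared.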

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu_context.h | 6 +++---
 arch/x86/include/asm/uaccess_64.h  | 2 +-
 arch/x86/kernel/process_64.c       | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e..cee710f64658 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,18 +100,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
 static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
-	mm->context.untag_mask = oldmm->context.untag_mask;
+	WRITE_ONCE(mm->context.untag_mask, READ_ONCE(oldmm->context.untag_mask));
 }
 
 #define mm_untag_mask mm_untag_mask
 static inline unsigned long mm_untag_mask(struct mm_struct *mm)
 {
-	return mm->context.untag_mask;
+	return READ_ONCE(mm->context.untag_mask);
 }
 
 static inline void mm_reset_untag_mask(struct mm_struct *mm)
 {
-	mm->context.untag_mask = -1UL;
+	WRITE_ONCE(mm->context.untag_mask, -1UL);
 }
 
 #define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..4a52497ba6a1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -43,7 +43,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & (mm)->context.untag_mask;
+	return addr & READ_ONCE((mm)->context.untag_mask);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d44afbe005bb..55096136de53 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -814,7 +814,7 @@ static void enable_lam_func(void *__mm)
 static void mm_enable_lam(struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
-	mm->context.untag_mask =  ~GENMASK(62, 57);
+	WRITE_ONCE(mm->context.untag_mask, ~GENMASK(62, 57));
 
 	/*
 	 * Even though the process must still be single-threaded at this
-- 
2.53.0-Meta



* [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
@ 2026-06-16 19:02 ` Rik van Riel
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
  3 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2026-06-16 19:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

folio_walk_start() asserts that the mmap lock is held.  For callers that
only need to read a single, already-present page, the mmap lock is a
heavy and often badly contended hammer: the VMA can instead be
stabilized with the per-VMA lock, and the page table pages that are
walked are kept alive by RCU page-table freeing
(CONFIG_MMU_GATHER_RCU_TABLE_FREE).

Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
per-VMA lock instead of the mmap lock, requires RCU-freed page tables,
and refuses hugetlb VMAs (PMD sharing cannot be walked safely this way).
Everything else folio_walk_start() relies on -- the page table locks,
pmdp_get_lockless() and pte_offset_map_lock() -- is already safe without
the mmap lock, mirroring the per-VMA lock page fault path.

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/pagewalk.h |  5 +++++
 mm/pagewalk.c            | 18 ++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..84dd0d68f747 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,11 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock. Only valid with
+ * RCU-freed page tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..c85364b73e12 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held (only supported with RCU-freed
+ * page tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +910,19 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk: the per-VMA lock keeps the VMA stable, and
+		 * RCU-freed page tables keep the walked page table pages alive
+		 * across the lockless upper-level walk and pte_offset_map_lock().
+		 * Hugetlb (PMD sharing) is not supported on this path.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta



* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
  2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
@ 2026-06-16 19:03 ` Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-18 17:01   ` Usama Arif
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
  3 siblings, 2 replies; 13+ messages in thread
From: Rik van Riel @ 2026-06-16 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common
case of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path that transfers what it can
without the mmap lock.  For each address it takes the per-VMA lock with
lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
a present page before copying it out.  Anything non-trivial -- a not-
present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
a race with a VMA writer -- falls back to the existing mmap_lock path
for the remainder.

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.
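
The "fast path first, slow path for the remainder" accounting described above can be modelled in userspace roughly as follows. This is only a sketch under simplifying assumptions: struct demo_mm is a hypothetical toy address space in which the first "resident" bytes are readable by the fast path, and demo_read_fast(), demo_read_slow() and demo_access_remote() are made-up names, not kernel functions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy address space: a byte array with a resident prefix. */
struct demo_mm {
	const char *mem;   /* backing bytes */
	int size;          /* total number of bytes */
	int resident;      /* bytes [0, resident) count as resident */
	int slow_calls;    /* how many times the slow path ran */
};

/* Fast path: copy resident bytes only, stop at the first non-resident one. */
static int demo_read_fast(struct demo_mm *mm, int addr, char *buf, int len)
{
	int n;

	assert(addr >= 0 && len >= 0);
	if (addr >= mm->resident)
		return 0;
	n = mm->resident - addr;
	if (n > len)
		n = len;
	memcpy(buf, mm->mem + addr, n);
	return n;
}

/* Slow path: stands in for the mmap_lock + get_user_pages_remote() route. */
static int demo_read_slow(struct demo_mm *mm, int addr, char *buf, int len)
{
	int n = mm->size - addr;

	mm->slow_calls++;
	if (n <= 0)
		return 0;
	if (n > len)
		n = len;
	memcpy(buf, mm->mem + addr, n);
	return n;
}

/* Try the fast path, then hand whatever is left to the slow path. */
static int demo_access_remote(struct demo_mm *mm, int addr, char *buf, int len)
{
	int done = demo_read_fast(mm, addr, buf, len);

	if (done == len)
		return done;
	return done + demo_read_slow(mm, addr + done, buf + done, len - done);
}
```

A read that lies entirely in the resident prefix never reaches demo_read_slow(); a read that crosses the boundary is split, with the byte count from both halves added together, which is the same bookkeeping the patch does with addr, buf and len.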

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/uaccess_64.h |  12 +++
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
 3 files changed, 188 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..c6fac900a747 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
 })
 
+/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE((mm)->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 #endif
 
 #define valid_user_address(x) \
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..0b23b82eaa18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the heavily
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
+ * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Anything that would require faulting a page in, touching a hugetlb or
+ * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
+ * path in __access_remote_vm().  Only reads are handled here.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	while (len) {
+		struct vm_area_struct *vma;
+		vm_flags_t vm_flags;
+
+		vma = lock_vma_under_rcu(mm, addr);
+		if (!vma)
+			break;
+
+		/*
+		 * Mirror the read-side permission checks of check_vma_flags(),
+		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
+		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
+		 * per VMA; anything not positively allowed falls back to the
+		 * slow path, which re-validates everything.
+		 */
+		vm_flags = vma->vm_flags;
+		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
+		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
+		    (!(vm_flags & VM_READ) &&
+		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
+			vma_end_read(vma);
+			break;
+		}
+
+		/*
+		 * Copy as much of this VMA as we can without re-acquiring the
+		 * per-VMA lock; re-lock only when @addr leaves the VMA.
+		 */
+		while (len && addr < vma->vm_end) {
+			struct folio_walk fw;
+			struct folio *folio;
+			struct page *page;
+			unsigned long entry_size, entry_left, folio_left, span;
+			unsigned long copied, idx0;
+			int offset;
+
+			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+			if (!folio) {
+				vma_end_read(vma);
+				goto out;
+			}
+			page = fw.page;
+			if (!page) {
+				folio_walk_end(&fw, vma);
+				vma_end_read(vma);
+				goto out;
+			}
+			/* Pin the folio so it stays valid after the PTL is dropped. */
+			folio_get(folio);
+			folio_walk_end(&fw, vma);
+
+			/*
+			 * folio_walk_start() validated exactly one mapping entry,
+			 * which covers a contiguous, present run of this folio:
+			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
+			 * for a pud leaf.  Copy up to the end of that entry,
+			 * bounded by the folio, the VMA and len, so a huge mapping
+			 * is handled in one walk instead of per page.
+			 */
+			offset = offset_in_page(addr);
+			switch (fw.level) {
+			case FW_LEVEL_PUD:
+				entry_size = PUD_SIZE;
+				break;
+			case FW_LEVEL_PMD:
+				entry_size = PMD_SIZE;
+				break;
+			default:
+				entry_size = PAGE_SIZE;
+				break;
+			}
+			entry_left = entry_size - (addr & (entry_size - 1));
+			idx0 = folio_page_idx(folio, page);
+			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
+				     offset;
+			span = min3((unsigned long)len, entry_left, folio_left);
+			span = min(span, vma->vm_end - addr);
+
+			/*
+			 * Copy the span page-by-page: kmap_local_folio() maps one
+			 * page on HIGHMEM and copy_from_user_page() flushes per
+			 * page on aliasing caches, but the page tables are not
+			 * re-walked.  The span borrows the single folio reference
+			 * taken above, so each mapping is dropped with
+			 * kunmap_local() (not folio_release_kmap(), which would
+			 * also drop a folio reference per page).
+			 */
+			for (copied = 0; copied < span; ) {
+				unsigned long foff = offset + copied;
+				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
+				int poff = foff & ~PAGE_MASK;
+				int chunk = min_t(unsigned long, span - copied,
+						  PAGE_SIZE - poff);
+				void *maddr = kmap_local_folio(folio,
+						pidx << PAGE_SHIFT);
+
+				copy_from_user_page(vma, folio_page(folio, pidx),
+						    addr + copied, buf + copied,
+						    maddr + poff, chunk);
+				kunmap_local(maddr);
+				copied += chunk;
+			}
+
+			folio_put(folio);
+			len -= span;
+			buf += span;
+			addr += span;
+		}
+		vma_end_read(vma);
+	}
+out:
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
-- 
2.53.0-Meta



* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
                   ` (2 preceding siblings ...)
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-17  1:10 ` Suren Baghdasaryan
  2026-06-17  9:42   ` David Hildenbrand (Arm)
  3 siblings, 1 reply; 13+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17  1:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> Sometimes processes can get stuck with the mmap_lock held for
> a long time. This slows down, and can even prevent system monitoring
> tools from assessing and logging the situation, because they themselves
> end up getting stuck on the mmap_lock.
>
> However, with the introduction of per-VMA locks, we can improve the
> reliability of system monitoring, and generally speed up __access_remote_vm
> under mmap_lock contention, by adding a fast path that does not require
> the process-wide mmap_lock.
>
> This fast path is only compiled in and used when it is safe to do so,
> meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
> that is not hugetlbfs, iomap, pfnmap, etc.
>
> The code seems to work, but could still use some more cleaning up
> and benchmarking.

Thanks for the patchset, Rik!
Previously when I looked into using per-VMA locks in
access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
which required mmap_lock. Your implementation avoids it altogether and
keeps the code much simpler than what I expected. I very much support
this approach and will start reviewing in more detail.
One question: CONFIG_MMU_GATHER_RCU_TABLE_FREE still has this comment
about "Semi RCU freeing of the page directories." [1]. Does that
mechanism being "semi RCU safe" pose an issue here? I'll need to
refresh my memory on why it's only semi-RCU safe but you might have
the answer ready.

[1] https://elixir.bootlin.com/linux/v7.1/source/mm/mmu_gather.c#L236

Thanks,
Suren.

>
>



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-18 17:01   ` Usama Arif
  1 sibling, 0 replies; 13+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.

I don't think we should be using per-VMA locks if the read spans
multiple VMAs. Doing that would risk reading inconsistent data, since
we are locking one VMA at a time. While we load and read one VMA, its
neighboring VMA can be unmapped and another one can be mapped in its
place. So, a read spanning both VMAs will return inconsistent data.
access_remote_vm_fast() can check if the entire read is contained
within one VMA and if not, fall back to mmap_lock.
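
The containment check suggested above might look roughly like the following userspace sketch. struct demo_vma and demo_read_within_vma() are hypothetical names for illustration; only the vm_start/vm_end semantics (inclusive start, exclusive end) are taken from the kernel's VMA.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the bounds of a struct vm_area_struct. */
struct demo_vma {
	unsigned long vm_start;   /* first address in the VMA (inclusive) */
	unsigned long vm_end;     /* first address past the VMA (exclusive) */
};

/*
 * Return true if the read [addr, addr + len) lies entirely inside @vma.
 * The length is compared against the room left in the VMA, rather than
 * computing addr + len, so that a huge len cannot wrap around.
 */
static bool demo_read_within_vma(const struct demo_vma *vma,
				 unsigned long addr, unsigned long len)
{
	assert(vma->vm_start <= vma->vm_end);

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return false;
	return len <= vma->vm_end - addr;
}
```

In this model, a fast path would be attempted only when the check returns true; any read that starts outside the VMA or runs past vm_end would go straight to the mmap_lock path.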

>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>         (__force __typeof__(addr))__untagged_addr_remote(mm, __addr);   \
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +                                                           unsigned long addr)
> +{
> +       return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)        ({                      \
> +       unsigned long __addr = (__force unsigned long)(addr);           \
> +       (__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)        ({      \
> +       (void)(mm);                                     \
> +       untagged_addr(addr);                            \
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Just FYI, Dave Hansen is working on a patchset [1] that removes
CONFIG_PER_VMA_LOCK and makes per-VMA locks always available. This
would simplify this patchset too.

[1] https://lore.kernel.org/all/20260610230409.A44D29FA@davehans-spike.ostc.intel.com/

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       void *old_buf = buf;
> +
> +       addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +       while (len) {
> +               struct vm_area_struct *vma;
> +               vm_flags_t vm_flags;
> +
> +               vma = lock_vma_under_rcu(mm, addr);
> +               if (!vma)
> +                       break;
> +
> +               /*
> +                * Mirror the read-side permission checks of check_vma_flags(),
> +                * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +                * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +                * per VMA; anything not positively allowed falls back to the
> +                * slow path, which re-validates everything.
> +                */
> +               vm_flags = vma->vm_flags;
> +               if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +                   is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +                   (!(vm_flags & VM_READ) &&
> +                    (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +                       vma_end_read(vma);
> +                       break;
> +               }
> +
> +               /*
> +                * Copy as much of this VMA as we can without re-acquiring the
> +                * per-VMA lock; re-lock only when @addr leaves the VMA.
> +                */
> +               while (len && addr < vma->vm_end) {
> +                       struct folio_walk fw;
> +                       struct folio *folio;
> +                       struct page *page;
> +                       unsigned long entry_size, entry_left, folio_left, span;
> +                       unsigned long copied, idx0;
> +                       int offset;
> +
> +                       folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +                       if (!folio) {
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       page = fw.page;
> +                       if (!page) {
> +                               folio_walk_end(&fw, vma);
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       /* Pin the folio so it stays valid after the PTL is dropped. */
> +                       folio_get(folio);
> +                       folio_walk_end(&fw, vma);
> +
> +                       /*
> +                        * folio_walk_start() validated exactly one mapping entry,
> +                        * which covers a contiguous, present run of this folio:
> +                        * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +                        * for a pud leaf.  Copy up to the end of that entry,
> +                        * bounded by the folio, the VMA and len, so a huge mapping
> +                        * is handled in one walk instead of per page.
> +                        */
> +                       offset = offset_in_page(addr);
> +                       switch (fw.level) {
> +                       case FW_LEVEL_PUD:
> +                               entry_size = PUD_SIZE;
> +                               break;
> +                       case FW_LEVEL_PMD:
> +                               entry_size = PMD_SIZE;
> +                               break;
> +                       default:
> +                               entry_size = PAGE_SIZE;
> +                               break;
> +                       }
> +                       entry_left = entry_size - (addr & (entry_size - 1));
> +                       idx0 = folio_page_idx(folio, page);
> +                       folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +                                    offset;
> +                       span = min3((unsigned long)len, entry_left, folio_left);
> +                       span = min(span, vma->vm_end - addr);
> +
> +                       /*
> +                        * Copy the span page-by-page: kmap_local_folio() maps one
> +                        * page on HIGHMEM and copy_from_user_page() flushes per
> +                        * page on aliasing caches, but the page tables are not
> +                        * re-walked.  The span borrows the single folio reference
> +                        * taken above, so each mapping is dropped with
> +                        * kunmap_local() (not folio_release_kmap(), which would
> +                        * also drop a folio reference per page).
> +                        */
> +                       for (copied = 0; copied < span; ) {
> +                               unsigned long foff = offset + copied;
> +                               unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +                               int poff = foff & ~PAGE_MASK;
> +                               int chunk = min_t(unsigned long, span - copied,
> +                                                 PAGE_SIZE - poff);
> +                               void *maddr = kmap_local_folio(folio,
> +                                               pidx << PAGE_SHIFT);
> +
> +                               copy_from_user_page(vma, folio_page(folio, pidx),
> +                                                   addr + copied, buf + copied,
> +                                                   maddr + poff, chunk);
> +                               kunmap_local(maddr);
> +                               copied += chunk;
> +                       }
> +
> +                       folio_put(folio);
> +                       len -= span;
> +                       buf += span;
> +                       addr += span;
> +               }
> +               vma_end_read(vma);
> +       }
> +out:
> +       return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>         void *old_buf = buf;
>         int write = gup_flags & FOLL_WRITE;
>
> +       /*
> +        * Try the lockless fast path for reads first; it transfers what it can
> +        * from resident memory without taking mmap_lock, and leaves the
> +        * remainder (if any) to the slow path below.
> +        */
> +       if (!write) {
> +               int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +               addr += done;
> +               buf += done;
> +               len -= done;
> +               if (!len)
> +                       return buf - old_buf;
> +       }
> +
>         if (mmap_read_lock_killable(mm))
> -               return 0;
> +               return buf - old_buf;
>
>         /* Untag the address before looking up the VMA */
>         addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>
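
The span computation in the quoted hunk can be checked outside the kernel
with a small userspace model. This is only a sketch: copy_span() and min_ul()
are hypothetical names, and PAGE_SHIFT and PMD_SIZE are assumed to be the
common x86-64 values (4 KiB pages, 2 MiB PMD leaves), not taken from the patch.

```c
#include <assert.h>

/* Assumed x86-64 values, for illustration only. */
#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_SIZE   (1UL << 21)

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the patch's span arithmetic: how many bytes can be
 * copied after one page table walk, bounded by the request length, the
 * mapping entry, the folio and the end of the VMA.
 *
 * entry_size:  PAGE_SIZE, PMD_SIZE or PUD_SIZE, depending on the walk level
 * nr_pages:    number of pages in the folio
 * idx0:        index within the folio of the page that maps addr
 */
static unsigned long copy_span(unsigned long addr, unsigned long len,
			       unsigned long entry_size,
			       unsigned long nr_pages, unsigned long idx0,
			       unsigned long vm_end)
{
	unsigned long offset = addr & (PAGE_SIZE - 1);
	unsigned long entry_left = entry_size - (addr & (entry_size - 1));
	unsigned long folio_left = ((nr_pages - idx0) << PAGE_SHIFT) - offset;
	unsigned long span = min_ul(min_ul(len, entry_left), folio_left);

	return min_ul(span, vm_end - addr);
}
```

For a pte-mapped single-page folio the span never crosses the page, while a
PMD leaf lets one walk cover the rest of the 2 MiB entry, still clamped by
len and vma->vm_end.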


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
@ 2026-06-17  9:42   ` David Hildenbrand (Arm)
  2026-06-17 13:33     ` Suren Baghdasaryan
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-17  9:42 UTC (permalink / raw)
  To: Suren Baghdasaryan, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/17/26 03:10, Suren Baghdasaryan wrote:
> On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>>
>> Sometimes processes can get stuck with the mmap_lock held for
>> a long time. This slows down, and can even prevent system monitoring
>> tools from assessing and logging the situation, because they themselves
>> end up getting stuck on the mmap_lock.
>>
>> However, with the introduction of per-VMA locks, we can improve the
>> reliability of system monitoring, and generally speed up __access_remote_vm
>> under mmap_lock contention, by adding a fast path that does not require
>> the process-wide mmap_lock.
>>
>> This fast path is only compiled in and used when it is safe to do so:
>> the kernel has per-VMA locks and RCU page table freeing, and the VMA
>> is not hugetlbfs, iomap, pfnmap, etc.
>>
>> The code seems to work, but could still use some more cleaning up
>> and benchmarking.
> 
> Thanks for the patchset Rik!
> Previously when I looked into using per-VMA locks in
> access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
> which required mmap_lock. Your implementation avoids it altogether and
> keeps the code much simpler than what I expected.

But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
locks?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17  9:42   ` David Hildenbrand (Arm)
@ 2026-06-17 13:33     ` Suren Baghdasaryan
  2026-06-18 20:37       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 13+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17 13:33 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On Wed, Jun 17, 2026 at 2:42 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/17/26 03:10, Suren Baghdasaryan wrote:
> > On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
> >>
> >> Sometimes processes can get stuck with the mmap_lock held for
> >> a long time. This slows down, and can even prevent system monitoring
> >> tools from assessing and logging the situation, because they themselves
> >> end up getting stuck on the mmap_lock.
> >>
> >> However, with the introduction of per-VMA locks, we can improve the
> >> reliability of system monitoring, and generally speed up __access_remote_vm
> >> under mmap_lock contention, by adding a fast path that does not require
> >> the process-wide mmap_lock.
> >>
> >> This fast path is only compiled in and used when it is safe to do so:
> >> the kernel has per-VMA locks and RCU page table freeing, and the VMA
> >> is not hugetlbfs, iomap, pfnmap, etc.
> >>
> >> The code seems to work, but could still use some more cleaning up
> >> and benchmarking.
> >
> > Thanks for the patchset Rik!
> > Previously when I looked into using per-VMA locks in
> > access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
> > which required mmap_lock. Your implementation avoids it altogether and
> > keeps the code much simpler than what I expected.
>
> But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
> locks?

Matthew suggested using gup_fast in access_remote_vm before, and I
looked into that. The biggest issue there is that gup_fast assumes it
always operates on current->mm, not on a remote mm. Reworking that
is quite an undertaking.
Teaching GUP in general to work with per-VMA locks would, I think, also
be much harder than what this patchset does, and would require a GUP
expert (which I am unfortunately not).

>
> --
> Cheers,
>
> David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
  2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
@ 2026-06-18 16:40   ` Usama Arif
  0 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2026-06-18 16:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan

On Tue, 16 Jun 2026 15:02:58 -0400 Rik van Riel <riel@surriel.com> wrote:

> mm->context.untag_mask is written once, when LAM is enabled
> (mm_enable_lam(), under mmap_write_lock and while the process is still
> single-threaded), and is otherwise stable and never reverted.
> untagged_addr_remote() reads it for a remote mm, and the new
> untagged_addr_remote_unlocked() (used by the per-VMA-lock
> access_remote_vm() fast path) reads it without the mmap lock.
> 
> The field is a single aligned word and cannot tear, but annotate the
> reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
> access explicit and keep the compiler from reloading or tearing it.
> 
> No functional change.
> 
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/mmu_context.h | 6 +++---
>  arch/x86/include/asm/uaccess_64.h  | 2 +-
>  arch/x86/kernel/process_64.c       | 2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 

Acked-by: Usama Arif <usama.arif@linux.dev>
 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
@ 2026-06-18 17:01   ` Usama Arif
  2026-06-18 17:07     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 13+ messages in thread
From: Usama Arif @ 2026-06-18 17:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan

On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:

> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
> 
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
> 
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
> 
> Only reads are handled here; writes keep using the slow path.
> 
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>  
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>  
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>  
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\
> +	untagged_addr(addr);				\
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>  
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */
> +		vm_flags = vma->vm_flags;
> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&
> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +			vma_end_read(vma);
> +			break;
> +		}

This should also do the FOLL_ANON check from check_vma_flags().

check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:

	if ((gup_flags & FOLL_ANON) && !vma_anon)
		return -EFAULT;

That flag is used by fs/proc/base.c for /proc/PID/cmdline and
/proc/PID/environ.  It was added by commit 7f7ccc2ccc2e ("proc: do not
access cmdline nor environ from file-backed areas"), which fixed
CVE-2018-1120 by making those proc files refuse file-backed argv/env
areas.
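
As a rough illustration of the combined predicate, here is a userspace model
of the fast-path eligibility test with the FOLL_ANON check added. It is a
sketch, not kernel code: struct vma_model, fast_read_allowed() and the flag
values are stand-ins invented for the example, and the boolean fields model
vma_is_anonymous(), is_vm_hugetlb_page() and vma_is_secretmem().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values for illustration; not the kernel's definitions. */
#define VM_READ    0x1UL
#define VM_MAYREAD 0x2UL
#define VM_IO      0x4UL
#define VM_PFNMAP  0x8UL
#define FOLL_FORCE 0x1U
#define FOLL_ANON  0x2U

/* Hypothetical, simplified view of the VMA properties the check looks at. */
struct vma_model {
	unsigned long vm_flags;
	bool anonymous;		/* models vma_is_anonymous() */
	bool hugetlb;		/* models is_vm_hugetlb_page() */
	bool secretmem;		/* models vma_is_secretmem() */
};

/*
 * Returns true if the lockless read may proceed.  A false result means
 * "fall back to the mmap_lock path", which re-validates and reports errors.
 */
static bool fast_read_allowed(const struct vma_model *vma,
			      unsigned int gup_flags)
{
	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
		return false;
	if (vma->hugetlb || vma->secretmem)
		return false;
	/* The check being asked for: FOLL_ANON callers only read anon VMAs. */
	if ((gup_flags & FOLL_ANON) && !vma->anonymous)
		return false;
	if (!(vma->vm_flags & VM_READ) &&
	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
		return false;
	return true;
}
```

With this shape a file-backed VMA read with FOLL_ANON is never served from
the fast path; the slow path then rejects it as it does today.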

> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */
> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;
> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;
> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {
> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));
> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;
> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);
> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */
> +			for (copied = 0; copied < span; ) {
> +				unsigned long foff = offset + copied;
> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}

The __access_remote_vm() slow path calls get_user_page_vma_remote(), which
calls get_user_pages_remote(). get_user_pages_remote() adds FOLL_TOUCH, so
when the page-table walk reaches follow_page_pte() the folio is marked
accessed.

The new resident-page fast path copies the same data without doing an
equivalent folio_mark_accessed().

That changes reclaim behaviour for pages that are read repeatedly through
access_remote_vm(), such as when polling /proc/PID/cmdline. I think the
fast path should mark the folio as accessed as well.


> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);
> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>  
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */
> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;
> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;
>  
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> -- 
> 2.53.0-Meta
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:01   ` Usama Arif
@ 2026-06-18 17:07     ` David Hildenbrand (Arm)
  2026-06-18 17:22       ` Usama Arif
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 17:07 UTC (permalink / raw)
  To: Usama Arif, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On 6/18/26 19:01, Usama Arif wrote:
> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>> uses get_user_pages_remote(), which faults pages in.  For the common
>> case of reading memory that is already resident -- /proc/PID/cmdline,
>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>> unnecessary and is badly contended on large machines.
>>
>> Add an opportunistic, read-only fast path that transfers what it can
>> without the mmap lock.  For each address it takes the per-VMA lock with
>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>> a present page before copying it out.  Anything non-trivial -- a not-
>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>> a race with a VMA writer -- falls back to the existing mmap_lock path
>> for the remainder.
>>
>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>> for the fast path; the untag mask is a stable per-mm value.
>>
>> Only reads are handled here; writes keep using the slow path.
>>
>> Assisted-by: Claude:claude-opus-4-8
>> Signed-off-by: Rik van Riel <riel@surriel.com>
>> ---
>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>  include/linux/uaccess.h           |  11 ++
>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index 4a52497ba6a1..c6fac900a747 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>  })
>>  
>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>> +							    unsigned long addr)
>> +{
>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>> +}
>> +
>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>> +})
>> +
>>  #endif
>>  
>>  #define valid_user_address(x) \
>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>> index 8a264662b242..c8c83372c9d8 100644
>> --- a/include/linux/uaccess.h
>> +++ b/include/linux/uaccess.h
>> @@ -34,6 +34,17 @@
>>  })
>>  #endif
>>  
>> +/*
>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>> + */
>> +#ifndef untagged_addr_remote_unlocked
>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>> +	(void)(mm);					\
>> +	untagged_addr(addr);				\
>> +})
>> +#endif
>> +
>>  #ifdef masked_user_access_begin
>>   #define can_do_masked_user_access() 1
>>  # ifndef masked_user_write_access_begin
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 86a973119bd4..0b23b82eaa18 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -42,6 +42,8 @@
>>  #include <linux/kernel_stat.h>
>>  #include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>> +#include <linux/secretmem.h>
>> +#include <linux/pagewalk.h>
>>  #include <linux/sched/mm.h>
>>  #include <linux/sched/numa_balancing.h>
>>  #include <linux/sched/task.h>
>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>  #endif
>>  
>> +/*
>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>> + */
>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>> +/*
>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>> + *
>> + * Memory already resident in @mm can be read without taking the heavily
>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>> + *
>> + * Anything that would require faulting a page in, touching a hugetlb or
>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>> + * path in __access_remote_vm().  Only reads are handled here.
>> + *
>> + * Returns the number of bytes transferred via the fast path.
>> + */
>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>> +				 void *buf, int len, unsigned int gup_flags)
>> +{
>> +	void *old_buf = buf;
>> +
>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>> +
>> +	while (len) {
>> +		struct vm_area_struct *vma;
>> +		vm_flags_t vm_flags;
>> +
>> +		vma = lock_vma_under_rcu(mm, addr);
>> +		if (!vma)
>> +			break;
>> +
>> +		/*
>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>> +		 * per VMA; anything not positively allowed falls back to the
>> +		 * slow path, which re-validates everything.
>> +		 */
>> +		vm_flags = vma->vm_flags;
>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>> +		    (!(vm_flags & VM_READ) &&
>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>> +			vma_end_read(vma);
>> +			break;
>> +		}
> 
> This should also do the FOLL_ANON check from check_vma_flags().
> 
> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
> 
> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
> 		return -EFAULT;

Duplicating GUP logic in a non-GUP file. Splendid. :)

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:07     ` David Hildenbrand (Arm)
@ 2026-06-18 17:22       ` Usama Arif
  0 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2026-06-18 17:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan



On 18/06/2026 18:07, David Hildenbrand (Arm) wrote:
> On 6/18/26 19:01, Usama Arif wrote:
>> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
>>
>>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>>> uses get_user_pages_remote(), which faults pages in.  For the common
>>> case of reading memory that is already resident -- /proc/PID/cmdline,
>>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>>> unnecessary and is badly contended on large machines.
>>>
>>> Add an opportunistic, read-only fast path that transfers what it can
>>> without the mmap lock.  For each address it takes the per-VMA lock with
>>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>>> a present page before copying it out.  Anything non-trivial -- a not-
>>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>>> a race with a VMA writer -- falls back to the existing mmap_lock path
>>> for the remainder.
>>>
>>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>>> for the fast path; the untag mask is a stable per-mm value.
>>>
>>> Only reads are handled here; writes keep using the slow path.
>>>
>>> Assisted-by: Claude:claude-opus-4-8
>>> Signed-off-by: Rik van Riel <riel@surriel.com>
>>> ---
>>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>>  include/linux/uaccess.h           |  11 ++
>>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index 4a52497ba6a1..c6fac900a747 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>>  })
>>>  
>>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>>> +							    unsigned long addr)
>>> +{
>>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>>> +}
>>> +
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>>> +})
>>> +
>>>  #endif
>>>  
>>>  #define valid_user_address(x) \
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index 8a264662b242..c8c83372c9d8 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -34,6 +34,17 @@
>>>  })
>>>  #endif
>>>  
>>> +/*
>>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>>> + */
>>> +#ifndef untagged_addr_remote_unlocked
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>>> +	(void)(mm);					\
>>> +	untagged_addr(addr);				\
>>> +})
>>> +#endif
>>> +
>>>  #ifdef masked_user_access_begin
>>>   #define can_do_masked_user_access() 1
>>>  # ifndef masked_user_write_access_begin
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 86a973119bd4..0b23b82eaa18 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -42,6 +42,8 @@
>>>  #include <linux/kernel_stat.h>
>>>  #include <linux/mm.h>
>>>  #include <linux/mm_inline.h>
>>> +#include <linux/secretmem.h>
>>> +#include <linux/pagewalk.h>
>>>  #include <linux/sched/mm.h>
>>>  #include <linux/sched/numa_balancing.h>
>>>  #include <linux/sched/task.h>
>>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>>  #endif
>>>  
>>> +/*
>>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>>> + */
>>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>>> +/*
>>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>>> + *
>>> + * Memory already resident in @mm can be read without taking the heavily
>>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>>> + *
>>> + * Anything that would require faulting a page in, touching a hugetlb or
>>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>>> + * path in __access_remote_vm().  Only reads are handled here.
>>> + *
>>> + * Returns the number of bytes transferred via the fast path.
>>> + */
>>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>>> +				 void *buf, int len, unsigned int gup_flags)
>>> +{
>>> +	void *old_buf = buf;
>>> +
>>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>>> +
>>> +	while (len) {
>>> +		struct vm_area_struct *vma;
>>> +		vm_flags_t vm_flags;
>>> +
>>> +		vma = lock_vma_under_rcu(mm, addr);
>>> +		if (!vma)
>>> +			break;
>>> +
>>> +		/*
>>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>>> +		 * per VMA; anything not positively allowed falls back to the
>>> +		 * slow path, which re-validates everything.
>>> +		 */
>>> +		vm_flags = vma->vm_flags;
>>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>>> +		    (!(vm_flags & VM_READ) &&
>>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>>> +			vma_end_read(vma);
>>> +			break;
>>> +		}
>>
>> This should also do the FOLL_ANON check from check_vma_flags().
>>
>> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
>>
>> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
>> 		return -EFAULT;
> 
> Duplicating GUP logic in a non-GUP file. Splendid. :)
> 

Haha probably just need a common helper.
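
For illustration only, a rough userspace sketch of what such a shared
check could look like. Nothing here is from the patch: toy_vma,
toy_vma_permits_fast_read() and the flag values are hypothetical
stand-ins, and only the decision logic mirrors the checks quoted above,
with the FOLL_ANON test added.

```c
#include <stdbool.h>

/* Illustrative flag values only; these are NOT the kernel's definitions. */
#define VM_READ     0x01UL
#define VM_MAYREAD  0x02UL
#define VM_IO       0x04UL
#define VM_PFNMAP   0x08UL

#define FOLL_FORCE  0x01U
#define FOLL_ANON   0x02U

/* Toy stand-in for the VMA properties the checks look at. */
struct toy_vma {
	unsigned long vm_flags;
	bool anon;      /* models vma_is_anonymous() */
	bool hugetlb;   /* models is_vm_hugetlb_page() */
	bool secretmem; /* models vma_is_secretmem() */
};

/* Hypothetical helper: may a remote read proceed on this VMA? */
static bool toy_vma_permits_fast_read(const struct toy_vma *vma,
				      unsigned int gup_flags)
{
	unsigned long vm_flags = vma->vm_flags;

	/* Mappings that need ->access() or that the walk cannot handle. */
	if (vm_flags & (VM_IO | VM_PFNMAP))
		return false;
	if (vma->hugetlb || vma->secretmem)
		return false;

	/* The FOLL_ANON check pointed out in the review. */
	if ((gup_flags & FOLL_ANON) && !vma->anon)
		return false;

	/* Read-side permissions: VM_READ, or FOLL_FORCE plus VM_MAYREAD. */
	if (!(vm_flags & VM_READ)) {
		if (!(gup_flags & FOLL_FORCE))
			return false;
		if (!(vm_flags & VM_MAYREAD))
			return false;
	}
	return true;
}
```

Keeping the decision in one function would let both the GUP path and
the fast path call it, instead of maintaining two copies of the flag
logic.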



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-17 13:33     ` Suren Baghdasaryan
@ 2026-06-18 20:37       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 20:37 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/17/26 15:33, Suren Baghdasaryan wrote:
> On Wed, Jun 17, 2026 at 2:42 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 6/17/26 03:10, Suren Baghdasaryan wrote:
>>>
>>> Thanks for the patchset Rik!
>>> Previously when I looked into using per-VMA locks in
>>> access_remote_vm(), the biggest hurdle was get_user_pages_remote(),
>>> which required mmap_lock. Your implementation avoids it altogether and
>>> keeps the code much simpler than what I expected.
>>
>> But, wouldn't we, in general, also want to teach GUP to just work with per-VMA
>> locks?
> 
> Matthew suggested using gup_fast in access_remote_vm before, and I
> looked into that. The biggest issue there is that gup_fast assumes it
> always operates on current->mm, not on the remote one. Reworking that
> is quite an undertaking.

Right, that's trickier. IIRC the CPU from a remote MM might not get an IPI
sent to sync (but my memory is fuzzy on that).

> Teaching GUP in general to work with per-VMA locks I think would also
> be much harder than what this patchset does and would require a GUP
> expert (which I am unfortunately not).

Well, "harder" is not really an excuse ;)

Where a folio_walk really shines is that you can just walk a PMD entry and
process it all at once, instead of returning 512 individual pages. Where it
doesn't shine is that you have to walk the complete page table again for each
individual PTE.

... which is also what we do right now through get_user_page_vma_remote(), which
is rather suboptimal.

Ideally, you'd obtain multiple page ranges (with upper limit on the ranges) in
one shot, whereby each page range belongs to the same compound page, and there
is exactly one page/folio ref on a range. [we discussed that in another context
recently]
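
To make the shape of that idea concrete, here is a toy userspace model,
not a proposed kernel interface. toy_page_range and toy_collect_ranges()
are hypothetical names, and a plain integer stands in for a folio
pointer. It only shows the coalescing: consecutive pages backed by the
same folio collapse into one range, capped at an upper limit of ranges.

```c
#include <stddef.h>

/* Hypothetical: a run of pages in one folio, holding a single reference. */
struct toy_page_range {
	unsigned long folio_id; /* stand-in for a struct folio pointer */
	size_t first_page;      /* index of the first page in the request */
	size_t nr_pages;        /* number of pages covered by this range */
};

/*
 * Coalesce consecutive pages that share a folio into ranges.
 * folio_of[i] names the folio backing page i of the request.
 * Stops once max_ranges ranges are filled. Returns the number of ranges
 * and stores the number of pages covered so far in *pages_done.
 */
static size_t toy_collect_ranges(const unsigned long *folio_of, size_t nr_pages,
				 struct toy_page_range *ranges,
				 size_t max_ranges, size_t *pages_done)
{
	size_t nr = 0, i = 0;

	while (i < nr_pages && nr < max_ranges) {
		size_t start = i;

		/* Extend the run while the backing folio stays the same. */
		while (i < nr_pages && folio_of[i] == folio_of[start])
			i++;

		ranges[nr].folio_id = folio_of[start];
		ranges[nr].first_page = start;
		ranges[nr].nr_pages = i - start;
		nr++;
	}

	*pages_done = i;
	return nr;
}
```

A caller would loop until *pages_done reaches the request size, taking
and dropping one reference per returned range rather than one per page.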

Then, you can just let GUP do the GUP work, without re-implementing it for some
special cases elsewhere. And others can benefit from the work.


So I'd really like us to find out what it would take to teach ordinary GUP (or
better, some new GUP interface) to run under the VMA lock. We can start with the
existing interface to GUP a single page, to keep it simple.

Maybe having a new GUP interface that consumes a VMA instead of an MM could be
a first step towards enabling per-VMA locks?

All GUP does is walk page tables and call fault handlers. userfaultfd is nasty,
but existing page faults must also deal with that by falling back to the MM
lock, so it sounds like a solvable problem with some churn?

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-06-18 20:37 UTC | newest]

Thread overview: 13+ messages
2026-06-16 19:02 [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
2026-06-16 19:02 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
2026-06-18 16:40   ` Usama Arif
2026-06-16 19:02 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-17  6:19   ` Suren Baghdasaryan
2026-06-18 17:01   ` Usama Arif
2026-06-18 17:07     ` David Hildenbrand (Arm)
2026-06-18 17:22       ` Usama Arif
2026-06-17  1:10 ` [PATCH 0/3] mm: __access_remote_vm with per-VMA lock Suren Baghdasaryan
2026-06-17  9:42   ` David Hildenbrand (Arm)
2026-06-17 13:33     ` Suren Baghdasaryan
2026-06-18 20:37       ` David Hildenbrand (Arm)
