[PATCH 3/3] mm: read remote memory without the mmap lock where possible


* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:02 [PATCH " Rik van Riel
@ 2026-06-16 19:03 ` Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Rik van Riel @ 2026-06-16 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common
case of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path that transfers what it can
without the mmap lock.  For each address it takes the per-VMA lock with
lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
a present page before copying it out.  Anything non-trivial -- a not-
present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
a race with a VMA writer -- falls back to the existing mmap_lock path
for the remainder.

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/uaccess_64.h |  12 +++
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
 3 files changed, 188 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..c6fac900a747 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
 })
 
+/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE((mm)->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 #endif
 
 #define valid_user_address(x) \
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..0b23b82eaa18 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the heavily
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
+ * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Anything that would require faulting a page in, touching a hugetlb or
+ * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
+ * path in __access_remote_vm().  Only reads are handled here.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	while (len) {
+		struct vm_area_struct *vma;
+		vm_flags_t vm_flags;
+
+		vma = lock_vma_under_rcu(mm, addr);
+		if (!vma)
+			break;
+
+		/*
+		 * Mirror the read-side permission checks of check_vma_flags(),
+		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
+		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
+		 * per VMA; anything not positively allowed falls back to the
+		 * slow path, which re-validates everything.
+		 */
+		vm_flags = vma->vm_flags;
+		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
+		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
+		    (!(vm_flags & VM_READ) &&
+		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
+			vma_end_read(vma);
+			break;
+		}
+
+		/*
+		 * Copy as much of this VMA as we can without re-acquiring the
+		 * per-VMA lock; re-lock only when @addr leaves the VMA.
+		 */
+		while (len && addr < vma->vm_end) {
+			struct folio_walk fw;
+			struct folio *folio;
+			struct page *page;
+			unsigned long entry_size, entry_left, folio_left, span;
+			unsigned long copied, idx0;
+			int offset;
+
+			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+			if (!folio) {
+				vma_end_read(vma);
+				goto out;
+			}
+			page = fw.page;
+			if (!page) {
+				folio_walk_end(&fw, vma);
+				vma_end_read(vma);
+				goto out;
+			}
+			/* Pin the folio so it stays valid after the PTL is dropped. */
+			folio_get(folio);
+			folio_walk_end(&fw, vma);
+
+			/*
+			 * folio_walk_start() validated exactly one mapping entry,
+			 * which covers a contiguous, present run of this folio:
+			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
+			 * for a pud leaf.  Copy up to the end of that entry,
+			 * bounded by the folio, the VMA and len, so a huge mapping
+			 * is handled in one walk instead of per page.
+			 */
+			offset = offset_in_page(addr);
+			switch (fw.level) {
+			case FW_LEVEL_PUD:
+				entry_size = PUD_SIZE;
+				break;
+			case FW_LEVEL_PMD:
+				entry_size = PMD_SIZE;
+				break;
+			default:
+				entry_size = PAGE_SIZE;
+				break;
+			}
+			entry_left = entry_size - (addr & (entry_size - 1));
+			idx0 = folio_page_idx(folio, page);
+			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
+				     offset;
+			span = min3((unsigned long)len, entry_left, folio_left);
+			span = min(span, vma->vm_end - addr);
+
+			/*
+			 * Copy the span page-by-page: kmap_local_folio() maps one
+			 * page on HIGHMEM and copy_from_user_page() flushes per
+			 * page on aliasing caches, but the page tables are not
+			 * re-walked.  The span borrows the single folio reference
+			 * taken above, so each mapping is dropped with
+			 * kunmap_local() (not folio_release_kmap(), which would
+			 * also drop a folio reference per page).
+			 */
+			for (copied = 0; copied < span; ) {
+				unsigned long foff = offset + copied;
+				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
+				int poff = foff & ~PAGE_MASK;
+				int chunk = min_t(unsigned long, span - copied,
+						  PAGE_SIZE - poff);
+				void *maddr = kmap_local_folio(folio,
+						pidx << PAGE_SHIFT);
+
+				copy_from_user_page(vma, folio_page(folio, pidx),
+						    addr + copied, buf + copied,
+						    maddr + poff, chunk);
+				kunmap_local(maddr);
+				copied += chunk;
+			}
+
+			folio_put(folio);
+			len -= span;
+			buf += span;
+			addr += span;
+		}
+		vma_end_read(vma);
+	}
+out:
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
-- 
2.53.0-Meta




* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-19 12:24     ` Lorenzo Stoakes
  2026-06-19 13:46     ` Rik van Riel
  2026-06-18 17:01   ` Usama Arif
  2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 2 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2026-06-17  6:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
>
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.

I don't think we should use per-VMA locks when the read spans multiple
VMAs. Because we lock only one VMA at a time, we risk reading
inconsistent data: while we lock and read one VMA, its neighboring VMA
can be unmapped and another one mapped in its place, so a read spanning
both VMAs can return inconsistent data. access_remote_vm_fast() can
check whether the entire read is contained within one VMA and, if not,
fall back to mmap_lock.
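
A minimal sketch of that containment check, written as a userspace model
rather than kernel code. struct vma_range and read_fits_in_vma() are
hypothetical names used only for illustration; the only thing taken from
struct vm_area_struct is the convention that vm_start is inclusive and
vm_end is exclusive.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the vm_start/vm_end fields of a VMA. */
struct vma_range {
	unsigned long vm_start;	/* first address in the VMA (inclusive) */
	unsigned long vm_end;	/* first address past the VMA (exclusive) */
};

/*
 * Return true when [addr, addr + len) lies entirely inside @vma.
 * The comparison is done on the remaining room in the VMA so that
 * addr + len is never computed and cannot overflow.
 */
static bool read_fits_in_vma(const struct vma_range *vma,
			     unsigned long addr, size_t len)
{
	if (addr < vma->vm_start || addr >= vma->vm_end)
		return false;
	return len <= vma->vm_end - addr;
}
```

Under this assumption, the fast path would run the check once after
lock_vma_under_rcu() succeeds and leave the whole request to the
mmap_lock path whenever it returns false.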

>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>         (__force __typeof__(addr))__untagged_addr_remote(mm, __addr);   \
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +                                                           unsigned long addr)
> +{
> +       return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)        ({                      \
> +       unsigned long __addr = (__force unsigned long)(addr);           \
> +       (__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)        ({      \
> +       (void)(mm);                                     \
> +       untagged_addr(addr);                            \
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Just FYI, Dave Hansen is working on a patchset [1] that removes
CONFIG_PER_VMA_LOCK and makes per-VMA locks always available. That
would simplify this patch too, since the guard above would then only
need to test CONFIG_MMU_GATHER_RCU_TABLE_FREE.

[1] https://lore.kernel.org/all/20260610230409.A44D29FA@davehans-spike.ostc.intel.com/

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       void *old_buf = buf;
> +
> +       addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +       while (len) {
> +               struct vm_area_struct *vma;
> +               vm_flags_t vm_flags;
> +
> +               vma = lock_vma_under_rcu(mm, addr);
> +               if (!vma)
> +                       break;
> +
> +               /*
> +                * Mirror the read-side permission checks of check_vma_flags(),
> +                * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +                * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +                * per VMA; anything not positively allowed falls back to the
> +                * slow path, which re-validates everything.
> +                */
> +               vm_flags = vma->vm_flags;
> +               if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +                   is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +                   (!(vm_flags & VM_READ) &&
> +                    (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +                       vma_end_read(vma);
> +                       break;
> +               }
> +
> +               /*
> +                * Copy as much of this VMA as we can without re-acquiring the
> +                * per-VMA lock; re-lock only when @addr leaves the VMA.
> +                */
> +               while (len && addr < vma->vm_end) {
> +                       struct folio_walk fw;
> +                       struct folio *folio;
> +                       struct page *page;
> +                       unsigned long entry_size, entry_left, folio_left, span;
> +                       unsigned long copied, idx0;
> +                       int offset;
> +
> +                       folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +                       if (!folio) {
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       page = fw.page;
> +                       if (!page) {
> +                               folio_walk_end(&fw, vma);
> +                               vma_end_read(vma);
> +                               goto out;
> +                       }
> +                       /* Pin the folio so it stays valid after the PTL is dropped. */
> +                       folio_get(folio);
> +                       folio_walk_end(&fw, vma);
> +
> +                       /*
> +                        * folio_walk_start() validated exactly one mapping entry,
> +                        * which covers a contiguous, present run of this folio:
> +                        * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +                        * for a pud leaf.  Copy up to the end of that entry,
> +                        * bounded by the folio, the VMA and len, so a huge mapping
> +                        * is handled in one walk instead of per page.
> +                        */
> +                       offset = offset_in_page(addr);
> +                       switch (fw.level) {
> +                       case FW_LEVEL_PUD:
> +                               entry_size = PUD_SIZE;
> +                               break;
> +                       case FW_LEVEL_PMD:
> +                               entry_size = PMD_SIZE;
> +                               break;
> +                       default:
> +                               entry_size = PAGE_SIZE;
> +                               break;
> +                       }
> +                       entry_left = entry_size - (addr & (entry_size - 1));
> +                       idx0 = folio_page_idx(folio, page);
> +                       folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +                                    offset;
> +                       span = min3((unsigned long)len, entry_left, folio_left);
> +                       span = min(span, vma->vm_end - addr);
> +
> +                       /*
> +                        * Copy the span page-by-page: kmap_local_folio() maps one
> +                        * page on HIGHMEM and copy_from_user_page() flushes per
> +                        * page on aliasing caches, but the page tables are not
> +                        * re-walked.  The span borrows the single folio reference
> +                        * taken above, so each mapping is dropped with
> +                        * kunmap_local() (not folio_release_kmap(), which would
> +                        * also drop a folio reference per page).
> +                        */
> +                       for (copied = 0; copied < span; ) {
> +                               unsigned long foff = offset + copied;
> +                               unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +                               int poff = foff & ~PAGE_MASK;
> +                               int chunk = min_t(unsigned long, span - copied,
> +                                                 PAGE_SIZE - poff);
> +                               void *maddr = kmap_local_folio(folio,
> +                                               pidx << PAGE_SHIFT);
> +
> +                               copy_from_user_page(vma, folio_page(folio, pidx),
> +                                                   addr + copied, buf + copied,
> +                                                   maddr + poff, chunk);
> +                               kunmap_local(maddr);
> +                               copied += chunk;
> +                       }
> +
> +                       folio_put(folio);
> +                       len -= span;
> +                       buf += span;
> +                       addr += span;
> +               }
> +               vma_end_read(vma);
> +       }
> +out:
> +       return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +                                void *buf, int len, unsigned int gup_flags)
> +{
> +       return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>         void *old_buf = buf;
>         int write = gup_flags & FOLL_WRITE;
>
> +       /*
> +        * Try the lockless fast path for reads first; it transfers what it can
> +        * from resident memory without taking mmap_lock, and leaves the
> +        * remainder (if any) to the slow path below.
> +        */
> +       if (!write) {
> +               int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +               addr += done;
> +               buf += done;
> +               len -= done;
> +               if (!len)
> +                       return buf - old_buf;
> +       }
> +
>         if (mmap_read_lock_killable(mm))
> -               return 0;
> +               return buf - old_buf;
>
>         /* Untag the address before looking up the VMA */
>         addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
@ 2026-06-18 17:01   ` Usama Arif
  2026-06-18 17:07     ` David Hildenbrand (Arm)
  2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 1 reply; 18+ messages in thread
From: Usama Arif @ 2026-06-18 17:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Usama Arif, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan

On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:

> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
> 
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
> 
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
> 
> Only reads are handled here; writes keep using the slow path.
> 
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>  
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  #endif
>  
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>  
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\
> +	untagged_addr(addr);				\
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>  
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */
> +		vm_flags = vma->vm_flags;
> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&
> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
> +			vma_end_read(vma);
> +			break;
> +		}

This should also do the FOLL_ANON check from check_vma_flags().

check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:

	if ((gup_flags & FOLL_ANON) && !vma_anon)
		return -EFAULT;

That flag is used by fs/proc/base.c for /proc/PID/cmdline and
/proc/PID/environ.  It was added by commit 7f7ccc2ccc2e ("proc: do not
access cmdline nor environ from file-backed areas"), which fixed
CVE-2018-1120 by making those proc files refuse file-backed argv/env
areas.
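
A minimal sketch of the combined read-side check with FOLL_ANON added,
written as a userspace model rather than kernel code. The M_* flag
values and fast_path_may_read() are hypothetical and exist only for
illustration; the kernel's real bit assignments differ, and the
VM_IO/VM_PFNMAP, hugetlb and secretmem exclusions from the patch are
left out here.

```c
#include <stdbool.h>

/* Illustrative flag values only, not the kernel's real bit assignments. */
#define M_VM_READ	0x1u
#define M_VM_MAYREAD	0x2u
#define M_FOLL_FORCE	0x1u
#define M_FOLL_ANON	0x2u

/*
 * Model of the read-side permission decision:
 *  - a FOLL_ANON caller is refused on any non-anonymous VMA;
 *  - otherwise VM_READ is sufficient;
 *  - without VM_READ, only FOLL_FORCE on a VM_MAYREAD mapping is allowed.
 * @vma_anon models the result of vma_is_anonymous().
 */
static bool fast_path_may_read(unsigned int vm_flags, bool vma_anon,
			       unsigned int gup_flags)
{
	if ((gup_flags & M_FOLL_ANON) && !vma_anon)
		return false;
	if (vm_flags & M_VM_READ)
		return true;
	return (gup_flags & M_FOLL_FORCE) && (vm_flags & M_VM_MAYREAD);
}
```

In the fast path a false result would mean vma_end_read() and a break
to the slow path, where check_vma_flags() returns the actual error.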

> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */
> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;
> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;
> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {
> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));
> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;
> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);
> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */
> +			for (copied = 0; copied < span; ) {
> +				unsigned long foff = offset + copied;
> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);
> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}

The __access_remote_vm() slow path calls get_user_page_vma_remote(), which calls
get_user_pages_remote(). get_user_pages_remote() adds FOLL_TOUCH, and the
page-table walk eventually reaches follow_page_pte(), which marks the folio
accessed when FOLL_TOUCH is set.

The new resident-page fast path copies the same data without doing an
equivalent folio_mark_accessed().

That changes reclaim behaviour for pages repeatedly read through
access_remote_vm(), such as /proc/PID/cmdline polling. I think you
should mark the folio as accessed.


> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);
> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>  
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */
> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;
> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;
>  
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> -- 
> 2.53.0-Meta
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:01   ` Usama Arif
@ 2026-06-18 17:07     ` David Hildenbrand (Arm)
  2026-06-18 17:22       ` Usama Arif
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-18 17:07 UTC (permalink / raw)
  To: Usama Arif, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On 6/18/26 19:01, Usama Arif wrote:
> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
> 
>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>> uses get_user_pages_remote(), which faults pages in.  For the common
>> case of reading memory that is already resident -- /proc/PID/cmdline,
>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>> unnecessary and is badly contended on large machines.
>>
>> Add an opportunistic, read-only fast path that transfers what it can
>> without the mmap lock.  For each address it takes the per-VMA lock with
>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>> a present page before copying it out.  Anything non-trivial -- a not-
>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>> a race with a VMA writer -- falls back to the existing mmap_lock path
>> for the remainder.
>>
>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>> for the fast path; the untag mask is a stable per-mm value.
>>
>> Only reads are handled here; writes keep using the slow path.
>>
>> Assisted-by: Claude:claude-opus-4-8
>> Signed-off-by: Rik van Riel <riel@surriel.com>
>> ---
>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>  include/linux/uaccess.h           |  11 ++
>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index 4a52497ba6a1..c6fac900a747 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>  })
>>  
>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>> +							    unsigned long addr)
>> +{
>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>> +}
>> +
>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>> +})
>> +
>>  #endif
>>  
>>  #define valid_user_address(x) \
>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>> index 8a264662b242..c8c83372c9d8 100644
>> --- a/include/linux/uaccess.h
>> +++ b/include/linux/uaccess.h
>> @@ -34,6 +34,17 @@
>>  })
>>  #endif
>>  
>> +/*
>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>> + */
>> +#ifndef untagged_addr_remote_unlocked
>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>> +	(void)(mm);					\
>> +	untagged_addr(addr);				\
>> +})
>> +#endif
>> +
>>  #ifdef masked_user_access_begin
>>   #define can_do_masked_user_access() 1
>>  # ifndef masked_user_write_access_begin
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 86a973119bd4..0b23b82eaa18 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -42,6 +42,8 @@
>>  #include <linux/kernel_stat.h>
>>  #include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>> +#include <linux/secretmem.h>
>> +#include <linux/pagewalk.h>
>>  #include <linux/sched/mm.h>
>>  #include <linux/sched/numa_balancing.h>
>>  #include <linux/sched/task.h>
>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>  #endif
>>  
>> +/*
>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>> + */
>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>> +/*
>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>> + *
>> + * Memory already resident in @mm can be read without taking the heavily
>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>> + *
>> + * Anything that would require faulting a page in, touching a hugetlb or
>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>> + * path in __access_remote_vm().  Only reads are handled here.
>> + *
>> + * Returns the number of bytes transferred via the fast path.
>> + */
>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>> +				 void *buf, int len, unsigned int gup_flags)
>> +{
>> +	void *old_buf = buf;
>> +
>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>> +
>> +	while (len) {
>> +		struct vm_area_struct *vma;
>> +		vm_flags_t vm_flags;
>> +
>> +		vma = lock_vma_under_rcu(mm, addr);
>> +		if (!vma)
>> +			break;
>> +
>> +		/*
>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>> +		 * per VMA; anything not positively allowed falls back to the
>> +		 * slow path, which re-validates everything.
>> +		 */
>> +		vm_flags = vma->vm_flags;
>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>> +		    (!(vm_flags & VM_READ) &&
>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>> +			vma_end_read(vma);
>> +			break;
>> +		}
> 
> This should also do the FOLL_ANON check from check_vma_flags().
> 
> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
> 
> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
> 		return -EFAULT;

Duplicating GUP logic in a non-GUP file. Splendid. :)

-- 
Cheers,

David



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-18 17:07     ` David Hildenbrand (Arm)
@ 2026-06-18 17:22       ` Usama Arif
  0 siblings, 0 replies; 18+ messages in thread
From: Usama Arif @ 2026-06-18 17:22 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan



On 18/06/2026 18:07, David Hildenbrand (Arm) wrote:
> On 6/18/26 19:01, Usama Arif wrote:
>> On Tue, 16 Jun 2026 15:03:00 -0400 Rik van Riel <riel@surriel.com> wrote:
>>
>>> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
>>> uses get_user_pages_remote(), which faults pages in.  For the common
>>> case of reading memory that is already resident -- /proc/PID/cmdline,
>>> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
>>> unnecessary and is badly contended on large machines.
>>>
>>> Add an opportunistic, read-only fast path that transfers what it can
>>> without the mmap lock.  For each address it takes the per-VMA lock with
>>> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
>>> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
>>> a present page before copying it out.  Anything non-trivial -- a not-
>>> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
>>> a race with a VMA writer -- falls back to the existing mmap_lock path
>>> for the remainder.
>>>
>>> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
>>> for the fast path; the untag mask is a stable per-mm value.
>>>
>>> Only reads are handled here; writes keep using the slow path.
>>>
>>> Assisted-by: Claude:claude-opus-4-8
>>> Signed-off-by: Rik van Riel <riel@surriel.com>
>>> ---
>>>  arch/x86/include/asm/uaccess_64.h |  12 +++
>>>  include/linux/uaccess.h           |  11 ++
>>>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>>>  3 files changed, 188 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>>> index 4a52497ba6a1..c6fac900a747 100644
>>> --- a/arch/x86/include/asm/uaccess_64.h
>>> +++ b/arch/x86/include/asm/uaccess_64.h
>>> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>>>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>>>  })
>>>  
>>> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */
>>> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
>>> +							    unsigned long addr)
>>> +{
>>> +	return addr & READ_ONCE((mm)->context.untag_mask);
>>> +}
>>> +
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
>>> +	unsigned long __addr = (__force unsigned long)(addr);		\
>>> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
>>> +})
>>> +
>>>  #endif
>>>  
>>>  #define valid_user_address(x) \
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index 8a264662b242..c8c83372c9d8 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -34,6 +34,17 @@
>>>  })
>>>  #endif
>>>  
>>> +/*
>>> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
>>> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
>>> + */
>>> +#ifndef untagged_addr_remote_unlocked
>>> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
>>> +	(void)(mm);					\
>>> +	untagged_addr(addr);				\
>>> +})
>>> +#endif
>>> +
>>>  #ifdef masked_user_access_begin
>>>   #define can_do_masked_user_access() 1
>>>  # ifndef masked_user_write_access_begin
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 86a973119bd4..0b23b82eaa18 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -42,6 +42,8 @@
>>>  #include <linux/kernel_stat.h>
>>>  #include <linux/mm.h>
>>>  #include <linux/mm_inline.h>
>>> +#include <linux/secretmem.h>
>>> +#include <linux/pagewalk.h>
>>>  #include <linux/sched/mm.h>
>>>  #include <linux/sched/numa_balancing.h>
>>>  #include <linux/sched/task.h>
>>> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>  EXPORT_SYMBOL_GPL(generic_access_phys);
>>>  #endif
>>>  
>>> +/*
>>> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
>>> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
>>> + */
>>> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
>>> +/*
>>> + * Opportunistic lockless fast path for __access_remote_vm() reads.
>>> + *
>>> + * Memory already resident in @mm can be read without taking the heavily
>>> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
>>> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
>>> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
>>> + *
>>> + * Anything that would require faulting a page in, touching a hugetlb or
>>> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
>>> + * path in __access_remote_vm().  Only reads are handled here.
>>> + *
>>> + * Returns the number of bytes transferred via the fast path.
>>> + */
>>> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
>>> +				 void *buf, int len, unsigned int gup_flags)
>>> +{
>>> +	void *old_buf = buf;
>>> +
>>> +	addr = untagged_addr_remote_unlocked(mm, addr);
>>> +
>>> +	while (len) {
>>> +		struct vm_area_struct *vma;
>>> +		vm_flags_t vm_flags;
>>> +
>>> +		vma = lock_vma_under_rcu(mm, addr);
>>> +		if (!vma)
>>> +			break;
>>> +
>>> +		/*
>>> +		 * Mirror the read-side permission checks of check_vma_flags(),
>>> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
>>> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
>>> +		 * per VMA; anything not positively allowed falls back to the
>>> +		 * slow path, which re-validates everything.
>>> +		 */
>>> +		vm_flags = vma->vm_flags;
>>> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
>>> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
>>> +		    (!(vm_flags & VM_READ) &&
>>> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {
>>> +			vma_end_read(vma);
>>> +			break;
>>> +		}
>>
>> This should also do the FOLL_ANON check from check_vma_flags().
>>
>> check_vma_flags() rejects non-anonymous VMAs when FOLL_ANON is set:
>>
>> 	if ((gup_flags & FOLL_ANON) && !vma_anon)
>> 		return -EFAULT;
> 
> Duplicating GUP logic in a non-GUP file. Splendid. :)
> 

Haha probably just need a common helper.




* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-18 17:01   ` Usama Arif
@ 2026-06-19 12:20   ` Lorenzo Stoakes
  2 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan

On Tue, Jun 16, 2026 at 03:03:00PM -0400, Rik van Riel wrote:
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common
> case of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path that transfers what it can
> without the mmap lock.  For each address it takes the per-VMA lock with
> lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> a present page before copying it out.  Anything non-trivial -- a not-
> present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> a race with a VMA writer -- falls back to the existing mmap_lock path
> for the remainder.
>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8

This feels as if there was a little too much left to AI :)

> Signed-off-by: Rik van Riel <riel@surriel.com>

This needs to be separated into more patches, functions, and thoroughly reworked
to be upstreamable, unfortunately.

It's additionally quite hard to review in this form.

> ---
>  arch/x86/include/asm/uaccess_64.h |  12 +++
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 166 +++++++++++++++++++++++++++++-
>  3 files changed, 188 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..c6fac900a747 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -51,6 +51,18 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  	(__force __typeof__(addr))__untagged_addr_remote(mm, __addr);	\
>  })
>
> +/* Same as __untagged_addr_remote(), but usable without the mmap lock held. */

How? This is pretty vague.

> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE((mm)->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})

I'm confused why you're implementing this and not just calling untagged_addr()
from untagged_addr_remote_unlocked()?

You don't comment or explain this in the commit msg afaict.

> +
>  #endif
>
>  #define valid_user_address(x) \
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */

It's odd you'll comment this like this but not explain the confusing bit as to
why you can't just call untagged_addr()

> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\

I'm not sure this is required?

> +	untagged_addr(addr);				\

Weird again that x86 needs special treatment but not other arches?

> +})
> +#endif
> +

You should really make untagged_addr_remote() call
untagged_addr_remote_unlocked() after its assert, otherwise it's a really odd
inconsistency.
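
To illustrate the layering I mean, here is a minimal userspace sketch (not kernel code). struct toy_mm, its mmap_locked flag and both toy_untag_*() helpers are made-up names for this example; the assert() only stands in for mmap_assert_locked():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the parts of mm_struct this sketch needs. */
struct toy_mm {
	uint64_t untag_mask;	/* bits kept after untagging */
	bool mmap_locked;	/* models "the mmap lock is held" */
};

/* The one place that applies the mask; makes no locking assumptions. */
static inline uint64_t toy_untag_unlocked(const struct toy_mm *mm, uint64_t addr)
{
	return addr & mm->untag_mask;
}

/* Locked variant: assert first, then defer to the unlocked helper. */
static inline uint64_t toy_untag_locked(const struct toy_mm *mm, uint64_t addr)
{
	assert(mm->mmap_locked);
	return toy_untag_unlocked(mm, addr);
}
```

With that shape the two variants cannot drift apart, since the masking logic exists exactly once.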

>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..0b23b82eaa18 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,153 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)

Shall we wait for, or rely on Dave's series to remove CONFIG_PER_VMA_LOCK + make
it permanently on here?

> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the heavily
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page via an
> + * RCU/PTL protected page table walk (relying on MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Anything that would require faulting a page in, touching a hugetlb or
> + * VM_IO/VM_PFNMAP mapping, or that races a VMA writer is left to the mmap_lock
> + * path in __access_remote_vm().  Only reads are handled here.

I think referencing the confusing mess that is the special VMA flags is best
avoided.

I think we could be clearer here, something like:

This is the read fast path. Writes, and anything that needs faulting in or
touches hugetlb or a remap, are handled by the slow path in
__access_remote_vm().


> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)

In general this function is... ugly. Very ugly :)

You nest while -> while -> for and it's all open-coded.

It needs major refactoring - separate out smaller functions, improve the
comments, separate out logic that can be shared with gup etc.

> +{
> +	void *old_buf = buf;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	while (len) {
> +		struct vm_area_struct *vma;
> +		vm_flags_t vm_flags;
> +
> +		vma = lock_vma_under_rcu(mm, addr);
> +		if (!vma)
> +			break;
> +
> +		/*
> +		 * Mirror the read-side permission checks of check_vma_flags(),
> +		 * and exclude what FW_VMA_LOCKED cannot handle (hugetlb) or what
> +		 * needs the ->access() handler (VM_IO/VM_PFNMAP).  Checked once
> +		 * per VMA; anything not positively allowed falls back to the
> +		 * slow path, which re-validates everything.
> +		 */

This feels very overwrought. You're compressing far too much into one lump of
text.

> +		vm_flags = vma->vm_flags;

Please don't use the old VMA flags API for new code. And definitely don't do
some weird vm_flags we keep separate from vma->vm_flags thing here.

	vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT)

> +		if ((vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    is_vm_hugetlb_page(vma) || vma_is_secretmem(vma) ||
> +		    (!(vm_flags & VM_READ) &&

But what does !VM_READ mean exactly? Are you really checking for PROT_NONE?

So vma_is_accessible() is right here no?

> +		     (!(gup_flags & FOLL_FORCE) || !(vm_flags & VM_MAYREAD)))) {

This conditional is abhorrent...

I think a good rule of thumb for this kind of thing is to read it out loud in
English - 'if IO or PFN map or hugetlb or secret mem or either not VM_READ etc.'
- if you're confused by it in English then don't put it in code.

Anyway it should clearly be a separate function like:

	static bool vma_can_use_fast_path(const struct vm_area_struct *vma)
	{
		/* We cannot GUP PFN maps or I/O memory. */
		if (vma_test_any(vma, VMA_IO_BIT, VMA_PFNMAP_BIT))
			return false;
		/* Hugetlb is a special snowflake. */
		if (is_vm_hugetlb_page(vma))
			return false;
		... etc. etc. ...
		return true;
	}

Which is vastly clearer.
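
For what it's worth, a complete version of that shape can be modelled in userspace. This is only a sketch: the TOY_* flag values, struct toy_vma and the 'force' parameter (standing in for FOLL_FORCE) are invented for the example, and it mirrors the conditions in the posted patch rather than prescribing the final set of checks:

```c
#include <stdbool.h>

/* Toy flag bits; the values are illustrative, not the kernel's. */
enum {
	TOY_VM_READ	= 1u << 0,
	TOY_VM_MAYREAD	= 1u << 1,
	TOY_VM_IO	= 1u << 2,
	TOY_VM_PFNMAP	= 1u << 3,
	TOY_VM_HUGETLB	= 1u << 4,
	TOY_VM_SECRET	= 1u << 5,
};

struct toy_vma {
	unsigned int flags;
};

/* Readable outright, or readable because the caller may force it. */
static bool toy_vma_readable(const struct toy_vma *vma, bool force)
{
	if (vma->flags & TOY_VM_READ)
		return true;
	return force && (vma->flags & TOY_VM_MAYREAD);
}

static bool toy_vma_can_use_fast_path(const struct toy_vma *vma, bool force)
{
	/* PFN maps and I/O memory need the slow path. */
	if (vma->flags & (TOY_VM_IO | TOY_VM_PFNMAP))
		return false;
	/* Hugetlb is excluded by the posted patch. */
	if (vma->flags & TOY_VM_HUGETLB)
		return false;
	/* Secret memory must never be read remotely. */
	if (vma->flags & TOY_VM_SECRET)
		return false;
	return toy_vma_readable(vma, force);
}
```

Each early return carries exactly one reason, which is what makes it readable out loud.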


> +			vma_end_read(vma);
> +			break;
> +		}
> +
> +		/*
> +		 * Copy as much of this VMA as we can without re-acquiring the
> +		 * per-VMA lock; re-lock only when @addr leaves the VMA.
> +		 */

Strange phrasing. I'm not even sure it's a useful comment?

> +		while (len && addr < vma->vm_end) {
> +			struct folio_walk fw;

It'd be good to avoid mystery meat variable names. 'walk'?

> +			struct folio *folio;
> +			struct page *page;
> +			unsigned long entry_size, entry_left, folio_left, span;
> +			unsigned long copied, idx0;

idx0 is a terrible name :)

All these variables tell you the function is too long.

> +			int offset;
> +
> +			folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +			if (!folio) {
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			page = fw.page;
> +			if (!page) {

Under what circumstances would fw.page be NULL when folio is non-NULL?

> +				folio_walk_end(&fw, vma);
> +				vma_end_read(vma);
> +				goto out;
> +			}
> +			/* Pin the folio so it stays valid after the PTL is dropped. */
> +			folio_get(folio);
> +			folio_walk_end(&fw, vma);
> +
> +			/*
> +			 * folio_walk_start() validated exactly one mapping entry,
> +			 * which covers a contiguous, present run of this folio:
> +			 * PAGE_SIZE for a pte, PMD_SIZE for a pmd leaf, PUD_SIZE
> +			 * for a pud leaf.  Copy up to the end of that entry,
> +			 * bounded by the folio, the VMA and len, so a huge mapping
> +			 * is handled in one walk instead of per page.
> +			 */
> +			offset = offset_in_page(addr);
> +			switch (fw.level) {
> +			case FW_LEVEL_PUD:
> +				entry_size = PUD_SIZE;
> +				break;
> +			case FW_LEVEL_PMD:
> +				entry_size = PMD_SIZE;
> +				break;
> +			default:
> +				entry_size = PAGE_SIZE;
> +				break;
> +			}
> +			entry_left = entry_size - (addr & (entry_size - 1));

Surely we have a better way of doing this? At least needs abstracting, a random
switch in the middle of this code is horrid.

> +			idx0 = folio_page_idx(folio, page);
> +			folio_left = ((folio_nr_pages(folio) - idx0) << PAGE_SHIFT) -
> +				     offset;

Couldn't we just keep track of this without this horrid expression?

> +			span = min3((unsigned long)len, entry_left, folio_left);
> +			span = min(span, vma->vm_end - addr);

You add massive comments for some bits, then do extremely confusing open coded
stuff here?

This needs a lot of breaking up.
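
As a rough idea of what "breaking up" could look like, here is a userspace sketch of the same arithmetic split into named helpers. Everything prefixed toy_/TOY_ is hypothetical, and the PMD/PUD sizes assume x86-64 with 4 KiB pages:

```c
#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)
#define TOY_PMD_SIZE	(1UL << 21)	/* assumed: x86-64, 4 KiB pages */
#define TOY_PUD_SIZE	(1UL << 30)	/* assumed: x86-64, 4 KiB pages */

enum toy_level { TOY_LEVEL_PTE, TOY_LEVEL_PMD, TOY_LEVEL_PUD };

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Size in bytes mapped by one page table entry at the given level. */
static unsigned long toy_entry_size(enum toy_level level)
{
	switch (level) {
	case TOY_LEVEL_PUD:
		return TOY_PUD_SIZE;
	case TOY_LEVEL_PMD:
		return TOY_PMD_SIZE;
	default:
		return TOY_PAGE_SIZE;
	}
}

/* Bytes from addr to the end of the aligned entry that contains it. */
static unsigned long toy_bytes_left_in_entry(unsigned long addr,
					     enum toy_level level)
{
	unsigned long size = toy_entry_size(level);

	return size - (addr & (size - 1));
}

/* Bytes from addr to the end of the folio, given addr's page index in it. */
static unsigned long toy_bytes_left_in_folio(unsigned long addr,
					     unsigned long folio_nr_pages,
					     unsigned long page_index)
{
	return ((folio_nr_pages - page_index) << TOY_PAGE_SHIFT) -
	       (addr & (TOY_PAGE_SIZE - 1));
}

/* Largest run we may copy after a single page table walk. */
static unsigned long toy_copy_span(unsigned long addr, unsigned long len,
				   enum toy_level level,
				   unsigned long folio_nr_pages,
				   unsigned long page_index,
				   unsigned long vma_end)
{
	unsigned long span = len;

	span = toy_min(span, toy_bytes_left_in_entry(addr, level));
	span = toy_min(span, toy_bytes_left_in_folio(addr, folio_nr_pages,
						     page_index));
	return toy_min(span, vma_end - addr);
}
```

Each bound then has a name and a one-line comment, rather than one dense block comment covering all of them.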

> +
> +			/*
> +			 * Copy the span page-by-page: kmap_local_folio() maps one
> +			 * page on HIGHMEM and copy_from_user_page() flushes per
> +			 * page on aliasing caches, but the page tables are not
> +			 * re-walked.  The span borrows the single folio reference
> +			 * taken above, so each mapping is dropped with
> +			 * kunmap_local() (not folio_release_kmap(), which would
> +			 * also drop a folio reference per page).
> +			 */

This is a really confusing mass of text that is really dense and hard to
parse. Clarity is king.

In any case this should be separated out.

> +			for (copied = 0; copied < span; ) {

Very odd for loop.

	copied = 0;
	while (copied < span) {
		...
	}

Would be better. But I think reworking it so that a normal
for (init; cond; incr) loop works would be better still?
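
A userspace sketch of the loop in that form, with descriptive names. It models the folio as one contiguous buffer and uses memcpy() where the kernel code would kmap and call copy_from_user_page(); toy_copy_span_by_page() and TOY_PAGE_SIZE are made-up names:

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	4096UL

/*
 * Copy 'span' bytes starting 'start' bytes into a folio, never letting a
 * single memcpy() cross a page boundary.  Returns the bytes copied.
 */
static size_t toy_copy_span_by_page(char *dst, const char *folio_base,
				    size_t start, size_t span)
{
	size_t copied, chunk = 0;

	for (copied = 0; copied < span; copied += chunk) {
		size_t offset_in_folio = start + copied;
		size_t offset_in_page = offset_in_folio % TOY_PAGE_SIZE;

		/* Stop at the end of the current page or of the span. */
		chunk = span - copied;
		if (chunk > TOY_PAGE_SIZE - offset_in_page)
			chunk = TOY_PAGE_SIZE - offset_in_page;

		memcpy(dst + copied, folio_base + offset_in_folio, chunk);
	}
	return copied;
}
```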

> +				unsigned long foff = offset + copied;

foff :)) now I won't be childish :P

I really dislike overly compressed variable names. It's vague. File offset? I
guess you mean folio offset right?

and why did you call it plain 'offset' before but now specify folio but as 'f'?

> +				unsigned long pidx = idx0 + (foff >> PAGE_SHIFT);

'pidx'?

An equally unnecessary and confusingly compressed variable name. We can live with
page_index if that's what you mean?

Also it's uncertain what the units are. It's ok to say 'bytes' and 'nr_pages' or
'page_nr' or something, and far clearer.

This should obviously be in another function anyway given indentation levels here.

> +				int poff = foff & ~PAGE_MASK;
> +				int chunk = min_t(unsigned long, span - copied,
> +						  PAGE_SIZE - poff);
> +				void *maddr = kmap_local_folio(folio,
> +						pidx << PAGE_SHIFT);
> +
> +				copy_from_user_page(vma, folio_page(folio, pidx),
> +						    addr + copied, buf + copied,
> +						    maddr + poff, chunk);
> +				kunmap_local(maddr);
> +				copied += chunk;
> +			}
> +
> +			folio_put(folio);
> +			len -= span;
> +			buf += span;
> +			addr += span;
> +		}
> +		vma_end_read(vma);

Really hard to keep track of what's what here.

> +	}
> +out:
> +	return buf - old_buf;
> +}
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,8 +7220,23 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */

This is a weird comment, you should describe what access_remote_vm_fast() does
in access_remote_vm_fast(). You also don't mention the !write here which is the
thing people might wonder about.

I think the code is self-documenting anyway - try fast path - pretty clear.


> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);

Can be const.

Can't errors arise in access_remote_vm_fast()? In general it seems it'd make
more sense to return an error/bool and have done as an output param, rather
than infer stuff from done.

> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;

So usual case will be it does everything right? So you do some useless
arithmetic and then return buf - old_buf.

Should probably instead have a return value.

But in general __access_remote_vm() is horrible. I think if you add new features
it's only right you spend some commits cleaning up first.

Otherwise we heap more stuff on top of broken stuff on and on and things get
messier + messier.


> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;

Err, there are other cases where you return 0 here, e.g.:

	/* Avoid triggering the temporary warning in __get_user_pages */
	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
		return 0;

So you probably need to fix those up too?

Probably better to just have a return value declared in the function.
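
Roughly this shape, as a userspace sketch. The toy_* functions and their parameters are invented for illustration: 'resident' models how many bytes the fast path can serve and 'lock_fails' models mmap_read_lock_killable() failing:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical fast path: reports progress through *done and returns true
 * only when the whole request was satisfied.
 */
static bool toy_read_fast(size_t len, size_t resident, size_t *done)
{
	*done = len < resident ? len : resident;
	return *done == len;
}

/* Hypothetical slow path: may fail before copying anything. */
static size_t toy_read_slow(size_t len, bool lock_fails)
{
	return lock_fails ? 0 : len;
}

/* One running total, so every exit reports bytes already transferred. */
static size_t toy_access_remote(size_t len, size_t resident, bool lock_fails)
{
	size_t ret = 0;

	if (toy_read_fast(len, resident, &ret))
		return ret;

	ret += toy_read_slow(len - ret, lock_fails);
	return ret;
}
```

The point is that no early exit can forget what the fast path already copied, because there is no bare 'return 0' left to fix up.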

>
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
> --
> 2.53.0-Meta
>

Thanks, Lorenzo



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-17  6:19   ` Suren Baghdasaryan
@ 2026-06-19 12:24     ` Lorenzo Stoakes
  2026-06-19 13:46     ` Rik van Riel
  1 sibling, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-06-19 12:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka

On Tue, Jun 16, 2026 at 11:19:12PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 16, 2026 at 12:04 PM Rik van Riel <riel@surriel.com> wrote:
> >
> > __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> > uses get_user_pages_remote(), which faults pages in.  For the common
> > case of reading memory that is already resident -- /proc/PID/cmdline,
> > /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> > unnecessary and is badly contended on large machines.
> >
> > Add an opportunistic, read-only fast path that transfers what it can
> > without the mmap lock.  For each address it takes the per-VMA lock with
> > lock_vma_under_rcu(), re-checks the read-side VMA permissions, and uses
> > folio_walk_start(..., FW_VMA_LOCKED) to grab a short-lived reference to
> > a present page before copying it out.  Anything non-trivial -- a not-
> > present page (needs faulting), a hugetlb or VM_IO/VM_PFNMAP mapping, or
> > a race with a VMA writer -- falls back to the existing mmap_lock path
> > for the remainder.
>
> I don't think we should be using per-VMA locks if the read spans
> multiple VMAs. Doing that would risk a possibility of reading
> inconsistent data since we are locking one VMA at a time. While we

Yeah, very true.

Suren has expounded on the possible cases that can occur elsewhere but you can
observe strange states like that.

You can see tools/testing/selftests/proc/proc-maps-race.c for a sense of it and
https://lore.kernel.org/all/20260426062718.1238437-1-surenb@google.com/

Note that for e.g. madvise() this is exactly what we do.

> load and read VMA, its neighboring VMA can be unmapped and another one
> can be mapped in its place. So, our read spanning both VMAs will
> return inconsistent data. access_remote_vm_fast() can check if the
> entire read is contained within one VMA and if not, fall back to
> mmap_lock.

This would also vastly simplify the code. I expect most real-world cases are
like this anyway?
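
For reference, the containment check being discussed could look roughly like
the sketch below. It is a userspace model, not kernel code: struct vma_range
and request_within_vma() are hypothetical names, and only the vm_start/vm_end
semantics (inclusive start, exclusive end) are taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two struct vm_area_struct fields used here. */
struct vma_range {
	unsigned long vm_start;	/* first address inside the VMA */
	unsigned long vm_end;	/* first address past the VMA */
};

/*
 * Return true if [addr, addr + len) lies entirely inside the VMA.
 * The comparison is written as 'len <= vm_end - addr' so that a large len
 * cannot overflow addr + len.
 */
static bool request_within_vma(const struct vma_range *vma,
			       unsigned long addr, size_t len)
{
	assert(vma->vm_start < vma->vm_end);

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return false;
	return len <= vma->vm_end - addr;
}
```

A request that fails this check would be left to the mmap_lock path in full.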

Cheers, Lorenzo



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-17  6:19   ` Suren Baghdasaryan
  2026-06-19 12:24     ` Lorenzo Stoakes
@ 2026-06-19 13:46     ` Rik van Riel
  2026-06-19 14:03       ` Suren Baghdasaryan
  1 sibling, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2026-06-19 13:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
> 
> I don't think we should be using per-VMA locks if the read spans
> multiple VMAs. Doing that would risk a possibility of reading
> inconsistent data since we are locking one VMA at a time. While we
> load and read VMA, its neighboring VMA can be unmapped and another
> one
> can be mapped in its place. 

How is that different from userspace overwriting
data while we are reading it, and us reading half
new, and half old contents?

We already have nothing synchronizing the contents
today.

All our locking does is protect the metadata we
use to find the memory.

-- 
All Rights Reversed.



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-19 13:46     ` Rik van Riel
@ 2026-06-19 14:03       ` Suren Baghdasaryan
  2026-06-19 14:33         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2026-06-19 14:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka

On Fri, Jun 19, 2026 at 6:47 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
> >
> > I don't think we should be using per-VMA locks if the read spans
> > multiple VMAs. Doing that would risk a possibility of reading
> > inconsistent data since we are locking one VMA at a time. While we
> > load and read VMA, its neighboring VMA can be unmapped and another
> > one
> > can be mapped in its place.
>
> How is that different from userspace overwriting
> data while we are reading it, and us reading half
> new, and half old contents?
>
> We already have nothing synchronizing the contents
> today.
>
> All our locking does is protect the metadata we
> use to find the memory.

Current locking ensures that when reading across VMAs, the structure
of the VMA tree stays consistent during that read. I think that
*structural consistency* of the area we are reading is important to
preserve. Someone can be overwriting the data while we are reading it,
but that's *content consistency* and yes, we don't have protection
against that both before and after your change.

>
> --
> All Rights Reversed.



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-19 14:03       ` Suren Baghdasaryan
@ 2026-06-19 14:33         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-19 14:33 UTC (permalink / raw)
  To: Suren Baghdasaryan, Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka

On 6/19/26 16:03, Suren Baghdasaryan wrote:
> On Fri, Jun 19, 2026 at 6:47 AM Rik van Riel <riel@surriel.com> wrote:
>>
>> On Tue, 2026-06-16 at 23:19 -0700, Suren Baghdasaryan wrote:
>>>
>>> I don't think we should be using per-VMA locks if the read spans
>>> multiple VMAs. Doing that would risk a possibility of reading
>>> inconsistent data since we are locking one VMA at a time. While we
>>> load and read VMA, its neighboring VMA can be unmapped and another
>>> one
>>> can be mapped in its place.
>>
>> How is that different from userspace overwriting
>> data while we are reading it, and us reading half
>> new, and half old contents?
>>
>> We already have nothing synchronizing the contents
>> today.
>>
>> All our locking does is protect the metadata we
>> use to find the memory.
> 
> Current locking ensures that when reading across VMAs, the structure
> of the VMA tree stays consistent during that read. I think that
> *structural consistency* of the area we are reading is important to
> preserve.

I tend to agree: if more than one VMA is involved, it's best to fall back to the
mmap lock for now.

-- 
Cheers,

David



* [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
@ 2026-06-25  1:50 Rik van Riel
  2026-06-25  1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team

Sometimes processes can get stuck with the mmap_lock held for
a long time. This slows down, and can even prevent system monitoring
tools from assessing and logging the situation, because they themselves
end up getting stuck on the mmap_lock.

However, with the introduction of per-VMA locks, we can improve the
reliability of system monitoring, and generally speed up __access_remote_vm
under mmap_lock contention, by adding a fast path that does not require
the process-wide mmap_lock.

This fast path is only compiled in and used when it is safe to do so,
meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
that is not hugetlbfs, VM_IO, VM_PFNMAP, etc.

v2:
 - simplify the code, which should be ok because these copies are < PAGE_SIZE
 - clean up the code
 - fix locking wrt tlb_remove_table_sync_one()
 - hopefully address all the other comments


* [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask
  2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
@ 2026-06-25  1:50 ` Rik van Riel
  2026-06-25  1:50 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team

mm->context.untag_mask is written once, when LAM is enabled
(mm_enable_lam(), under mmap_write_lock and while the process is still
single-threaded), and is otherwise stable and never reverted.
untagged_addr_remote() reads it for a remote mm, and the new
untagged_addr_remote_unlocked() (used by the per-VMA-lock
access_remote_vm() fast path) reads it without the mmap lock.

The field is a single aligned word and cannot tear, but annotate the
reads and writes with READ_ONCE()/WRITE_ONCE() to make the lockless
access explicit and keep the compiler from reloading or tearing it.
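
As a rough userspace illustration of the annotation (not the kernel
implementation): the READ_ONCE()/WRITE_ONCE() definitions below are simplified
volatile-cast approximations, and struct mm_model and its helpers are
hypothetical names. The mask value mirrors ~GENMASK(62, 57), which clears the
six LAM_U57 tag bits and leaves bit 63 alone.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified approximations of the kernel macros: one volatile access. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Bits 62..57 carry the LAM_U57 tag; the mask is the complement. */
#define LAM_U57_TAG_BITS	(UINT64_C(0x3f) << 57)

/* Hypothetical model of the one field of interest in mm->context. */
struct mm_model {
	uint64_t untag_mask;	/* all ones until LAM is enabled */
};

static void mm_model_reset(struct mm_model *mm)
{
	WRITE_ONCE(mm->untag_mask, UINT64_MAX);
}

static void mm_model_enable_lam(struct mm_model *mm)
{
	/* Model assumption: the mask is only ever set from its reset value. */
	assert(READ_ONCE(mm->untag_mask) == UINT64_MAX);
	WRITE_ONCE(mm->untag_mask, ~LAM_U57_TAG_BITS);
}

/* Lockless reader: a single load of the mask, then a plain AND. */
static uint64_t untag_addr(struct mm_model *mm, uint64_t addr)
{
	return addr & READ_ONCE(mm->untag_mask);
}
```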

No functional change.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/mmu_context.h | 6 +++---
 arch/x86/include/asm/uaccess_64.h  | 2 +-
 arch/x86/kernel/process_64.c       | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ef5b507de34e..cee710f64658 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,18 +100,18 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
 static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
-	mm->context.untag_mask = oldmm->context.untag_mask;
+	WRITE_ONCE(mm->context.untag_mask, READ_ONCE(oldmm->context.untag_mask));
 }
 
 #define mm_untag_mask mm_untag_mask
 static inline unsigned long mm_untag_mask(struct mm_struct *mm)
 {
-	return mm->context.untag_mask;
+	return READ_ONCE(mm->context.untag_mask);
 }
 
 static inline void mm_reset_untag_mask(struct mm_struct *mm)
 {
-	mm->context.untag_mask = -1UL;
+	WRITE_ONCE(mm->context.untag_mask, -1UL);
 }
 
 #define arch_pgtable_dma_compat arch_pgtable_dma_compat
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 20de34cc9aa6..4a52497ba6a1 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -43,7 +43,7 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & (mm)->context.untag_mask;
+	return addr & READ_ONCE((mm)->context.untag_mask);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index d44afbe005bb..55096136de53 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -814,7 +814,7 @@ static void enable_lam_func(void *__mm)
 static void mm_enable_lam(struct mm_struct *mm)
 {
 	mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
-	mm->context.untag_mask =  ~GENMASK(62, 57);
+	WRITE_ONCE(mm->context.untag_mask, ~GENMASK(62, 57));
 
 	/*
 	 * Even though the process must still be single-threaded at this
-- 
2.53.0-Meta




* [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
  2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-25  1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
@ 2026-06-25  1:50 ` Rik van Riel
  2026-06-25  7:34   ` Lorenzo Stoakes
  2026-06-25  1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
  2026-06-25  6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
  3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team

folio_walk_start() asserts the mmap lock is held.  For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer.  Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.

The per-VMA lock does not, however, keep the page tables walked below that
VMA from being freed.  A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold.  Page table freeing
synchronizes against lockless walkers the way gup_fast relies on:
tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
interrupts, so a walker that keeps interrupts disabled across the walk
cannot be observing a table that is about to be freed.  rcu_read_lock() is
not sufficient -- it does not block that IPI -- so the caller must keep
interrupts disabled, not merely hold an RCU read-side critical section.

Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
per-VMA lock and that interrupts are disabled, instead of asserting the
mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
cover).  The caller must keep interrupts disabled until folio_walk_end().

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk under the per-VMA lock instead of the mmap
+		 * lock. The VMA lock keeps the VMA stable, but the page tables
+		 * walked below it can still be freed concurrently: a munmap() or
+		 * THP collapse of an adjacent region in the same mm can free a
+		 * shared upper-level table, and collapse_huge_page() ->
+		 * retract_page_tables() frees page tables of VMAs whose lock it
+		 * does not hold. Page table freeing serializes against lockless
+		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+		 * for every CPU to enable interrupts; an RCU read-side critical
+		 * section does not block that IPI, so the caller must keep
+		 * interrupts disabled across the whole walk, like gup_fast.
+		 * Hugetlb (PMD sharing) maps page tables not covered by this
+		 * VMA's lock and is not supported.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		lockdep_assert_irqs_disabled();
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta




* [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
  2026-06-25  1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
  2026-06-25  1:50 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
@ 2026-06-25  1:50 ` Rik van Riel
  2026-06-25  7:39   ` Lorenzo Stoakes
  2026-06-25  6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
  3 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2026-06-25  1:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Rik van Riel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Suren Baghdasaryan, kernel-team

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common case
of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path.  It takes the per-VMA lock with
lock_vma_under_rcu() and, only when the whole request lies within that one
VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
to grab a short-lived page reference from a page table walk run with
interrupts disabled.  Interrupts are disabled only across the walk (until
the folio is pinned): page table freeing -- a concurrent munmap() or THP
collapse of an adjacent region -- serializes against lockless walkers via
tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
interrupts, the same contract gup_fast relies on.  The copy then runs with
interrupts on, holding only the folio reference.

A request that spans more than one VMA is left entirely to the mmap_lock
path: relocking per VMA could observe a structurally inconsistent address
space (a neighbouring VMA unmapped and a different one mapped in its place
between locks), whereas the mmap_lock path sees a stable VMA tree for the
whole transfer.

The per-VMA permission check mirrors the read side of check_vma_flags(),
including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
on (CVE-2018-1120).  Anything not positively allowed -- a not-present
page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
VMA writer -- falls back to the mmap_lock path for the remainder, which
re-validates everything.  Pages read on the fast path are marked accessed,
matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
path.

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/uaccess_64.h |  14 ++-
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 195 +++++++++++++++++++++++++++++-
 3 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..933b0b8b4d60 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
 	(__force __typeof__(addr))__untagged_addr(__addr);		\
 })
 
+/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE(mm->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & READ_ONCE((mm)->context.untag_mask);
+	return __untagged_addr_remote_unlocked(mm, addr);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..d2b2f0014a0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Read-side VMA checks for the lockless fast path, mirroring the read side of
+ * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
+ * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
+ * copy (secretmem); enforce the FOLL_ANON restriction that
+ * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
+ * (honoring FOLL_FORCE).  Anything not positively allowed falls back to the slow
+ * path, which re-validates everything.
+ */
+static bool vma_permits_fast_access(struct vm_area_struct *vma,
+				    unsigned int gup_flags)
+{
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return false;
+	if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
+		return false;
+	if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
+		return false;
+	if (!(vma->vm_flags & VM_READ) &&
+	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
+		return false;
+	return true;
+}
+
+/* Size of the single mapping entry folio_walk_start() landed on. */
+static unsigned long fw_entry_size(enum folio_walk_level level)
+{
+	switch (level) {
+	case FW_LEVEL_PUD:
+		return PUD_SIZE;
+	case FW_LEVEL_PMD:
+		return PMD_SIZE;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+/*
+ * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
+ * @folio_off within the folio (the position of @addr).  Maps and copies one
+ * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
+ * the per-page flush on aliasing caches -- without re-walking page tables.
+ * Each page borrows the caller's single folio reference, so the mapping is
+ * dropped with kunmap_local() rather than folio_release_kmap().
+ */
+static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
+			     unsigned long folio_off, unsigned long addr,
+			     void *buf, unsigned long len)
+{
+	unsigned long done = 0;
+
+	while (done < len) {
+		unsigned long pos = folio_off + done;
+		unsigned long page_idx = pos >> PAGE_SHIFT;
+		unsigned int page_off = pos & ~PAGE_MASK;
+		unsigned int chunk = min_t(unsigned long, len - done,
+					   PAGE_SIZE - page_off);
+		void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
+
+		copy_from_user_page(vma, folio_page(folio, page_idx),
+				    addr + done, buf + done, kaddr + page_off,
+				    chunk);
+		kunmap_local(kaddr);
+		done += chunk;
+	}
+}
+
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the frequently
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
+ * table walk run with interrupts disabled, which serializes against concurrent
+ * page table freeing the same way gup_fast does (relying on
+ * MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Only a request that lies entirely within a single VMA is handled here,
+ * which should not be an issue in practice since every caller has a
+ * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
+ * should be rare, too.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+	struct vm_area_struct *vma;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	vma = lock_vma_under_rcu(mm, addr);
+	if (!vma)
+		return 0;
+
+	/* Only handle a request contained entirely within this one VMA. */
+	if (len > vma->vm_end - addr)
+		goto out_unlock;
+
+	if (!vma_permits_fast_access(vma, gup_flags))
+		goto out_unlock;
+
+	while (len) {
+		struct folio_walk fw;
+		struct folio *folio;
+		struct page *page;
+		unsigned long entry_size, folio_off, span, irq_flags;
+
+		/*
+		 * The lockless page table walk must run with interrupts
+		 * disabled: page table freeing (munmap or THP collapse, which
+		 * IPI via tlb_remove_table_sync_one() and wait) then cannot free
+		 * a table mid-walk -- the same contract gup_fast relies on.  IRQs
+		 * are restored once the folio is pinned; the copy below holds only
+		 * the folio reference.
+		 */
+		local_irq_save(irq_flags);
+		folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+		if (!folio) {
+			local_irq_restore(irq_flags);
+			goto out_unlock;	/* not present: let the slow path fault it in */
+		}
+		page = fw.page;
+		if (!page) {
+			/* No struct page to copy (e.g. a special PTE). */
+			folio_walk_end(&fw, vma);
+			local_irq_restore(irq_flags);
+			goto out_unlock;
+		}
+		entry_size = fw_entry_size(fw.level);
+		folio_get(folio);
+		folio_walk_end(&fw, vma);
+		local_irq_restore(irq_flags);
+
+		/*
+		 * folio_walk_start() validated one present mapping entry
+		 * (PAGE/PMD/PUD_SIZE).  Copy to the end of that entry, bounded by
+		 * the folio and the remaining length (already within the VMA), so
+		 * a huge mapping is handled in a single walk.
+		 */
+		folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+			    offset_in_page(addr);
+		span = min3((unsigned long)len,
+			    entry_size - (addr & (entry_size - 1)),
+			    (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
+
+		copy_folio_pages(vma, folio, folio_off, addr, buf, span);
+
+		/* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
+		folio_mark_accessed(folio);
+		folio_put(folio);
+		len -= span;
+		buf += span;
+		addr += span;
+	}
+
+out_unlock:
+	vma_end_read(vma);
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
 
 	/* Avoid triggering the temporary warning in __get_user_pages */
 	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
-		return 0;
+		return buf - old_buf;
 
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
-- 
2.53.0-Meta




* Re: [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
                   ` (2 preceding siblings ...)
  2026-06-25  1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-25  6:32 ` David Hildenbrand (Arm)
  2026-06-25  7:47   ` Lorenzo Stoakes
  3 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-25  6:32 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: x86, linux-mm, Thomas Gleixner, Ingo Molnar, Dmitry Ilvokhin,
	Borislav Petkov, Dave Hansen, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan, kernel-team

On 6/25/26 03:50, Rik van Riel wrote:
> Sometimes processes can get stuck with the mmap_lock held for
> a long time. This slows down, and can even prevent system monitoring
> tools from assessing and logging the situation, because they themselves
> end up getting stuck on the mmap_lock.
> 
> However, with the introduction of per-VMA locks, we can improve the
> reliability of system monitoring, and generally speed up __access_remote_vm
> under mmap_lock contention, by adding a fast path that does not require
> the process-wide mmap_lock.
> 
> This fast path is only compiled in and used when it is safe to do so,
> meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
> that is not hugetlbfs, VM_IO, VM_PFNMAP, etc.
> 
> v2:
>  - simplify the code, which should be ok because these copies are < PAGE_SIZE
>  - clean up the code
>  - fix locking wrt tlb_remove_table_sync_one()
>  - hopefully address all the other comments

You mean, ignoring my comments about not reimplementing GUP entirely?

NAK

-- 
Cheers,

David



* Re: [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
  2026-06-25  1:50 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
@ 2026-06-25  7:34   ` Lorenzo Stoakes
  0 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25  7:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team

Rik, it really would have helped if you'd replied to review :)

On Wed, Jun 24, 2026 at 09:50:52PM -0400, Rik van Riel wrote:
> folio_walk_start() asserts the mmap lock is held.  For callers that only
> need to read a single, already-present page, the mmap lock is a heavy and
> often badly contended hammer.  Such a caller can instead hold the per-VMA
> lock, which keeps the VMA itself stable.

<newline>

> The per-VMA lock does not, however, keep the page tables walked below that
> VMA from being freed.  A concurrent munmap() or THP collapse of an
> adjacent region in the same mm can free a shared upper-level table, and

Yeah, I need to update the documentation on this at
https://docs.kernel.org/mm/process_addrs.html; it's more subtle than written
there.

Firstly you're wrong about munmap() - it acquires the VMA lock of the VMAs freed
in the range and will only remove an upper level table if the entire range is
spanned.

And that's the only way higher level tables can be removed.

PTE page tables can be removed via MADV_DONTNEED, but that a. acquires the VMA
lock and b. frees the PTE page table under RCU.

A THP collapse can happen concurrently, but PTEs are freed under RCU so you
don't need to do this GUP fast imitating stuff.

> THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
> tables of VMAs whose lock it does not hold.  Page table freeing

retract_page_tables() -> pte_free_defer() -> RCU
try_collapse_pte_mapped_thp() -> pte_free_defer() -> RCU

> synchronizes against lockless walkers the way gup_fast relies on:
> tlb_remove_table_sync_one() sends an IPI and waits for every CPU to enable
> interrupts, so a walker that keeps interrupts disabled across the walk
> cannot be observing a table that is about to be freed.  rcu_read_lock() is
> not sufficient -- it does not block that IPI -- so the caller must keep

Yes it is?

I mean unless I'm missing something here.

> interrupts disabled, not merely hold an RCU read-side critical section.
>
> Add an FW_VMA_LOCKED flag.  When passed, folio_walk_start() asserts the
> per-VMA lock and that interrupts are disabled, instead of asserting the
> mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
> hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
> cover).  The caller must keep interrupts disabled until folio_walk_end().
>
> No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  include/linux/pagewalk.h |  7 +++++++
>  mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
>  2 files changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index b41d7265c01b..d0387470d732 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
>
>  /* Walk shared zeropages (small + huge) as well. */
>  #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
> +/*
> + * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
> + * disabled across the walk (until folio_walk_end()) to serialize against page
> + * table freeing, the same way gup_fast does. Only valid with RCU-freed page
> + * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
> + */
> +#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
>
>  enum folio_walk_level {
>  	FW_LEVEL_PTE,
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..ab1e81983cb8 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>   * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
>   * not correspond to the first physical entry of a logical hugetlb entry.
>   *
> - * The mmap lock must be held in read mode.
> + * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
> + * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
> + * across the walk and until folio_walk_end() (only supported with RCU-freed page
> + * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
>   *
>   * Return: folio pointer on success, otherwise NULL.
>   */
> @@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  	pgd_t *pgdp;
>  	p4d_t *p4dp;
>
> -	mmap_assert_locked(vma->vm_mm);
> +	if (flags & FW_VMA_LOCKED) {
> +		/*
> +		 * Lockless walk under the per-VMA lock instead of the mmap
> +		 * lock. The VMA lock keeps the VMA stable, but the page tables
> +		 * walked below it can still be freed concurrently: a munmap() or
> +		 * THP collapse of an adjacent region in the same mm can free a
> +		 * shared upper-level table, and collapse_huge_page() ->
> +		 * retract_page_tables() frees page tables of VMAs whose lock it
> +		 * does not hold. Page table freeing serializes against lockless
> +		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
> +		 * for every CPU to enable interrupts; an RCU read-side critical
> +		 * section does not block that IPI, so the caller must keep
> +		 * interrupts disabled across the whole walk, like gup_fast.
> +		 * Hugetlb (PMD sharing) maps page tables not covered by this
> +		 * VMA's lock and is not supported.
> +		 */

This is an unreadable wall of text, if it's AI generated please edit before
sending.

> +		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
> +		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
> +		lockdep_assert_irqs_disabled();
> +		vma_assert_locked(vma);
> +	} else {
> +		mmap_assert_locked(vma->vm_mm);
> +	}
>  	vma_pgtable_walk_begin(vma);
>
>  	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
> --
> 2.53.0-Meta
>

Thanks, Lorenzo



* Re: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
  2026-06-25  1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
@ 2026-06-25  7:39   ` Lorenzo Stoakes
  0 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25  7:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, linux-mm, Thomas Gleixner, Ingo Molnar,
	Dmitry Ilvokhin, Borislav Petkov, Dave Hansen, Andrew Morton,
	David Hildenbrand, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team

On Wed, Jun 24, 2026 at 09:50:53PM -0400, Rik van Riel wrote:
> __access_remote_vm() takes mmap_read_lock() for the entire transfer and
> uses get_user_pages_remote(), which faults pages in.  For the common case
> of reading memory that is already resident -- /proc/PID/cmdline,
> /proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
> unnecessary and is badly contended on large machines.
>
> Add an opportunistic, read-only fast path.  It takes the per-VMA lock with
> lock_vma_under_rcu() and, only when the whole request lies within that one
> VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
> to grab a short-lived page reference from a page table walk run with
> interrupts disabled.  Interrupts are disabled only across the walk (until
> the folio is pinned): page table freeing -- a concurrent munmap() or THP
> collapse of an adjacent region -- serializes against lockless walkers via
> tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
> interrupts, the same contract gup_fast relies on.  The copy then runs with
> interrupts on, holding only the folio reference.

As I said in reply to 2/3 I don't think you need to do this.

Let's have a discussion _in the review_ about it :)

>
> A request that spans more than one VMA is left entirely to the mmap_lock
> path: relocking per VMA could observe a structurally inconsistent address
> space (a neighbouring VMA unmapped and a different one mapped in its place
> between locks), whereas the mmap_lock path sees a stable VMA tree for the
> whole transfer.
>
> The per-VMA permission check mirrors the read side of check_vma_flags(),
> including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
> on (CVE-2018-1120).  Anything not positively allowed -- a not-present
> page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
> VMA writer -- falls back to the mmap_lock path for the remainder, which
> re-validates everything.  Pages read on the fast path are marked accessed,
> matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
> path.
>
> untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
> for the fast path; the untag mask is a stable per-mm value.
>
> Only reads are handled here; writes keep using the slow path.
>
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  arch/x86/include/asm/uaccess_64.h |  14 ++-
>  include/linux/uaccess.h           |  11 ++
>  mm/memory.c                       | 195 +++++++++++++++++++++++++++++-
>  3 files changed, 217 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 4a52497ba6a1..933b0b8b4d60 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
>  	(__force __typeof__(addr))__untagged_addr(__addr);		\
>  })
>
> +/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
> +static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
> +							    unsigned long addr)
> +{
> +	return addr & READ_ONCE(mm->context.untag_mask);
> +}
> +
> +#define untagged_addr_remote_unlocked(mm, addr)	({			\
> +	unsigned long __addr = (__force unsigned long)(addr);		\
> +	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
> +})
> +
>  static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
>  						   unsigned long addr)
>  {
>  	mmap_assert_locked(mm);
> -	return addr & READ_ONCE((mm)->context.untag_mask);
> +	return __untagged_addr_remote_unlocked(mm, addr);
>  }
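
A minimal userspace sketch of what the untag helpers compute, assuming a
mask shaped like x86 LAM_U57 (tag in bits 62:57). The names model_untag_mask
and model_untag_addr are hypothetical and only illustrate the masking; the
real mask is the per-mm mm->context.untag_mask read with READ_ONCE().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask layout for illustration: clear tag bits 62:57, keep the rest. */
static uint64_t model_untag_mask(void)
{
	return ~(UINT64_C(0x3f) << 57);
}

/* Strip the tag bits; bit 63 and the low address bits pass through. */
static uint64_t model_untag_addr(uint64_t mask, uint64_t addr)
{
	return addr & mask;
}
```

Since this is a pure function of the mask and the address, the only thing
the unlocked variant has to worry about is reading the mask consistently.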
>
>  #define untagged_addr_remote(mm, addr)	({				\
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 8a264662b242..c8c83372c9d8 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -34,6 +34,17 @@
>  })
>  #endif
>
> +/*
> + * Like untagged_addr_remote(), but for callers that stabilize @mm by other
> + * means (e.g. a per-VMA lock) and must not assert the mmap lock.
> + */
> +#ifndef untagged_addr_remote_unlocked
> +#define untagged_addr_remote_unlocked(mm, addr)	({	\
> +	(void)(mm);					\
> +	untagged_addr(addr);				\
> +})
> +#endif
> +
>  #ifdef masked_user_access_begin
>   #define can_do_masked_user_access() 1
>  # ifndef masked_user_write_access_begin
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..d2b2f0014a0c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,8 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/secretmem.h>
> +#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>  EXPORT_SYMBOL_GPL(generic_access_phys);
>  #endif
>
> +/*
> + * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
> + * lock and RCU-freed page tables to walk page tables without the mmap lock.
> + */
> +#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
> +/*
> + * Read-side VMA checks for the lockless fast path, mirroring the read side of
> + * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
> + * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
> + * copy (secretmem); enforce the FOLL_ANON restriction that
> + * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
> + * (honoring FOLL_FORCE).  Anything not positively allowed falls back to the slow
> + * path, which re-validates everything.
> + */

No wall of text please :)

Same comments for the rest, please use paragraphs, try to be succinct, etc.

> +static bool vma_permits_fast_access(struct vm_area_struct *vma,
> +				    unsigned int gup_flags)
> +{
> +	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +		return false;
> +	if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
> +		return false;
> +	if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
> +		return false;
> +	if (!(vma->vm_flags & VM_READ) &&
> +	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
> +		return false;
> +	return true;
> +}
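
The decision table above can be exercised in userspace. This is a sketch
only: the X_* flag values and struct model_vma are hypothetical stand-ins
for the kernel's vm_flags, gup_flags and VMA predicates, chosen just to
mirror the order of the checks in the hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for illustration; the kernel's differ. */
enum { X_VM_READ = 1u << 0, X_VM_MAYREAD = 1u << 1,
       X_VM_IO = 1u << 2, X_VM_PFNMAP = 1u << 3 };
enum { X_FOLL_FORCE = 1u << 0, X_FOLL_ANON = 1u << 1 };

/* Stand-in for the VMA properties the quoted helper inspects. */
struct model_vma {
	unsigned int flags;
	bool hugetlb;
	bool secretmem;
	bool anonymous;
};

static bool model_permits_fast_access(const struct model_vma *vma,
				      unsigned int gup_flags)
{
	/* Mappings that need ->access() go to the slow path. */
	if (vma->flags & (X_VM_IO | X_VM_PFNMAP))
		return false;
	if (vma->hugetlb || vma->secretmem)
		return false;
	/* FOLL_ANON callers may only read anonymous memory. */
	if ((gup_flags & X_FOLL_ANON) && !vma->anonymous)
		return false;
	/* Without VM_READ, only FOLL_FORCE plus VM_MAYREAD is enough. */
	if (!(vma->flags & X_VM_READ) &&
	    (!(gup_flags & X_FOLL_FORCE) || !(vma->flags & X_VM_MAYREAD)))
		return false;
	return true;
}
```

A file-backed VMA is accepted for a plain read but rejected once the
caller passes the FOLL_ANON stand-in, which is the /proc/PID/cmdline case.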
> +
> +/* Size of the single mapping entry folio_walk_start() landed on. */
> +static unsigned long fw_entry_size(enum folio_walk_level level)
> +{
> +	switch (level) {
> +	case FW_LEVEL_PUD:
> +		return PUD_SIZE;
> +	case FW_LEVEL_PMD:
> +		return PMD_SIZE;
> +	default:
> +		return PAGE_SIZE;
> +	}
> +}
> +
> +/*
> + * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
> + * @folio_off within the folio (the position of @addr).  Maps and copies one
> + * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
> + * the per-page flush on aliasing caches -- without re-walking page tables.
> + * Each page borrows the caller's single folio reference, so the mapping is
> + * dropped with kunmap_local() rather than folio_release_kmap().
> + */
> +static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
> +			     unsigned long folio_off, unsigned long addr,
> +			     void *buf, unsigned long len)
> +{
> +	unsigned long done = 0;
> +
> +	while (done < len) {
> +		unsigned long pos = folio_off + done;
> +		unsigned long page_idx = pos >> PAGE_SHIFT;
> +		unsigned int page_off = pos & ~PAGE_MASK;
> +		unsigned int chunk = min_t(unsigned long, len - done,
> +					   PAGE_SIZE - page_off);
> +		void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
> +
> +		copy_from_user_page(vma, folio_page(folio, page_idx),
> +				    addr + done, buf + done, kaddr + page_off,
> +				    chunk);
> +		kunmap_local(kaddr);
> +		done += chunk;
> +	}
> +}
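
The per-page chunking in the loop above can be checked with a userspace
model. This sketch assumes a 4096-byte page and treats the folio as one
contiguous buffer; model_copy_folio_pages and MODEL_PAGE_SIZE are
hypothetical names, and memcpy() stands in for the kmap_local_folio() plus
copy_from_user_page() pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/*
 * Copy len bytes starting at folio_off, never letting a single copy cross
 * a page boundary. Returns how many per-page chunks were needed.
 */
static unsigned int model_copy_folio_pages(const unsigned char *folio_mem,
					   unsigned long folio_off,
					   unsigned char *buf,
					   unsigned long len)
{
	unsigned long done = 0;
	unsigned int chunks = 0;

	while (done < len) {
		unsigned long pos = folio_off + done;
		unsigned long page_off = pos % MODEL_PAGE_SIZE;
		unsigned long chunk = len - done;

		/* Clamp the chunk to the end of the current page. */
		if (chunk > MODEL_PAGE_SIZE - page_off)
			chunk = MODEL_PAGE_SIZE - page_off;
		memcpy(buf + done, folio_mem + pos, chunk);
		done += chunk;
		chunks++;
	}
	return chunks;
}
```

A 200-byte read starting at offset 4000 takes two chunks (96 bytes, then
104), while any read that stays inside one page takes one.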
> +
> +/*
> + * Opportunistic lockless fast path for __access_remote_vm() reads.
> + *
> + * Memory already resident in @mm can be read without taking the frequently
> + * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
> + * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
> + * table walk run with interrupts disabled, which serializes against concurrent
> + * page table freeing the same way gup_fast does (relying on
> + * MMU_GATHER_RCU_TABLE_FREE).
> + *
> + * Only a request that lies entirely within a single VMA is handled here,
> + * which should not be an issue in practice since every caller has a
> + * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
> + * should be rare, too.
> + *
> + * Returns the number of bytes transferred via the fast path.
> + */
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	void *old_buf = buf;
> +	struct vm_area_struct *vma;
> +
> +	addr = untagged_addr_remote_unlocked(mm, addr);
> +
> +	vma = lock_vma_under_rcu(mm, addr);
> +	if (!vma)
> +		return 0;
> +
> +	/* Only handle a request contained entirely within this one VMA. */
> +	if (len > vma->vm_end - addr)
> +		goto out_unlock;
> +
> +	if (!vma_permits_fast_access(vma, gup_flags))
> +		goto out_unlock;
> +
> +	while (len) {
> +		struct folio_walk fw;
> +		struct folio *folio;
> +		struct page *page;
> +		unsigned long entry_size, folio_off, span, irq_flags;
> +
> +		/*
> +		 * The lockless page table walk must run with interrupts
> +		 * disabled: page table freeing (munmap or THP collapse, which
> +		 * IPI via tlb_remove_table_sync_one() and wait) then cannot free
> +		 * a table mid-walk -- the same contract gup_fast relies on.  IRQs
> +		 * are restored once the folio is pinned; the copy below holds only
> +		 * the folio reference.
> +		 */
> +		local_irq_save(irq_flags);
> +		folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
> +		if (!folio) {
> +			local_irq_restore(irq_flags);
> +			goto out_unlock;	/* not present: let the slow path fault it in */
> +		}
> +		page = fw.page;
> +		if (!page) {
> +			/* No struct page to copy (e.g. a special PTE). */
> +			folio_walk_end(&fw, vma);
> +			local_irq_restore(irq_flags);
> +			goto out_unlock;
> +		}
> +		entry_size = fw_entry_size(fw.level);
> +		folio_get(folio);
> +		folio_walk_end(&fw, vma);
> +		local_irq_restore(irq_flags);
> +
> +		/*
> +		 * folio_walk_start() validated one present mapping entry
> +		 * (PAGE/PMD/PUD_SIZE).  Copy to the end of that entry, bounded by
> +		 * the folio and the remaining length (already within the VMA), so
> +		 * a huge mapping is handled in a single walk.
> +		 */
> +		folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
> +			    offset_in_page(addr);
> +		span = min3((unsigned long)len,
> +			    entry_size - (addr & (entry_size - 1)),
> +			    (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
> +
> +		copy_folio_pages(vma, folio, folio_off, addr, buf, span);
> +
> +		/* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
> +		folio_mark_accessed(folio);
> +		folio_put(folio);
> +		len -= span;
> +		buf += span;
> +		addr += span;
> +	}
> +
> +out_unlock:
> +	vma_end_read(vma);
> +	return buf - old_buf;
> +}
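
The span arithmetic in the loop above is easy to get wrong by one entry,
so here is a userspace sketch of just that computation. model_span and
model_min3 are hypothetical names; entry_size is assumed to be a power of
two, as PAGE_SIZE, PMD_SIZE and PUD_SIZE are.

```c
#include <assert.h>

static unsigned long model_min3(unsigned long a, unsigned long b,
				unsigned long c)
{
	unsigned long m = a < b ? a : b;

	return m < c ? m : c;
}

/*
 * Bytes that one walk may copy: bounded by the request length, by the end
 * of the mapping entry the walk landed on, and by the end of the folio.
 */
static unsigned long model_span(unsigned long addr, unsigned long len,
				unsigned long entry_size,
				unsigned long folio_bytes,
				unsigned long folio_off)
{
	return model_min3(len,
			  entry_size - (addr & (entry_size - 1)),
			  folio_bytes - folio_off);
}
```

For a PTE-mapped page, a 64-byte request that starts 16 bytes before the
page end yields a span of 16, so the loop has to walk again for the next
page. For a PMD-mapped 2 MiB folio the same request is served in one walk.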
> +#else
> +static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
> +				 void *buf, int len, unsigned int gup_flags)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
> +
>  /*
>   * Access another process' address space as given in mm.
>   */
> @@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
>  	void *old_buf = buf;
>  	int write = gup_flags & FOLL_WRITE;
>
> +	/*
> +	 * Try the lockless fast path for reads first; it transfers what it can
> +	 * from resident memory without taking mmap_lock, and leaves the
> +	 * remainder (if any) to the slow path below.
> +	 */
> +	if (!write) {
> +		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
> +
> +		addr += done;
> +		buf += done;
> +		len -= done;
> +		if (!len)
> +			return buf - old_buf;
> +	}
> +
>  	if (mmap_read_lock_killable(mm))
> -		return 0;
> +		return buf - old_buf;
>
>  	/* Untag the address before looking up the VMA */
>  	addr = untagged_addr_remote(mm, addr);
>
>  	/* Avoid triggering the temporary warning in __get_user_pages */
>  	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
> -		return 0;
> +		return buf - old_buf;
>
>  	/* ignore errors, just check how much was successfully transferred */
>  	while (len) {
> --
> 2.53.0-Meta
>
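
The change from "return 0" to "return buf - old_buf" matters because the
fast path may already have transferred part of the request. A minimal
userspace sketch of that accounting follows; struct model_mm, the
"resident" cutoff and the slow_ok flag (standing in for
mmap_read_lock_killable() succeeding) are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Model: bytes below 'resident' can be served by the fast path. */
struct model_mm {
	const unsigned char *mem;
	int size;
	int resident;
};

static int model_copy_upto(const struct model_mm *mm, int limit, int addr,
			   unsigned char *buf, int len)
{
	int n = limit - addr;

	if (n <= 0)
		return 0;
	if (n > len)
		n = len;
	memcpy(buf, mm->mem + addr, (size_t)n);
	return n;
}

/* Fast path first, remainder to the slow path, partial progress kept. */
static int model_access(const struct model_mm *mm, int addr,
			unsigned char *buf, int len, int slow_ok)
{
	unsigned char *old_buf = buf;
	int done = model_copy_upto(mm, mm->resident, addr, buf, len);

	addr += done;
	buf += done;
	len -= done;
	if (!len || !slow_ok)
		return (int)(buf - old_buf);	/* not 0: keep fast-path bytes */

	buf += model_copy_upto(mm, mm->size, addr, buf, len);
	return (int)(buf - old_buf);
}
```

With five resident bytes and a six-byte read from offset 2, the model
returns 3 when the slow path is unavailable and 6 when it runs.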

Thanks, Lorenzo



* Re: [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock
  2026-06-25  6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
@ 2026-06-25  7:47   ` Lorenzo Stoakes
  0 siblings, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25  7:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Rik van Riel, linux-kernel, x86, linux-mm, Thomas Gleixner,
	Ingo Molnar, Dmitry Ilvokhin, Borislav Petkov, Dave Hansen,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Suren Baghdasaryan, kernel-team

(Note that it's the merge window. Ideally you should hold off on sending
non-RFC series until after the merge window closes; none of these series will
be taken yet and you'll have to resend anyway.)

On Thu, Jun 25, 2026 at 08:32:43AM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 03:50, Rik van Riel wrote:
> > Sometimes processes can get stuck with the mmap_lock held for
> > a long time. This slows down, and can even prevent system monitoring
> > tools from assessing and logging the situation, because they themselves
> > end up getting stuck on the mmap_lock.
> >
> > However, with the introduction of per-VMA locks, we can improve the
> > reliability of system monitoring, and generally speed up __access_remote_vm
> > under mmap_lock contention, by adding a fast path that does not require
> > the process-wide mmap_lock.
> >
> > This fast path is only compiled in and used when it is safe to do so,
> > meaning a kernel with per-VMA locks and RCU page table freeing, and a VMA
> > that is not hugetlbfs, iomap, pfnmap, etc.
> >
> > v2:
> >  - simplify the code, which should be ok because these copies are < PAGE_SIZE
> >  - clean up the code
> >  - fix locking wrt tlb_remove_table_sync_one()
> >  - hopefully address all the other comments

This is really not sufficient :) you should break out the changes you
make and who suggested them.

If people give their time to review, you should take the time to make it clear
you've dealt with it.

>
> You mean, ignoring my comments about not reimplementing GUP entirely?
>
> NAK

Yeah agreed.

As replied elsewhere, I don't think we need to do that anyway?

But you _really_ need to respond to review inline so we can have those
discussions and ensure that you're moving forward in the correct way.

Just quietly resending with a catch-all comment as per above puts all the load
on us to go and check that you've done what we've asked.

On my side, my review time is very limited now, so triage dictates general
dismissals when the series looks wrong. Engaging in discussion will help us all
move forward efficiently.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo



end of thread, other threads:[~2026-06-25  7:48 UTC | newest]

Thread overview: 18+ messages
2026-06-25  1:50 [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock Rik van Riel
2026-06-25  1:50 ` [PATCH 1/3] x86/mm: use READ_ONCE/WRITE_ONCE for mm->context.untag_mask Rik van Riel
2026-06-25  1:50 ` [PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock Rik van Riel
2026-06-25  7:34   ` Lorenzo Stoakes
2026-06-25  1:50 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-25  7:39   ` Lorenzo Stoakes
2026-06-25  6:32 ` [PATCH v2 0/3] mm: __access_remote_vm with per-VMA lock David Hildenbrand (Arm)
2026-06-25  7:47   ` Lorenzo Stoakes
  -- strict thread matches above, loose matches on Subject: below --
2026-06-16 19:02 [PATCH " Rik van Riel
2026-06-16 19:03 ` [PATCH 3/3] mm: read remote memory without the mmap lock where possible Rik van Riel
2026-06-17  6:19   ` Suren Baghdasaryan
2026-06-19 12:24     ` Lorenzo Stoakes
2026-06-19 13:46     ` Rik van Riel
2026-06-19 14:03       ` Suren Baghdasaryan
2026-06-19 14:33         ` David Hildenbrand (Arm)
2026-06-18 17:01   ` Usama Arif
2026-06-18 17:07     ` David Hildenbrand (Arm)
2026-06-18 17:22       ` Usama Arif
2026-06-19 12:20   ` Lorenzo Stoakes
