[PATCH 3/3] mm: read remote memory without the mmap lock where possible

From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: Rik van Riel <riel@surriel.com>,
	x86@kernel.org, linux-mm@kvack.org,
	"Thomas Gleixner" <tglx@kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Dmitry Ilvokhin" <d@ilvokhin.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"David Hildenbrand" <david@kernel.org>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	kernel-team@meta.com
Subject: [PATCH 3/3] mm: read remote memory without the mmap lock where possible
Date: Wed, 24 Jun 2026 21:50:53 -0400
Message-ID: <20260625015053.2445008-4-riel@surriel.com>
In-Reply-To: <20260625015053.2445008-1-riel@surriel.com>

__access_remote_vm() takes mmap_read_lock() for the entire transfer and
uses get_user_pages_remote(), which faults pages in.  For the common case
of reading memory that is already resident -- /proc/PID/cmdline,
/proc/PID/environ, ptrace PEEK of resident pages -- the mmap lock is
unnecessary and is badly contended on large machines.

Add an opportunistic, read-only fast path.  It takes the per-VMA lock with
lock_vma_under_rcu() and, only when the whole request lies within that one
VMA, copies the resident pages out using folio_walk_start(FW_VMA_LOCKED)
to grab a short-lived page reference from a page table walk run with
interrupts disabled.  Interrupts are disabled only across the walk (until
the folio is pinned): page table freeing -- a concurrent munmap() or THP
collapse of an adjacent region -- serializes against lockless walkers via
tlb_remove_table_sync_one(), which IPIs and waits for every CPU to enable
interrupts, the same contract gup_fast relies on.  The copy then runs with
interrupts on, holding only the folio reference.
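
As an illustration of how much a single walk can copy, the diff bounds each
copy by the remaining length, the end of the mapping entry the walk landed
on, and the end of the folio.  The following is a minimal user-space sketch
of that arithmetic only, not kernel code; model_span(), min_ul() and the
MODEL_* sizes are hypothetical names, and the sizes assume 4 KiB pages with
2 MiB PMD mappings.

```c
#include <assert.h>

/* Assumed sizes for illustration (4 KiB pages, 2 MiB PMD entries). */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PMD_SIZE  (2UL * 1024 * 1024)

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Bytes one walk may copy starting at @addr: limited by the remaining
 * request length, the end of the mapping entry (@entry_size, a power of
 * two), and the end of the folio (@folio_bytes long, with @addr located
 * @folio_off bytes into it).
 */
static unsigned long model_span(unsigned long addr, unsigned long len,
				unsigned long entry_size,
				unsigned long folio_bytes,
				unsigned long folio_off)
{
	unsigned long to_entry_end, to_folio_end;

	assert(entry_size && (entry_size & (entry_size - 1)) == 0);
	assert(folio_off < folio_bytes);

	to_entry_end = entry_size - (addr & (entry_size - 1));
	to_folio_end = folio_bytes - folio_off;

	return min_ul(len, min_ul(to_entry_end, to_folio_end));
}
```

With these assumptions, a 100-byte read starting 16 bytes before the end of
a 4 KiB page is cut to 16 bytes by the first walk, while an 8 KiB read at the
start of a PMD-mapped 2 MiB folio is satisfied by one walk.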

A request that spans more than one VMA is left entirely to the mmap_lock
path: relocking per VMA could observe a structurally inconsistent address
space (a neighbouring VMA unmapped and a different one mapped in its place
between locks), whereas the mmap_lock path sees a stable VMA tree for the
whole transfer.
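
The single-VMA rule comes down to one comparison against the end of the VMA.
A minimal user-space sketch follows, assuming a simplified stand-in for
struct vm_area_struct; struct model_vma and model_request_fits() are
hypothetical names.  In the patch itself the start-address check is implied
by lock_vma_under_rcu(), which only returns a VMA that contains the address.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a VMA: the half-open range [vm_start, vm_end). */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * True if the @len bytes starting at @addr lie entirely inside @vma.
 * Comparing against vm_end - addr avoids overflow in addr + len.
 */
static bool model_request_fits(const struct model_vma *vma,
			       unsigned long addr, unsigned long len)
{
	assert(vma->vm_start < vma->vm_end);

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return false;

	return len <= vma->vm_end - addr;
}
```

A request that ends exactly at vm_end is accepted; one byte more and the
whole request is left to the mmap_lock path.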

The per-VMA permission check mirrors the read side of check_vma_flags(),
including the FOLL_ANON restriction that /proc/PID/{cmdline,environ} rely
on (CVE-2018-1120).  Anything not positively allowed -- a not-present
page, a hugetlb or VM_IO/VM_PFNMAP or secretmem mapping, or a race with a
VMA writer -- falls back to the mmap_lock path for the remainder, which
re-validates everything.  Pages read on the fast path are marked accessed,
matching the FOLL_TOUCH behaviour of the get_user_pages_remote() slow
path.
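
For readers who want to experiment with the decision table, here is a
user-space sketch of the same checks in the order the patch applies them.
The M_* flag values and struct model_perm_vma are assumptions made up for
this sketch; they are not the kernel's VM_* or FOLL_* definitions, and the
boolean fields stand in for is_vm_hugetlb_page(), vma_is_secretmem() and
vma_is_anonymous().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
enum { M_VM_READ = 1 << 0, M_VM_MAYREAD = 1 << 1,
       M_VM_IO = 1 << 2, M_VM_PFNMAP = 1 << 3 };
enum { M_FOLL_FORCE = 1 << 0, M_FOLL_ANON = 1 << 1 };

/* Simplified stand-in for the VMA properties the check looks at. */
struct model_perm_vma {
	unsigned int vm_flags;
	bool anonymous;
	bool hugetlb;
	bool secretmem;
};

/* Returns true only when every check positively allows the fast path. */
static bool model_permits_fast_access(const struct model_perm_vma *vma,
				      unsigned int gup_flags)
{
	assert(vma);

	if (vma->vm_flags & (M_VM_IO | M_VM_PFNMAP))
		return false;	/* needs the ->access() handler */
	if (vma->hugetlb || vma->secretmem)
		return false;	/* not handled by the fast walk */
	if ((gup_flags & M_FOLL_ANON) && !vma->anonymous)
		return false;	/* FOLL_ANON restriction */
	if (!(vma->vm_flags & M_VM_READ) &&
	    (!(gup_flags & M_FOLL_FORCE) || !(vma->vm_flags & M_VM_MAYREAD)))
		return false;	/* unreadable, and FOLL_FORCE does not apply */
	return true;
}
```

In this model a file-backed VMA is rejected under M_FOLL_ANON, and a VMA
without M_VM_READ is accepted only when M_FOLL_FORCE is set and M_VM_MAYREAD
is present.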

untagged_addr_remote() asserts the mmap lock, so add an unlocked variant
for the fast path; the untag mask is a stable per-mm value.

Only reads are handled here; writes keep using the slow path.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 arch/x86/include/asm/uaccess_64.h |  14 ++-
 include/linux/uaccess.h           |  11 ++
 mm/memory.c                       | 195 +++++++++++++++++++++++++++++-
 3 files changed, 217 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4a52497ba6a1..933b0b8b4d60 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -39,11 +39,23 @@ static inline unsigned long __untagged_addr(unsigned long addr)
 	(__force __typeof__(addr))__untagged_addr(__addr);		\
 })
 
+/* Strip the tag bits from a remote mm's address; usable without the mmap lock. */
+static inline unsigned long __untagged_addr_remote_unlocked(struct mm_struct *mm,
+							    unsigned long addr)
+{
+	return addr & READ_ONCE(mm->context.untag_mask);
+}
+
+#define untagged_addr_remote_unlocked(mm, addr)	({			\
+	unsigned long __addr = (__force unsigned long)(addr);		\
+	(__force __typeof__(addr))__untagged_addr_remote_unlocked(mm, __addr); \
+})
+
 static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
 						   unsigned long addr)
 {
 	mmap_assert_locked(mm);
-	return addr & READ_ONCE((mm)->context.untag_mask);
+	return __untagged_addr_remote_unlocked(mm, addr);
 }
 
 #define untagged_addr_remote(mm, addr)	({				\
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 8a264662b242..c8c83372c9d8 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -34,6 +34,17 @@
 })
 #endif
 
+/*
+ * Like untagged_addr_remote(), but for callers that stabilize @mm by other
+ * means (e.g. a per-VMA lock) and must not assert the mmap lock.
+ */
+#ifndef untagged_addr_remote_unlocked
+#define untagged_addr_remote_unlocked(mm, addr)	({	\
+	(void)(mm);					\
+	untagged_addr(addr);				\
+})
+#endif
+
 #ifdef masked_user_access_begin
  #define can_do_masked_user_access() 1
 # ifndef masked_user_write_access_begin
diff --git a/mm/memory.c b/mm/memory.c
index 86a973119bd4..d2b2f0014a0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,8 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
+#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -7062,6 +7064,180 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
+/*
+ * The fast path uses folio_walk_start(FW_VMA_LOCKED), which needs the per-VMA
+ * lock and RCU-freed page tables to walk page tables without the mmap lock.
+ */
+#if defined(CONFIG_PER_VMA_LOCK) && defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
+/*
+ * Read-side VMA checks for the lockless fast path, mirroring the read side of
+ * check_vma_flags(): reject what FW_VMA_LOCKED cannot handle (hugetlb), what
+ * needs the ->access() handler (VM_IO/VM_PFNMAP), or what has no struct page to
+ * copy (secretmem); enforce the FOLL_ANON restriction that
+ * /proc/PID/{cmdline,environ} rely on (CVE-2018-1120); and require read access
+ * (honoring FOLL_FORCE).  Anything not positively allowed falls back to the slow
+ * path, which re-validates everything.
+ */
+static bool vma_permits_fast_access(struct vm_area_struct *vma,
+				    unsigned int gup_flags)
+{
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return false;
+	if (is_vm_hugetlb_page(vma) || vma_is_secretmem(vma))
+		return false;
+	if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
+		return false;
+	if (!(vma->vm_flags & VM_READ) &&
+	    (!(gup_flags & FOLL_FORCE) || !(vma->vm_flags & VM_MAYREAD)))
+		return false;
+	return true;
+}
+
+/* Size of the single mapping entry folio_walk_start() landed on. */
+static unsigned long fw_entry_size(enum folio_walk_level level)
+{
+	switch (level) {
+	case FW_LEVEL_PUD:
+		return PUD_SIZE;
+	case FW_LEVEL_PMD:
+		return PMD_SIZE;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+/*
+ * Copy @len bytes of the pinned @folio out to @buf, starting at byte offset
+ * @folio_off within the folio (the position of @addr).  Maps and copies one
+ * page at a time -- kmap_local_folio() for HIGHMEM, copy_from_user_page() for
+ * the per-page flush on aliasing caches -- without re-walking page tables.
+ * Each page borrows the caller's single folio reference, so the mapping is
+ * dropped with kunmap_local() rather than folio_release_kmap().
+ */
+static void copy_folio_pages(struct vm_area_struct *vma, struct folio *folio,
+			     unsigned long folio_off, unsigned long addr,
+			     void *buf, unsigned long len)
+{
+	unsigned long done = 0;
+
+	while (done < len) {
+		unsigned long pos = folio_off + done;
+		unsigned long page_idx = pos >> PAGE_SHIFT;
+		unsigned int page_off = pos & ~PAGE_MASK;
+		unsigned int chunk = min_t(unsigned long, len - done,
+					   PAGE_SIZE - page_off);
+		void *kaddr = kmap_local_folio(folio, page_idx << PAGE_SHIFT);
+
+		copy_from_user_page(vma, folio_page(folio, page_idx),
+				    addr + done, buf + done, kaddr + page_off,
+				    chunk);
+		kunmap_local(kaddr);
+		done += chunk;
+	}
+}
+
+/*
+ * Opportunistic lockless fast path for __access_remote_vm() reads.
+ *
+ * Memory already resident in @mm can be read without taking the frequently
+ * contended mmap_lock: a per-VMA lock stabilizes the VMA, and folio_walk_start()
+ * with FW_VMA_LOCKED grabs a short-lived reference to a present page from a page
+ * table walk run with interrupts disabled, which serializes against concurrent
+ * page table freeing the same way gup_fast does (relying on
+ * MMU_GATHER_RCU_TABLE_FREE).
+ *
+ * Only a request that lies entirely within a single VMA is handled here,
+ * which should not be an issue in practice since every caller has a
+ * buffer of PAGE_SIZE or smaller. Loop iteration inside this function
+ * should be rare, too.
+ *
+ * Returns the number of bytes transferred via the fast path.
+ */
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	void *old_buf = buf;
+	struct vm_area_struct *vma;
+
+	addr = untagged_addr_remote_unlocked(mm, addr);
+
+	vma = lock_vma_under_rcu(mm, addr);
+	if (!vma)
+		return 0;
+
+	/* Only handle a request contained entirely within this one VMA. */
+	if (len > vma->vm_end - addr)
+		goto out_unlock;
+
+	if (!vma_permits_fast_access(vma, gup_flags))
+		goto out_unlock;
+
+	while (len) {
+		struct folio_walk fw;
+		struct folio *folio;
+		struct page *page;
+		unsigned long entry_size, folio_off, span, irq_flags;
+
+		/*
+		 * The lockless page table walk must run with interrupts
+		 * disabled: page table freeing (munmap or THP collapse, which
+		 * IPI via tlb_remove_table_sync_one() and wait) then cannot free
+		 * a table mid-walk -- the same contract gup_fast relies on.  IRQs
+		 * are restored once the folio is pinned; the copy below holds only
+		 * the folio reference.
+		 */
+		local_irq_save(irq_flags);
+		folio = folio_walk_start(&fw, vma, addr, FW_VMA_LOCKED);
+		if (!folio) {
+			local_irq_restore(irq_flags);
+			goto out_unlock;	/* not present: let the slow path fault it in */
+		}
+		page = fw.page;
+		if (!page) {
+			/* No struct page to copy (e.g. a special PTE). */
+			folio_walk_end(&fw, vma);
+			local_irq_restore(irq_flags);
+			goto out_unlock;
+		}
+		entry_size = fw_entry_size(fw.level);
+		folio_get(folio);
+		folio_walk_end(&fw, vma);
+		local_irq_restore(irq_flags);
+
+		/*
+		 * folio_walk_start() validated one present mapping entry
+		 * (PAGE/PMD/PUD_SIZE).  Copy to the end of that entry, bounded by
+		 * the folio and the remaining length (already within the VMA), so
+		 * a huge mapping is handled in a single walk.
+		 */
+		folio_off = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+			    offset_in_page(addr);
+		span = min3((unsigned long)len,
+			    entry_size - (addr & (entry_size - 1)),
+			    (folio_nr_pages(folio) << PAGE_SHIFT) - folio_off);
+
+		copy_folio_pages(vma, folio, folio_off, addr, buf, span);
+
+		/* Match the FOLL_TOUCH behaviour of the slow (GUP) path. */
+		folio_mark_accessed(folio);
+		folio_put(folio);
+		len -= span;
+		buf += span;
+		addr += span;
+	}
+
+out_unlock:
+	vma_end_read(vma);
+	return buf - old_buf;
+}
+#else
+static int access_remote_vm_fast(struct mm_struct *mm, unsigned long addr,
+				 void *buf, int len, unsigned int gup_flags)
+{
+	return 0;
+}
+#endif /* CONFIG_PER_VMA_LOCK && CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
 /*
  * Access another process' address space as given in mm.
  */
@@ -7071,15 +7247,30 @@ static int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
+	/*
+	 * Try the lockless fast path for reads first; it transfers what it can
+	 * from resident memory without taking mmap_lock, and leaves the
+	 * remainder (if any) to the slow path below.
+	 */
+	if (!write) {
+		int done = access_remote_vm_fast(mm, addr, buf, len, gup_flags);
+
+		addr += done;
+		buf += done;
+		len -= done;
+		if (!len)
+			return buf - old_buf;
+	}
+
 	if (mmap_read_lock_killable(mm))
-		return 0;
+		return buf - old_buf;
 
 	/* Untag the address before looking up the VMA */
 	addr = untagged_addr_remote(mm, addr);
 
 	/* Avoid triggering the temporary warning in __get_user_pages */
 	if (!vma_lookup(mm, addr) && !expand_stack(mm, addr))
-		return 0;
+		return buf - old_buf;
 
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
-- 
2.53.0-Meta
