* [PATCH v3 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
@ 2026-04-28 12:28 Alban Crequy
  2026-04-28 12:28 ` [PATCH v3 1/2] " Alban Crequy
  2026-04-28 12:28 ` [PATCH v3 2/2] selftests/mm: add tests for process_vm_readv flags Alban Crequy
  0 siblings, 2 replies; 8+ messages in thread
From: Alban Crequy @ 2026-04-28 12:28 UTC
  To: Andrew Morton, David Hildenbrand, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Alban Crequy, Alban Crequy, Peter Xu, Willy Tarreau,
	linux-kselftest, shuah, Usama Arif, David Laight

This adds two flags to process_vm_readv/writev:

- PROCESS_VM_PIDFD: refer to the remote process via PID file descriptor
  instead of PID.
- PROCESS_VM_NOWAIT: do not block on IO if the memory access causes a
  page fault.
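
As a rough illustration of how a profiler-style caller might combine the
two flags, here is a minimal sketch. The helper name pvm_sample_read() is
hypothetical and not part of this series, and the flag values are copied
from the UAPI header added in patch 1. It assumes that a kernel without
the flags rejects them with EINVAL, and it falls back to the plain
PID-based blocking call in that case, which a real profiler may or may
not want.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Flag values as defined in this series; guarded in case headers have them. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD  (1UL << 0)
#endif
#ifndef PROCESS_VM_NOWAIT
#define PROCESS_VM_NOWAIT (1UL << 1)
#endif

/*
 * Hypothetical helper: copy len bytes from remote_addr in the target
 * process into buf. If pidfd >= 0, first try PIDFD | NOWAIT; if the
 * kernel rejects the flags with EINVAL, retry with the plain pid and no
 * flags. Returns the process_vm_readv() result (-1 with errno on error).
 */
static ssize_t pvm_sample_read(pid_t pid, int pidfd, const void *remote_addr,
			       void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };
	ssize_t n;

	assert(buf != NULL);

	if (pidfd >= 0) {
		n = process_vm_readv(pidfd, &local, 1, &remote, 1,
				     PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT);
		/* Anything other than "flags rejected" is the final answer. */
		if (n >= 0 || errno != EINVAL)
			return n;
	}

	/* Older kernel, or no pidfd available: blocking read by pid. */
	return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

With PROCESS_VM_NOWAIT, the selftests in patch 2 expect EFAULT when the
page is not resident, so a caller of this sketch would treat
(n == -1 && errno == EFAULT) as "drop this sample" rather than as a hard
error.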

v2: https://lore.kernel.org/lkml/20260408145436.843538-1-alban.crequy@gmail.com/
v1: https://lore.kernel.org/lkml/20251118132348.2415603-1-alban.crequy@gmail.com/
Sashiko review of v2: https://sashiko.dev/#/patchset/20260408145436.843538-1-alban.crequy@gmail.com

Changes since v2:
- Fix ERR_PTR handling for pidfd_get_task(): use IS_ERR()/PTR_ERR()
  for the pidfd path, matching process_madvise() (Usama Arif, Sashiko)
- Add selftest for invalid pidfd (David Hildenbrand)
- Add selftest for invalid pid
- Remove hardcoded __NR_pidfd_open fallback, use <sys/syscall.h> (Sashiko)
- SKIP pidfd tests on kernels without pidfd_open (ENOSYS) (Sashiko)
- SKIP userfaultfd tests when unprivileged userfaultfd is disabled (EPERM) (Sashiko)
- Fault in test_data before NOWAIT tests to ensure page is resident (Sashiko)
- Add ksft_process_vm_readv.sh wrapper and run_vmtests.sh entry
  so the test runs in CI
- Rebase onto v7.1-rc1

Not addressed:
- uffd handler timeout causing test hang: kselftest_harness forks each
  test with a 30-second timeout, so an infinite hang cannot occur (Sashiko)
- 64-bit process reading 32-bit process high addresses: pre-existing
  concern in the existing process_vm_readv code, not introduced by this
  patch (David Laight)

Alban Crequy (2):
  mm/process_vm_access: pidfd and nowait support for
    process_vm_readv/writev
  selftests/mm: add tests for process_vm_readv flags

 MAINTAINERS                                   |   1 +
 include/uapi/linux/process_vm.h               |   9 +
 mm/process_vm_access.c                        |  34 +-
 tools/testing/selftests/mm/Makefile           |   2 +
 .../selftests/mm/ksft_process_vm_readv.sh     |   4 +
 tools/testing/selftests/mm/process_vm_readv.c | 421 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 7 files changed, 466 insertions(+), 9 deletions(-)
 create mode 100644 include/uapi/linux/process_vm.h
 create mode 100755 tools/testing/selftests/mm/ksft_process_vm_readv.sh
 create mode 100644 tools/testing/selftests/mm/process_vm_readv.c

-- 
2.45.0



* [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 12:28 [PATCH v3 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev Alban Crequy
@ 2026-04-28 12:28 ` Alban Crequy
  2026-04-28 20:05   ` David Hildenbrand (Arm)
  2026-04-28 12:28 ` [PATCH v3 2/2] selftests/mm: add tests for process_vm_readv flags Alban Crequy
  1 sibling, 1 reply; 8+ messages in thread
From: Alban Crequy @ 2026-04-28 12:28 UTC
  To: Andrew Morton, David Hildenbrand, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Alban Crequy, Alban Crequy, Peter Xu, Willy Tarreau,
	linux-kselftest, shuah, Usama Arif, David Laight

From: Alban Crequy <albancrequy@microsoft.com>

There are two categories of users for process_vm_readv:

1. Debuggers like GDB or strace.

   When a debugger attempts to read the target memory and triggers a
   page fault, the page fault needs to be resolved so that the debugger
   can accurately interpret the memory. A debugger is typically attached
   to a single process.

2. Profilers like OpenTelemetry eBPF Profiler.

   The profiler uses a perf event to get stack traces from all
   processes at 20 Hz (20 stack traces to resolve per second). For
   interpreted languages (Ruby, Python, etc.), the profiler uses
   process_vm_readv to get the correct symbols. In this case,
   performance matters most. It is fine if some stack traces cannot
   be resolved, as long as the loss is not statistically significant.

The current behaviour of process_vm_readv is to resolve page faults in
the target VM. This is as desired for debuggers, but unwelcome for
profilers because the page fault resolution could take a lot of time
depending on the backing filesystem. Additionally, since profilers
monitor all processes, we don't want a slow page fault resolution for
one target process slowing down the monitoring for all other target
processes.

This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose to
not block on IO if the memory access causes a page fault.

Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
remote process via PID file descriptor instead of PID. Such a file
descriptor can be obtained with pidfd_open(2). This protects against
the PID being reused by an unrelated process. PID reuse is unlikely to
affect debuggers because they can monitor the target process
termination in other ways (ptrace), but the flag can be helpful in
some profiling scenarios.

If a given flag is unsupported, the syscall returns the error EINVAL
without checking the buffers. This gives userspace a way to detect
whether the current kernel supports a specific flag:

  process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
  -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
     (before this patch)
  -> EFAULT if the kernel supports the flag (after this patch)
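
The probe described above could be wrapped roughly as follows. This is
only a sketch: the names pvm_probe_flag(), pvm_classify_errno() and the
enum are hypothetical, and any errno other than EINVAL or EFAULT (for
example ENOSYS, or a seccomp denial) is reported as "unknown" rather
than guessed at.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/* Flag value as defined in this patch; guarded in case headers have it. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD  (1UL << 0)
#endif

/* Hypothetical verdicts for the probe. */
enum pvm_probe { PVM_UNKNOWN = -1, PVM_UNSUPPORTED = 0, PVM_SUPPORTED = 1 };

/* Map the errno of the probe call to a verdict. */
static enum pvm_probe pvm_classify_errno(int err)
{
	switch (err) {
	case EINVAL:
		return PVM_UNSUPPORTED;	/* flag rejected before the buffers */
	case EFAULT:
		return PVM_SUPPORTED;	/* flag accepted, NULL iovec rejected */
	default:
		return PVM_UNKNOWN;	/* e.g. ENOSYS or a sandbox denial */
	}
}

/* Probe a single flag using deliberately invalid (NULL) iovec pointers. */
static enum pvm_probe pvm_probe_flag(unsigned long flag)
{
	assert(flag != 0);	/* probing "no flags" would tell us nothing */

	if (process_vm_readv(getpid(), NULL, 1, NULL, 1, flag) != -1)
		return PVM_UNKNOWN;	/* the probe is not expected to succeed */

	return pvm_classify_errno(errno);
}
```

A caller might run pvm_probe_flag(PROCESS_VM_PIDFD) once at startup and
cache the result instead of probing on every read.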

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
---
v3:
- Fix ERR_PTR handling for pidfd_get_task(): use IS_ERR()/PTR_ERR()
  for the pidfd path, matching process_madvise() (Usama Arif, Sashiko)

v2:
- Expand commit message with use-case motivation (David Hildenbrand)
- Use unsigned long consistently for pvm_flags parameter (David Hildenbrand)
- Add PROCESS_VM_SUPPORTED_FLAGS kernel-internal define (David Hildenbrand)
- Keep (1UL << N) in UAPI header: BIT() is defined in vdso/bits.h
  which is not exported to userspace, so UAPI headers using BIT() would
  break when included from userspace programs (David Hildenbrand)

 MAINTAINERS                     |  1 +
 include/uapi/linux/process_vm.h |  9 +++++++++
 mm/process_vm_access.c          | 34 ++++++++++++++++++++++++---------
 3 files changed, 35 insertions(+), 9 deletions(-)
 create mode 100644 include/uapi/linux/process_vm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb1c75afd16..0f6ce21d6235 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16786,6 +16786,7 @@ F:	include/linux/ptdump.h
 F:	include/linux/vmpressure.h
 F:	include/linux/vmstat.h
 F:	fs/proc/meminfo.c
+F:	include/uapi/linux/process_vm.h
 F:	kernel/fork.c
 F:	mm/Kconfig
 F:	mm/debug.c
diff --git a/include/uapi/linux/process_vm.h b/include/uapi/linux/process_vm.h
new file mode 100644
index 000000000000..4168e09f3f4e
--- /dev/null
+++ b/include/uapi/linux/process_vm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_PROCESS_VM_H
+#define _UAPI_LINUX_PROCESS_VM_H
+
+/* Flags for process_vm_readv/process_vm_writev */
+#define PROCESS_VM_PIDFD        (1UL << 0)
+#define PROCESS_VM_NOWAIT       (1UL << 1)
+
+#endif /* _UAPI_LINUX_PROCESS_VM_H */
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 656d3e88755b..dacef50be0be 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -14,6 +14,9 @@
 #include <linux/ptrace.h>
 #include <linux/slab.h>
 #include <linux/syscalls.h>
+#include <linux/process_vm.h>
+
+#define PROCESS_VM_SUPPORTED_FLAGS (PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)
 
 /**
  * process_vm_rw_pages - read/write pages from task specified
@@ -68,6 +71,7 @@ static int process_vm_rw_pages(struct page **pages,
  * @mm: mm for task
  * @task: task to read/write from
  * @vm_write: 0 means copy from, 1 means copy to
+ * @pvm_flags: PROCESS_VM_* flags
  * Returns 0 on success or on failure error code
  */
 static int process_vm_rw_single_vec(unsigned long addr,
@@ -76,7 +80,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 				    struct page **process_pages,
 				    struct mm_struct *mm,
 				    struct task_struct *task,
-				    int vm_write)
+				    int vm_write,
+				    unsigned long pvm_flags)
 {
 	unsigned long pa = addr & PAGE_MASK;
 	unsigned long start_offset = addr - pa;
@@ -91,6 +96,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 
 	if (vm_write)
 		flags |= FOLL_WRITE;
+	if (pvm_flags & PROCESS_VM_NOWAIT)
+		flags |= FOLL_NOWAIT;
 
 	while (!rc && nr_pages && iov_iter_count(iter)) {
 		int pinned_pages = min_t(unsigned long, nr_pages, PVM_MAX_USER_PAGES);
@@ -141,7 +148,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
  * @iter: where to copy to/from locally
  * @rvec: iovec array specifying where to copy to/from in the other process
  * @riovcnt: size of rvec array
- * @flags: currently unused
+ * @flags: process_vm_readv/writev flags
  * @vm_write: 0 if reading from other process, 1 if writing to other process
  *
  * Returns the number of bytes read/written or error code. May
@@ -163,6 +170,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	unsigned long nr_pages_iov;
 	ssize_t iov_len;
 	size_t total_len = iov_iter_count(iter);
+	unsigned int f_flags;
 
 	/*
 	 * Work out how many pages of struct pages we're going to need
@@ -194,10 +202,18 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	}
 
 	/* Get process information */
-	task = find_get_task_by_vpid(pid);
-	if (!task) {
-		rc = -ESRCH;
-		goto free_proc_pages;
+	if (flags & PROCESS_VM_PIDFD) {
+		task = pidfd_get_task(pid, &f_flags);
+		if (IS_ERR(task)) {
+			rc = PTR_ERR(task);
+			goto free_proc_pages;
+		}
+	} else {
+		task = find_get_task_by_vpid(pid);
+		if (!task) {
+			rc = -ESRCH;
+			goto free_proc_pages;
+		}
 	}
 
 	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
@@ -215,7 +231,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
 	for (i = 0; i < riovcnt && iov_iter_count(iter) && !rc; i++)
 		rc = process_vm_rw_single_vec(
 			(unsigned long)rvec[i].iov_base, rvec[i].iov_len,
-			iter, process_pages, mm, task, vm_write);
+			iter, process_pages, mm, task, vm_write, flags);
 
 	/* copied = space before - space after */
 	total_len -= iov_iter_count(iter);
@@ -244,7 +260,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
  * @liovcnt: size of lvec array
  * @rvec: iovec array specifying where to copy to/from in the other process
  * @riovcnt: size of rvec array
- * @flags: currently unused
+ * @flags: process_vm_readv/writev flags
  * @vm_write: 0 if reading from other process, 1 if writing to other process
  *
  * Returns the number of bytes read/written or error code. May
@@ -266,7 +282,7 @@ static ssize_t process_vm_rw(pid_t pid,
 	ssize_t rc;
 	int dir = vm_write ? ITER_SOURCE : ITER_DEST;
 
-	if (flags != 0)
+	if (flags & ~PROCESS_VM_SUPPORTED_FLAGS)
 		return -EINVAL;
 
 	/* Check iovecs */
-- 
2.45.0



* [PATCH v3 2/2] selftests/mm: add tests for process_vm_readv flags
  2026-04-28 12:28 [PATCH v3 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev Alban Crequy
  2026-04-28 12:28 ` [PATCH v3 1/2] " Alban Crequy
@ 2026-04-28 12:28 ` Alban Crequy
  1 sibling, 0 replies; 8+ messages in thread
From: Alban Crequy @ 2026-04-28 12:28 UTC
  To: Andrew Morton, David Hildenbrand, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Alban Crequy, Alban Crequy, Peter Xu, Willy Tarreau,
	linux-kselftest, shuah, Usama Arif, David Laight

From: Alban Crequy <albancrequy@microsoft.com>

Add selftests for the PROCESS_VM_PIDFD and PROCESS_VM_NOWAIT flags
introduced in process_vm_readv/writev.

Tests cover:
- basic read with no flags
- invalid flags (EINVAL)
- invalid address (EFAULT)
- flag validation precedence over address validation
- invalid pidfd (EBADF)
- invalid pid (ESRCH)
- PROCESS_VM_PIDFD: read via pidfd
- PROCESS_VM_NOWAIT: read from resident memory
- PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT combined
- userfaultfd blocking read (no flags)
- PROCESS_VM_NOWAIT with userfaultfd (non-blocking, returns EFAULT)

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
---
v3:
- Add selftest for invalid pidfd (David Hildenbrand)
- Add selftest for invalid pid
- SKIP on kernels without PROCESS_VM_PIDFD support
- Remove hardcoded __NR_pidfd_open fallback, use <sys/syscall.h> (Sashiko)
- SKIP pidfd tests on kernels without pidfd_open (ENOSYS) (Sashiko)
- SKIP userfaultfd tests when unprivileged userfaultfd is disabled (EPERM) (Sashiko)
- Fault in test_data before NOWAIT tests to ensure page is resident (Sashiko)
- Add ksft_process_vm_readv.sh wrapper and run_vmtests.sh entry

v2:
- New patch.

 tools/testing/selftests/mm/Makefile           |   2 +
 .../selftests/mm/ksft_process_vm_readv.sh     |   4 +
 tools/testing/selftests/mm/process_vm_readv.c | 421 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 4 files changed, 431 insertions(+)
 create mode 100755 tools/testing/selftests/mm/ksft_process_vm_readv.sh
 create mode 100644 tools/testing/selftests/mm/process_vm_readv.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27..feb3a0b9a57e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += process_vm_readv
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
@@ -167,6 +168,7 @@ TEST_PROGS += ksft_pfnmap.sh
 TEST_PROGS += ksft_pkey.sh
 TEST_PROGS += ksft_process_madv.sh
 TEST_PROGS += ksft_process_mrelease.sh
+TEST_PROGS += ksft_process_vm_readv.sh
 TEST_PROGS += ksft_rmap.sh
 TEST_PROGS += ksft_soft_dirty.sh
 TEST_PROGS += ksft_thp.sh
diff --git a/tools/testing/selftests/mm/ksft_process_vm_readv.sh b/tools/testing/selftests/mm/ksft_process_vm_readv.sh
new file mode 100755
index 000000000000..09d0fcc9a35d
--- /dev/null
+++ b/tools/testing/selftests/mm/ksft_process_vm_readv.sh
@@ -0,0 +1,4 @@
+#!/bin/sh -e
+# SPDX-License-Identifier: GPL-2.0
+
+./run_vmtests.sh -t process_vm_readv
diff --git a/tools/testing/selftests/mm/process_vm_readv.c b/tools/testing/selftests/mm/process_vm_readv.c
new file mode 100644
index 000000000000..0479ae424c78
--- /dev/null
+++ b/tools/testing/selftests/mm/process_vm_readv.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/uio.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <linux/userfaultfd.h>
+
+#include "kselftest_harness.h"
+
+#ifndef PROCESS_VM_PIDFD
+#define PROCESS_VM_PIDFD	(1UL << 0)
+#endif
+
+#ifndef PROCESS_VM_NOWAIT
+#define PROCESS_VM_NOWAIT	(1UL << 1)
+#endif
+
+static int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+	return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static const uint8_t test_data[] = { 0x01, 0x02, 0x03, 0x04,
+				     0x05, 0x06, 0x07, 0x08 };
+#define POISON_BYTE 0xCC
+
+/*
+ * Test: basic process_vm_readv with no flags
+ */
+TEST(read_basic)
+{
+	uint8_t buf[sizeof(test_data)];
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+
+	memset(buf, POISON_BYTE, sizeof(buf));
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1, 0);
+	ASSERT_EQ(sizeof(test_data), n);
+	ASSERT_EQ(0, memcmp(buf, test_data, sizeof(test_data)));
+}
+
+/*
+ * Test: invalid flags should return EINVAL
+ */
+TEST(read_invalid_flags)
+{
+	uint8_t buf[8] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1, 255);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(EINVAL, errno);
+}
+
+/*
+ * Test: invalid address should return EFAULT
+ */
+TEST(read_invalid_address)
+{
+	uint8_t buf[8] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = { .iov_base = NULL, .iov_len = 8 };
+	ssize_t n;
+
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1, 0);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(EFAULT, errno);
+}
+
+/*
+ * Test: invalid address with invalid flags should return EINVAL
+ * (flag check happens before address validation)
+ */
+TEST(read_invalid_address_invalid_flags)
+{
+	uint8_t buf[8] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = { .iov_base = NULL, .iov_len = 8 };
+	ssize_t n;
+
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1, 255);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(EINVAL, errno);
+}
+
+/*
+ * Test: invalid address with all valid flags should return EFAULT
+ * (flags are valid so we get past the flag check to the address check)
+ */
+TEST(read_invalid_address_all_valid_flags)
+{
+	int pidfd;
+	struct iovec local_iov = { .iov_base = NULL, .iov_len = 8 };
+	struct iovec remote_iov = { .iov_base = NULL, .iov_len = 8 };
+	ssize_t n;
+
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0 && errno == ENOSYS)
+		SKIP(return, "pidfd_open not supported");
+	ASSERT_GE(pidfd, 0);
+
+	n = process_vm_readv(pidfd, &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(EFAULT, errno);
+
+	close(pidfd);
+}
+
+/*
+ * Test: read with an invalid pidfd should return an error, not crash
+ */
+TEST(read_invalid_pidfd)
+{
+	uint8_t buf[sizeof(test_data)] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+
+	/* fd 9999 is almost certainly not a valid pidfd */
+	n = process_vm_readv(9999, &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_PIDFD);
+	ASSERT_EQ(-1, n);
+	if (errno == EINVAL)
+		SKIP(return, "PROCESS_VM_PIDFD not supported");
+	ASSERT_EQ(EBADF, errno);
+}
+
+/*
+ * Test: read with an invalid pid should return ESRCH
+ */
+TEST(read_invalid_pid)
+{
+	uint8_t buf[sizeof(test_data)] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+
+	/* pid 999999 is almost certainly not a valid process */
+	n = process_vm_readv(999999, &local_iov, 1, &remote_iov, 1, 0);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(ESRCH, errno);
+}
+
+/*
+ * Test: read with PIDFD flag
+ */
+TEST(read_pidfd)
+{
+	uint8_t buf[sizeof(test_data)];
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+	int pidfd;
+
+	memset(buf, POISON_BYTE, sizeof(buf));
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0 && errno == ENOSYS)
+		SKIP(return, "pidfd_open not supported");
+	ASSERT_GE(pidfd, 0);
+
+	n = process_vm_readv(pidfd, &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_PIDFD);
+	ASSERT_EQ(sizeof(test_data), n);
+	ASSERT_EQ(0, memcmp(buf, test_data, sizeof(test_data)));
+
+	close(pidfd);
+}
+
+/*
+ * Test: read with NOWAIT from resident memory (should succeed)
+ */
+TEST(read_nowait_resident)
+{
+	uint8_t buf[sizeof(test_data)];
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+
+	*(volatile uint64_t *)test_data; /* fault in page for NOWAIT */
+	memset(buf, POISON_BYTE, sizeof(buf));
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_NOWAIT);
+	ASSERT_EQ(sizeof(test_data), n);
+	ASSERT_EQ(0, memcmp(buf, test_data, sizeof(test_data)));
+}
+
+/*
+ * Test: read with PIDFD + NOWAIT from resident memory
+ */
+TEST(read_pidfd_nowait_resident)
+{
+	uint8_t buf[sizeof(test_data)];
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov = {
+		.iov_base = (void *)test_data,
+		.iov_len = sizeof(test_data)
+	};
+	ssize_t n;
+	int pidfd;
+
+	*(volatile uint64_t *)test_data; /* fault in page for NOWAIT */
+	memset(buf, POISON_BYTE, sizeof(buf));
+	pidfd = sys_pidfd_open(getpid(), 0);
+	if (pidfd < 0 && errno == ENOSYS)
+		SKIP(return, "pidfd_open not supported");
+	ASSERT_GE(pidfd, 0);
+
+	n = process_vm_readv(pidfd, &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT);
+	ASSERT_EQ(sizeof(test_data), n);
+	ASSERT_EQ(0, memcmp(buf, test_data, sizeof(test_data)));
+
+	close(pidfd);
+}
+
+/*
+ * Userfaultfd helpers for NOWAIT tests
+ */
+static int setup_userfaultfd(void)
+{
+	struct uffdio_api api = { .api = UFFD_API };
+	int uffd;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0)
+		return -errno;
+
+	if (ioctl(uffd, UFFDIO_API, &api)) {
+		close(uffd);
+		return -errno;
+	}
+
+	return uffd;
+}
+
+static void *register_uffd_region(int uffd, size_t size)
+{
+	struct uffdio_register reg;
+	void *mem;
+
+	mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
+		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (mem == MAP_FAILED)
+		return NULL;
+
+	reg.range.start = (unsigned long)mem;
+	reg.range.len = size;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
+		munmap(mem, size);
+		return NULL;
+	}
+
+	return mem;
+}
+
+struct uffd_handler_args {
+	int uffd;
+	const void *content;
+	size_t content_len;
+};
+
+static void *uffd_handler_thread(void *arg)
+{
+	struct uffd_handler_args *ha = arg;
+	struct uffd_msg msg;
+	struct uffdio_copy uffd_copy;
+	struct pollfd pfd = {
+		.fd = ha->uffd,
+		.events = POLLIN
+	};
+	void *page;
+	long page_size = sysconf(_SC_PAGESIZE);
+	int ret;
+
+	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (page == MAP_FAILED)
+		return (void *)(long)-ENOMEM;
+
+	memcpy(page, ha->content, ha->content_len);
+
+	ret = poll(&pfd, 1, 5000);
+	if (ret <= 0)
+		goto out;
+
+	if (read(ha->uffd, &msg, sizeof(msg)) != sizeof(msg))
+		goto out;
+
+	if (msg.event != UFFD_EVENT_PAGEFAULT)
+		goto out;
+
+	uffd_copy.dst = msg.arg.pagefault.address & ~(page_size - 1);
+	uffd_copy.src = (unsigned long)page;
+	uffd_copy.len = page_size;
+	uffd_copy.mode = 0;
+	ioctl(ha->uffd, UFFDIO_COPY, &uffd_copy);
+
+out:
+	munmap(page, page_size);
+	return NULL;
+}
+
+/*
+ * Test: read from userfaultfd-registered memory (no flags, should block
+ * until page fault is resolved by handler thread)
+ */
+TEST(read_userfaultfd_blocking)
+{
+	int uffd;
+	void *mem;
+	long page_size = sysconf(_SC_PAGESIZE);
+	uint8_t buf[sizeof(test_data)];
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov;
+	struct uffd_handler_args ha;
+	pthread_t handler;
+	ssize_t n;
+
+	memset(buf, POISON_BYTE, sizeof(buf));
+
+	uffd = setup_userfaultfd();
+	if (uffd == -EPERM)
+		SKIP(return, "userfaultfd requires privileges (vm.unprivileged_userfaultfd=0)");
+	if (uffd == -ENOSYS)
+		SKIP(return, "userfaultfd not supported");
+	ASSERT_GE(uffd, 0);
+
+	mem = register_uffd_region(uffd, page_size);
+	ASSERT_NE(NULL, mem);
+
+	ha.uffd = uffd;
+	ha.content = test_data;
+	ha.content_len = sizeof(test_data);
+	ASSERT_EQ(0, pthread_create(&handler, NULL, uffd_handler_thread, &ha));
+
+	remote_iov.iov_base = mem;
+	remote_iov.iov_len = sizeof(test_data);
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1, 0);
+	ASSERT_EQ(sizeof(test_data), n);
+	ASSERT_EQ(0, memcmp(buf, test_data, sizeof(test_data)));
+
+	pthread_join(handler, NULL);
+	munmap(mem, page_size);
+	close(uffd);
+}
+
+/*
+ * Test: read with NOWAIT from userfaultfd-registered memory that has
+ * not been faulted in yet. Should return EFAULT (not block).
+ */
+TEST(read_nowait_userfaultfd)
+{
+	int uffd;
+	void *mem;
+	long page_size = sysconf(_SC_PAGESIZE);
+	uint8_t buf[sizeof(test_data)] = { 0 };
+	struct iovec local_iov = { .iov_base = buf, .iov_len = sizeof(buf) };
+	struct iovec remote_iov;
+	ssize_t n;
+
+	uffd = setup_userfaultfd();
+	if (uffd == -EPERM)
+		SKIP(return, "userfaultfd requires privileges (vm.unprivileged_userfaultfd=0)");
+	if (uffd == -ENOSYS)
+		SKIP(return, "userfaultfd not supported");
+	ASSERT_GE(uffd, 0);
+
+	mem = register_uffd_region(uffd, page_size);
+	ASSERT_NE(NULL, mem);
+
+	/* Ensure the page is not present */
+	madvise(mem, page_size, MADV_DONTNEED);
+
+	remote_iov.iov_base = mem;
+	remote_iov.iov_len = sizeof(test_data);
+	n = process_vm_readv(getpid(), &local_iov, 1, &remote_iov, 1,
+			     PROCESS_VM_NOWAIT);
+	ASSERT_EQ(-1, n);
+	ASSERT_EQ(EFAULT, errno);
+
+	munmap(mem, page_size);
+	close(uffd);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index d8468451b3a3..7d30f6101088 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -91,6 +91,8 @@ separated by spaces:
 	test VMA merge cases behave as expected
 - rmap
 	test rmap behaves as expected
+- process_vm_readv
+	test process_vm_readv flags (pidfd, nowait)
 - memory-failure
 	test memory-failure behaves as expected
 
@@ -531,6 +533,8 @@ CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned
 
 CATEGORY="rmap" run_test ./rmap
 
+CATEGORY="process_vm_readv" run_test ./process_vm_readv
+
 # Try to load hwpoison_inject if not present.
 HWPOISON_DIR=/sys/kernel/debug/hwpoison/
 if [ ! -d "$HWPOISON_DIR" ]; then
-- 
2.45.0



* Re: [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 12:28 ` [PATCH v3 1/2] " Alban Crequy
@ 2026-04-28 20:05   ` David Hildenbrand (Arm)
  2026-04-29  6:41     ` Christian Brauner
                       ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28 20:05 UTC
  To: Alban Crequy, Andrew Morton, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Alban Crequy, Peter Xu, Willy Tarreau, linux-kselftest, shuah,
	Usama Arif, David Laight

On 4/28/26 14:28, Alban Crequy wrote:
> From: Alban Crequy <albancrequy@microsoft.com>

Hi,

Some more minor comments. Overall, LGTM.

> 
> There are two categories of users for process_vm_readv:
> 
> 1. Debuggers like GDB or strace.
> 
>    When a debugger attempts to read the target memory and triggers a
>    page fault, the page fault needs to be resolved so that the debugger
>    can accurately interpret the memory. A debugger is typically attached
>    to a single process.
> 
> 2. Profilers like OpenTelemetry eBPF Profiler.
> 
>    The profiler uses a perf event to get stack traces from all
>    processes at 20Hz (20 stack traces to resolve per second). For
>    interpreted languages (Ruby, Python, etc.), the profiler uses
>    process_vm_readv to get the correct symbols. In this case,
>    performance is the most important. It is fine if some stack traces
>    cannot be resolved as long as it is not statistically significant.
> 
> The current behaviour of process_vm_readv is to resolve page faults in
> the target VM. This is as desired for debuggers, but unwelcome for
> profilers because the page fault resolution could take a lot of time
> depending on the backing filesystem. Additionally, since profilers
> monitor all processes, we don't want a slow page fault resolution for
> one target process slowing down the monitoring for all other target
> processes.
> 
> This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose to
> not block on IO if the memory access causes a page fault.

What is the expected return value to user space if we run into this case?

And in the same context: Will you send a man page update? :)

> 
> Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
> remote process via PID file descriptor instead of PID. Such a file
> descriptor can be obtained with pidfd_open(2). This is useful to avoid
> the pid number being reused. It is unlikely to happen for debuggers
> because they can monitor the target process termination in other ways
> (ptrace), but can be helpful in some profiling scenarios.
> 
> If a given flag is unsupported, the syscall returns the error EINVAL
> without checking the buffers. This gives a way to userspace to detect
> whether the current kernel supports a specific flag:
> 
>   process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
>   -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
>      (before this patch)
>   -> EFAULT if the kernel supports the flag (after this patch)
> 
> Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
> ---
> v3:
> - Fix ERR_PTR handling for pidfd_get_task(): use IS_ERR()/PTR_ERR()
>   for the pidfd path, matching process_madvise() (Usama Arif, Sashiko)
> 
> v2:
> - Expand commit message with use-case motivation (David Hildenbrand)
> - Use unsigned long consistently for pvm_flags parameter (David Hildenbrand)
> - Add PROCESS_VM_SUPPORTED_FLAGS kernel-internal define (David Hildenbrand)
> - Keep (1UL << N) in UAPI header: BIT() is defined in vdso/bits.h
>   which is not exported to userspace, so UAPI headers using BIT() would
>   break when included from userspace programs (David Hildenbrand)
> 
>  MAINTAINERS                     |  1 +
>  include/uapi/linux/process_vm.h |  9 +++++++++
>  mm/process_vm_access.c          | 34 ++++++++++++++++++++++++---------
>  3 files changed, 35 insertions(+), 9 deletions(-)
>  create mode 100644 include/uapi/linux/process_vm.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb1c75afd16..0f6ce21d6235 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16786,6 +16786,7 @@ F:	include/linux/ptdump.h
>  F:	include/linux/vmpressure.h
>  F:	include/linux/vmstat.h
>  F:	fs/proc/meminfo.c
> +F:	include/uapi/linux/process_vm.h

We try to keep these entries sorted alphabetically, though we have not
always succeeded. This one should probably just go one line up.

>  F:	kernel/fork.c
>  F:	mm/Kconfig
>  F:	mm/debug.c
> diff --git a/include/uapi/linux/process_vm.h b/include/uapi/linux/process_vm.h
> new file mode 100644
> index 000000000000..4168e09f3f4e
> --- /dev/null
> +++ b/include/uapi/linux/process_vm.h

Thinking out loud: the c file is called "process_vm_access.c", should we name
the header like that as well?

> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_PROCESS_VM_H
> +#define _UAPI_LINUX_PROCESS_VM_H
> +
> +/* Flags for process_vm_readv/process_vm_writev */
> +#define PROCESS_VM_PIDFD        (1UL << 0)
> +#define PROCESS_VM_NOWAIT       (1UL << 1)

Should we use BIT here? I see some usage in other uapi headers (e.g., tcp.h)

> +
> +#endif /* _UAPI_LINUX_PROCESS_VM_H */
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 656d3e88755b..dacef50be0be 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -14,6 +14,9 @@
>  #include <linux/ptrace.h>
>  #include <linux/slab.h>
>  #include <linux/syscalls.h>
> +#include <linux/process_vm.h>
> +
> +#define PROCESS_VM_SUPPORTED_FLAGS (PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)
>  
>  /**
>   * process_vm_rw_pages - read/write pages from task specified
> @@ -68,6 +71,7 @@ static int process_vm_rw_pages(struct page **pages,
>   * @mm: mm for task
>   * @task: task to read/write from
>   * @vm_write: 0 means copy from, 1 means copy to
> + * @pvm_flags: PROCESS_VM_* flags
>   * Returns 0 on success or on failure error code
>   */
>  static int process_vm_rw_single_vec(unsigned long addr,
> @@ -76,7 +80,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  				    struct page **process_pages,
>  				    struct mm_struct *mm,
>  				    struct task_struct *task,
> -				    int vm_write)
> +				    int vm_write,
> +				    unsigned long pvm_flags)
>  {
>  	unsigned long pa = addr & PAGE_MASK;
>  	unsigned long start_offset = addr - pa;
> @@ -91,6 +96,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  
>  	if (vm_write)
>  		flags |= FOLL_WRITE;
> +	if (pvm_flags & PROCESS_VM_NOWAIT)
> +		flags |= FOLL_NOWAIT;
>  
>  	while (!rc && nr_pages && iov_iter_count(iter)) {
>  		int pinned_pages = min_t(unsigned long, nr_pages, PVM_MAX_USER_PAGES);
> @@ -141,7 +148,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>   * @iter: where to copy to/from locally
>   * @rvec: iovec array specifying where to copy to/from in the other process
>   * @riovcnt: size of rvec array
> - * @flags: currently unused
> + * @flags: process_vm_readv/writev flags
>   * @vm_write: 0 if reading from other process, 1 if writing to other process
>   *
>   * Returns the number of bytes read/written or error code. May
> @@ -163,6 +170,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
>  	unsigned long nr_pages_iov;
>  	ssize_t iov_len;
>  	size_t total_len = iov_iter_count(iter);
> +	unsigned int f_flags;
>  
>  	/*
>  	 * Work out how many pages of struct pages we're going to need
> @@ -194,10 +202,18 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
>  	}
>  
>  	/* Get process information */
> -	task = find_get_task_by_vpid(pid);
> -	if (!task) {
> -		rc = -ESRCH;
> -		goto free_proc_pages;
> +	if (flags & PROCESS_VM_PIDFD) {
> +		task = pidfd_get_task(pid, &f_flags);
> +		if (IS_ERR(task)) {
> +			rc = PTR_ERR(task);

This could return -EBADF or -ESRCH. We should document both in the man page. (or
decide to always return -ESRCH, dunno)



-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 20:05   ` David Hildenbrand (Arm)
@ 2026-04-29  6:41     ` Christian Brauner
  2026-04-29  6:41     ` Mike Rapoport
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2026-04-29  6:41 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Alban Crequy, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-kernel, linux-mm, Alban Crequy, Peter Xu, Willy Tarreau,
	linux-kselftest, shuah, Usama Arif, David Laight

On Tue, Apr 28, 2026 at 10:05:49PM +0200, David Hildenbrand (Arm) wrote:
> On 4/28/26 14:28, Alban Crequy wrote:
> > From: Alban Crequy <albancrequy@microsoft.com>
> 
> Hi,
> 
> some more smaller comments. Overall, LGTM.
> 
> > 
> > There are two categories of users for process_vm_readv:
> > 
> > 1. Debuggers like GDB or strace.
> > 
> >    When a debugger attempts to read the target memory and triggers a
> >    page fault, the page fault needs to be resolved so that the debugger
> >    can accurately interpret the memory. A debugger is typically attached
> >    to a single process.
> > 
> > 2. Profilers like OpenTelemetry eBPF Profiler.
> > 
> >    The profiler uses a perf event to get stack traces from all
> >    processes at 20Hz (20 stack traces to resolve per second). For
> >    interpreted languages (Ruby, Python, etc.), the profiler uses
> >    process_vm_readv to get the correct symbols. In this case,
> >    performance is the most important. It is fine if some stack traces
> >    cannot be resolved as long as it is not statistically significant.
> > 
> > The current behaviour of process_vm_readv is to resolve page faults in
> > the target VM. This is as desired for debuggers, but unwelcome for
> > profilers because the page fault resolution could take a lot of time
> > depending on the backing filesystem. Additionally, since profilers
> > monitor all processes, we don't want a slow page fault resolution for
> > one target process slowing down the monitoring for all other target
> > processes.
> > 
> > This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose to
> > not block on IO if the memory access causes a page fault.
> 
> What is the expected return value to user space if we run into this case?
> 
> And in the same context: Will you send a man page update? :)
> 
> > 
> > Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
> > remote process via PID file descriptor instead of PID. Such a file
> > descriptor can be obtained with pidfd_open(2). This is useful to avoid
> > the pid number being reused. It is unlikely to happen for debuggers
> > because they can monitor the target process termination in other ways
> > (ptrace), but can be helpful in some profiling scenarios.
> > 
> > If a given flag is unsupported, the syscall returns the error EINVAL
> > without checking the buffers. This gives a way to userspace to detect
> > whether the current kernel supports a specific flag:
> > 
> >   process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
> >   -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
> >      (before this patch)
> >   -> EFAULT if the kernel supports the flag (after this patch)
> > 
> > Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
> > ---
> > v3:
> > - Fix ERR_PTR handling for pidfd_get_task(): use IS_ERR()/PTR_ERR()
> >   for the pidfd path, matching process_madvise() (Usama Arif, Sashiko)
> > 
> > v2:
> > - Expand commit message with use-case motivation (David Hildenbrand)
> > - Use unsigned long consistently for pvm_flags parameter (David Hildenbrand)
> > - Add PROCESS_VM_SUPPORTED_FLAGS kernel-internal define (David Hildenbrand)
> > - Keep (1UL << N) in UAPI header: BIT() is defined in vdso/bits.h
> >   which is not exported to userspace, so UAPI headers using BIT() would
> >   break when included from userspace programs (David Hildenbrand)
> > 
> >  MAINTAINERS                     |  1 +
> >  include/uapi/linux/process_vm.h |  9 +++++++++
> >  mm/process_vm_access.c          | 34 ++++++++++++++++++++++++---------
> >  3 files changed, 35 insertions(+), 9 deletions(-)
> >  create mode 100644 include/uapi/linux/process_vm.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2fb1c75afd16..0f6ce21d6235 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16786,6 +16786,7 @@ F:	include/linux/ptdump.h
> >  F:	include/linux/vmpressure.h
> >  F:	include/linux/vmstat.h
> >  F:	fs/proc/meminfo.c
> > +F:	include/uapi/linux/process_vm.h
> 
> We try to sort this alphabetically. Sometimes we failed. Likely this should just
> go to one more line up.
> 
> >  F:	kernel/fork.c
> >  F:	mm/Kconfig
> >  F:	mm/debug.c
> > diff --git a/include/uapi/linux/process_vm.h b/include/uapi/linux/process_vm.h
> > new file mode 100644
> > index 000000000000..4168e09f3f4e
> > --- /dev/null
> > +++ b/include/uapi/linux/process_vm.h
> 
> Thinking out loud: the c file is called "process_vm_access.c", should we name
> the header like that as well?
> 
> > @@ -0,0 +1,9 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_PROCESS_VM_H
> > +#define _UAPI_LINUX_PROCESS_VM_H
> > +
> > +/* Flags for process_vm_readv/process_vm_writev */
> > +#define PROCESS_VM_PIDFD        (1UL << 0)
> > +#define PROCESS_VM_NOWAIT       (1UL << 1)
> 
> Should we use BIT here? I see some usage in other uapi headers (e.g., tcp.h)
> 
> > +
> > +#endif /* _UAPI_LINUX_PROCESS_VM_H */
> > diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> > index 656d3e88755b..dacef50be0be 100644
> > --- a/mm/process_vm_access.c
> > +++ b/mm/process_vm_access.c
> > @@ -14,6 +14,9 @@
> >  #include <linux/ptrace.h>
> >  #include <linux/slab.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/process_vm.h>
> > +
> > +#define PROCESS_VM_SUPPORTED_FLAGS (PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)
> >  
> >  /**
> >   * process_vm_rw_pages - read/write pages from task specified
> > @@ -68,6 +71,7 @@ static int process_vm_rw_pages(struct page **pages,
> >   * @mm: mm for task
> >   * @task: task to read/write from
> >   * @vm_write: 0 means copy from, 1 means copy to
> > + * @pvm_flags: PROCESS_VM_* flags
> >   * Returns 0 on success or on failure error code
> >   */
> >  static int process_vm_rw_single_vec(unsigned long addr,
> > @@ -76,7 +80,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >  				    struct page **process_pages,
> >  				    struct mm_struct *mm,
> >  				    struct task_struct *task,
> > -				    int vm_write)
> > +				    int vm_write,
> > +				    unsigned long pvm_flags)
> >  {
> >  	unsigned long pa = addr & PAGE_MASK;
> >  	unsigned long start_offset = addr - pa;
> > @@ -91,6 +96,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >  
> >  	if (vm_write)
> >  		flags |= FOLL_WRITE;
> > +	if (pvm_flags & PROCESS_VM_NOWAIT)
> > +		flags |= FOLL_NOWAIT;
> >  
> >  	while (!rc && nr_pages && iov_iter_count(iter)) {
> >  		int pinned_pages = min_t(unsigned long, nr_pages, PVM_MAX_USER_PAGES);
> > @@ -141,7 +148,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >   * @iter: where to copy to/from locally
> >   * @rvec: iovec array specifying where to copy to/from in the other process
> >   * @riovcnt: size of rvec array
> > - * @flags: currently unused
> > + * @flags: process_vm_readv/writev flags
> >   * @vm_write: 0 if reading from other process, 1 if writing to other process
> >   *
> >   * Returns the number of bytes read/written or error code. May
> > @@ -163,6 +170,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> >  	unsigned long nr_pages_iov;
> >  	ssize_t iov_len;
> >  	size_t total_len = iov_iter_count(iter);
> > +	unsigned int f_flags;
> >  
> >  	/*
> >  	 * Work out how many pages of struct pages we're going to need
> > @@ -194,10 +202,18 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> >  	}
> >  
> >  	/* Get process information */
> > -	task = find_get_task_by_vpid(pid);
> > -	if (!task) {
> > -		rc = -ESRCH;
> > -		goto free_proc_pages;
> > +	if (flags & PROCESS_VM_PIDFD) {
> > +		task = pidfd_get_task(pid, &f_flags);
> > +		if (IS_ERR(task)) {
> > +			rc = PTR_ERR(task);
> 
> This could return -EBADF or -ESRCH. We should document both in the man page. (or
> decide to always return -ESRCH, dunno)

No, please don't. Let's not start overwriting errnos that are actually
useful information for userspace.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 20:05   ` David Hildenbrand (Arm)
  2026-04-29  6:41     ` Christian Brauner
@ 2026-04-29  6:41     ` Mike Rapoport
  2026-05-14  9:15     ` Alban Crequy
  2026-05-14 14:34     ` Alban Crequy
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2026-04-29  6:41 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Alban Crequy, Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, linux-kernel, linux-mm, Alban Crequy, Peter Xu,
	Willy Tarreau, linux-kselftest, shuah, Usama Arif, David Laight

On Tue, Apr 28, 2026 at 10:05:49PM +0200, David Hildenbrand (Arm) wrote:
> On 4/28/26 14:28, Alban Crequy wrote:
> > From: Alban Crequy <albancrequy@microsoft.com>
> > @@ -194,10 +202,18 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> >  	}
> >  
> >  	/* Get process information */
> > -	task = find_get_task_by_vpid(pid);
> > -	if (!task) {
> > -		rc = -ESRCH;
> > -		goto free_proc_pages;
> > +	if (flags & PROCESS_VM_PIDFD) {
> > +		task = pidfd_get_task(pid, &f_flags);
> > +		if (IS_ERR(task)) {
> > +			rc = PTR_ERR(task);
> 
> This could return -EBADF or -ESRCH. We should document both in the man page. (or
> decide to always return -ESRCH, dunno)

I'm for documenting both in the man page to let userspace see what went
wrong.

> -- 
> Cheers,
> David

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 20:05   ` David Hildenbrand (Arm)
  2026-04-29  6:41     ` Christian Brauner
  2026-04-29  6:41     ` Mike Rapoport
@ 2026-05-14  9:15     ` Alban Crequy
  2026-05-14 14:34     ` Alban Crequy
  3 siblings, 0 replies; 8+ messages in thread
From: Alban Crequy @ 2026-05-14  9:15 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Christian Brauner, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Alban Crequy, Peter Xu, Willy Tarreau, linux-kselftest, shuah,
	Usama Arif, David Laight

Hi David,

Thanks for the review. I'm addressing all your comments in v4.

On Tue, 28 Apr 2026 at 22:06, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 4/28/26 14:28, Alban Crequy wrote:
> > From: Alban Crequy <albancrequy@microsoft.com>
>
> Hi,
>
> some more smaller comments. Overall, LGTM.
>
> >
> > There are two categories of users for process_vm_readv:
> >
> > 1. Debuggers like GDB or strace.
> >
> >    When a debugger attempts to read the target memory and triggers a
> >    page fault, the page fault needs to be resolved so that the debugger
> >    can accurately interpret the memory. A debugger is typically attached
> >    to a single process.
> >
> > 2. Profilers like OpenTelemetry eBPF Profiler.
> >
> >    The profiler uses a perf event to get stack traces from all
> >    processes at 20Hz (20 stack traces to resolve per second). For
> >    interpreted languages (Ruby, Python, etc.), the profiler uses
> >    process_vm_readv to get the correct symbols. In this case,
> >    performance is the most important. It is fine if some stack traces
> >    cannot be resolved as long as it is not statistically significant.
> >
> > The current behaviour of process_vm_readv is to resolve page faults in
> > the target VM. This is as desired for debuggers, but unwelcome for
> > profilers because the page fault resolution could take a lot of time
> > depending on the backing filesystem. Additionally, since profilers
> > monitor all processes, we don't want a slow page fault resolution for
> > one target process slowing down the monitoring for all other target
> > processes.
> >
> > This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose to
> > not block on IO if the memory access causes a page fault.
>
> What is the expected return value to user space if we run into this case?

If the first page is not resident and PROCESS_VM_NOWAIT is set, it returns -1
with errno EFAULT.

It can also return a short read, meaning it returns the number of bytes
successfully transferred from pages prior to the fault.

This is the same partial-read behavior as a regular
process_vm_readv when hitting an unmapped page — NOWAIT just changes
when the fault fails (immediately instead of blocking).

I've documented this in the updated commit message in v4 (I will send it
shortly). I also added two selftests that verify partial reads across a
resident and a non-resident page (with both single iovec and two iovecs).

Interestingly, while writing these tests, I found that the current
process_vm_readv(2) man page incorrectly claims that partial transfers
apply at iovec-element granularity. In practice, the kernel returns
partial reads at page granularity within a single remote iovec element.
I've sent a separate fix for that to linux-man@:
https://lore.kernel.org/linux-man/20260514083659.139971-1-alban.crequy@gmail.com/T/#u

> And in the same context: Will you send a man page update? :)

I've included a suggested man page update in the commit message of v4 patch
1/2, ready for the man-pages maintainer to pick up. I'll also send a separate
patch to linux-man@ once the kernel patches are merged and available in a new
Linux release.

> > Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
> > remote process via PID file descriptor instead of PID. Such a file
> > descriptor can be obtained with pidfd_open(2). This is useful to avoid
> > the pid number being reused. It is unlikely to happen for debuggers
> > because they can monitor the target process termination in other ways
> > (ptrace), but can be helpful in some profiling scenarios.
> >
> > If a given flag is unsupported, the syscall returns the error EINVAL
> > without checking the buffers. This gives a way to userspace to detect
> > whether the current kernel supports a specific flag:
> >
> >   process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
> >   -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
> >      (before this patch)
> >   -> EFAULT if the kernel supports the flag (after this patch)
> >
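The probe quoted above can be sketched as a small helper that interprets the errno left by the call. This is only an illustration: the flag value is copied from the header proposed in this patch, and `pvm_probe_result` and `pvm_interpret_probe()` are hypothetical names, not part of any kernel or libc API. The helper takes the return value and errno as arguments, so it builds without the new headers.

```c
#include <errno.h>

/* Value taken from the UAPI header proposed in this patch. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD (1UL << 0)
#endif

/* Hypothetical result type, for illustration only. */
enum pvm_probe_result {
	PVM_FLAG_SUPPORTED,
	PVM_FLAG_UNSUPPORTED,
	PVM_FLAG_UNKNOWN,
};

/*
 * Interpret the outcome of the probe call
 *   process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
 * where ret is the return value and err is errno right after the call.
 */
static enum pvm_probe_result pvm_interpret_probe(long ret, int err)
{
	if (ret != -1)
		return PVM_FLAG_UNKNOWN;	/* the probe is not expected to succeed */
	if (err == EINVAL)
		return PVM_FLAG_UNSUPPORTED;	/* flag rejected before the buffers are checked */
	if (err == EFAULT)
		return PVM_FLAG_SUPPORTED;	/* flag accepted, the NULL iovec then faults */
	return PVM_FLAG_UNKNOWN;		/* any other errno is left to the caller */
}
```

A caller would run the probe once at startup and cache the result. Treating every other errno as "unknown" is a conservative choice made for this sketch.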
> > Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
> > ---
> > v3:
> > - Fix ERR_PTR handling for pidfd_get_task(): use IS_ERR()/PTR_ERR()
> >   for the pidfd path, matching process_madvise() (Usama Arif, Sashiko)
> >
> > v2:
> > - Expand commit message with use-case motivation (David Hildenbrand)
> > - Use unsigned long consistently for pvm_flags parameter (David Hildenbrand)
> > - Add PROCESS_VM_SUPPORTED_FLAGS kernel-internal define (David Hildenbrand)
> > - Keep (1UL << N) in UAPI header: BIT() is defined in vdso/bits.h
> >   which is not exported to userspace, so UAPI headers using BIT() would
> >   break when included from userspace programs (David Hildenbrand)
> >
> >  MAINTAINERS                     |  1 +
> >  include/uapi/linux/process_vm.h |  9 +++++++++
> >  mm/process_vm_access.c          | 34 ++++++++++++++++++++++++---------
> >  3 files changed, 35 insertions(+), 9 deletions(-)
> >  create mode 100644 include/uapi/linux/process_vm.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2fb1c75afd16..0f6ce21d6235 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16786,6 +16786,7 @@ F:    include/linux/ptdump.h
> >  F:   include/linux/vmpressure.h
> >  F:   include/linux/vmstat.h
> >  F:   fs/proc/meminfo.c
> > +F:   include/uapi/linux/process_vm.h
>
> We try to sort this alphabetically. Sometimes we failed. Likely this should just
> go to one more line up.

Fixed in v4 — moved the include/uapi line to the correct position.

> >  F:   kernel/fork.c
> >  F:   mm/Kconfig
> >  F:   mm/debug.c
> > diff --git a/include/uapi/linux/process_vm.h b/include/uapi/linux/process_vm.h
> > new file mode 100644
> > index 000000000000..4168e09f3f4e
> > --- /dev/null
> > +++ b/include/uapi/linux/process_vm.h
>
> Thinking out loud: the c file is called "process_vm_access.c", should we name
> the header like that as well?

Good idea — renamed to include/uapi/linux/process_vm_access.h in v4.

> > @@ -0,0 +1,9 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_PROCESS_VM_H
> > +#define _UAPI_LINUX_PROCESS_VM_H
> > +
> > +/* Flags for process_vm_readv/process_vm_writev */
> > +#define PROCESS_VM_PIDFD        (1UL << 0)
> > +#define PROCESS_VM_NOWAIT       (1UL << 1)
>
> Should we use BIT here? I see some usage in other uapi headers (e.g., tcp.h)

As noted in the v2/v3 changelog: BIT() is defined in vdso/bits.h which is not
exported to userspace. UAPI headers using BIT() (like tcp.h) work in-kernel but
break for userspace programs that include these headers directly. In practice,
BIT() in tcp.h is only used by TCP_ACCECN_* flags, so merely including tcp.h in
userspace programs won't break them unless they use those flags. Those flags
were moved from kernel-only headers to UAPI headers in commit 4fa4ac5e5848
("tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info"), which appeared
in Linux v7.0-rc1 (not yet in a release). So userspace programs don't yet use
those macros.

So I think (1UL << N) is the correct choice for UAPI.

> > +
> > +#endif /* _UAPI_LINUX_PROCESS_VM_H */
> > diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> > index 656d3e88755b..dacef50be0be 100644
> > --- a/mm/process_vm_access.c
> > +++ b/mm/process_vm_access.c
> > @@ -14,6 +14,9 @@
> >  #include <linux/ptrace.h>
> >  #include <linux/slab.h>
> >  #include <linux/syscalls.h>
> > +#include <linux/process_vm.h>
> > +
> > +#define PROCESS_VM_SUPPORTED_FLAGS (PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)
> >
> >  /**
> >   * process_vm_rw_pages - read/write pages from task specified
> > @@ -68,6 +71,7 @@ static int process_vm_rw_pages(struct page **pages,
> >   * @mm: mm for task
> >   * @task: task to read/write from
> >   * @vm_write: 0 means copy from, 1 means copy to
> > + * @pvm_flags: PROCESS_VM_* flags
> >   * Returns 0 on success or on failure error code
> >   */
> >  static int process_vm_rw_single_vec(unsigned long addr,
> > @@ -76,7 +80,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >                                   struct page **process_pages,
> >                                   struct mm_struct *mm,
> >                                   struct task_struct *task,
> > -                                 int vm_write)
> > +                                 int vm_write,
> > +                                 unsigned long pvm_flags)
> >  {
> >       unsigned long pa = addr & PAGE_MASK;
> >       unsigned long start_offset = addr - pa;
> > @@ -91,6 +96,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >
> >       if (vm_write)
> >               flags |= FOLL_WRITE;
> > +     if (pvm_flags & PROCESS_VM_NOWAIT)
> > +             flags |= FOLL_NOWAIT;
> >
> >       while (!rc && nr_pages && iov_iter_count(iter)) {
> >               int pinned_pages = min_t(unsigned long, nr_pages, PVM_MAX_USER_PAGES);
> > @@ -141,7 +148,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
> >   * @iter: where to copy to/from locally
> >   * @rvec: iovec array specifying where to copy to/from in the other process
> >   * @riovcnt: size of rvec array
> > - * @flags: currently unused
> > + * @flags: process_vm_readv/writev flags
> >   * @vm_write: 0 if reading from other process, 1 if writing to other process
> >   *
> >   * Returns the number of bytes read/written or error code. May
> > @@ -163,6 +170,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> >       unsigned long nr_pages_iov;
> >       ssize_t iov_len;
> >       size_t total_len = iov_iter_count(iter);
> > +     unsigned int f_flags;
> >
> >       /*
> >        * Work out how many pages of struct pages we're going to need
> > @@ -194,10 +202,18 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> >       }
> >
> >       /* Get process information */
> > -     task = find_get_task_by_vpid(pid);
> > -     if (!task) {
> > -             rc = -ESRCH;
> > -             goto free_proc_pages;
> > +     if (flags & PROCESS_VM_PIDFD) {
> > +             task = pidfd_get_task(pid, &f_flags);
> > +             if (IS_ERR(task)) {
> > +                     rc = PTR_ERR(task);
>
> This could return -EBADF or -ESRCH. We should document both in the man page. (or
> decide to always return -ESRCH, dunno)

Agreed with Christian's reply — keeping the actual errno from
pidfd_get_task() is useful information for userspace. Both EBADF
and ESRCH are documented in the suggested man page update text in v4.

Thanks,
Alban

>
>
>
> --
> Cheers,
>
> David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
  2026-04-28 20:05   ` David Hildenbrand (Arm)
                       ` (2 preceding siblings ...)
  2026-05-14  9:15     ` Alban Crequy
@ 2026-05-14 14:34     ` Alban Crequy
  3 siblings, 0 replies; 8+ messages in thread
From: Alban Crequy @ 2026-05-14 14:34 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-kernel, linux-mm,
	Peter Xu, Willy Tarreau, linux-kselftest, shuah, Usama Arif,
	David Laight

Hi David,

(Apologies for the earlier HTML email — resending in plain text.)

Thanks for the review. I'm addressing all your comments in v4.

> What is the expected return value to user space if we run into this case?

If the first page is not resident and PROCESS_VM_NOWAIT is set, it returns -1
with errno EFAULT.

Note that Sashiko suggested returning errno EAGAIN to distinguish
invalid addresses from non-resident memory. I haven't done that, to avoid
adding more code that tweaks errnos; let me know if you would like me to do
that.
https://sashiko.dev/#/patchset/20260428122826.339550-1-alban.crequy%40gmail.com

It can also return a short read, meaning it returns the number of bytes
successfully transferred from pages prior to the fault.

This is the same partial-read behavior as a regular
process_vm_readv when hitting an unmapped page — NOWAIT just changes
when the fault fails (immediately instead of blocking).

I've documented this in the updated commit message in v4 (I will send it
shortly). I also added two selftests that verify partial reads across a
resident and a non-resident page (with both single iovec and two iovecs).

Interestingly, while writing these tests, I found that the current
process_vm_readv(2) man page incorrectly claims that partial transfers
apply at iovec-element granularity. In practice, the kernel returns
partial reads at page granularity within a single remote iovec element.
I've sent a separate fix for that to linux-man@:
https://lore.kernel.org/linux-man/20260514083659.139971-1-alban.crequy@gmail.com/T/#u
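The page-granularity behaviour can be illustrated with a bit of arithmetic. This is a rough sketch that assumes a single remote iovec element; `pvm_expected_bytes()` is a hypothetical helper written for this explanation, not kernel code. It computes how many bytes a read of `len` bytes starting at `addr` is expected to transfer when `first_bad_page` is the page-aligned address of the first page that cannot be accessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: expected byte count for a read of [addr, addr + len)
 * from one remote iovec element. first_bad_page is the page-aligned address
 * of the first inaccessible page; pass a value >= addr + len when every page
 * is accessible.
 */
static size_t pvm_expected_bytes(uintptr_t addr, size_t len,
				 uintptr_t first_bad_page, size_t page_size)
{
	/* page_size must be a power of two and first_bad_page page-aligned. */
	assert(page_size && (page_size & (page_size - 1)) == 0);
	assert((first_bad_page & (page_size - 1)) == 0);

	if (first_bad_page >= addr + len)
		return len;		/* full transfer */
	if (first_bad_page <= addr)
		return 0;		/* first page faults: caller sees -1 with EFAULT */
	return first_bad_page - addr;	/* short read stops at the page boundary */
}
```

For example, with 4096-byte pages, a read of 8000 bytes starting 100 bytes into a resident page, followed by a non-resident page, would be expected to return 4096 - 100 = 3996 bytes.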

> And in the same context: Will you send a man page update? :)

I've included a suggested man page update in the commit message of v4 patch
1/2, ready for the man-pages maintainer to pick up. I'll also send a separate
patch to linux-man@ once the kernel patches are merged and available in a new
Linux release.

> We try to sort this alphabetically.

Fixed in v4 — moved the include/uapi line to the correct position.

> Thinking out loud: the c file is called "process_vm_access.c", should
> we name the header like that as well?

Good idea — renamed to include/uapi/linux/process_vm_access.h in v4.

> Should we use BIT here?

As noted in the v2/v3 changelog: BIT() is defined in vdso/bits.h which is not
exported to userspace. UAPI headers using BIT() (like tcp.h) work in-kernel but
break for userspace programs that include these headers directly. In practice,
BIT() in tcp.h is only used by TCP_ACCECN_* flags, so merely including tcp.h in
userspace programs won't break them unless they use those flags. Those flags
were moved from kernel-only headers to UAPI headers in commit 4fa4ac5e5848
("tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info"), which appeared
in Linux v7.0-rc1 (not yet in a release). So userspace programs don't yet use
those macros.

So I think (1UL << N) is the correct choice for UAPI.
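As a small illustration of why the shift form is safe for userspace: the definitions below need nothing beyond the C language itself. The values are copied from the header in this patch; `PVM_KNOWN_FLAGS` and `pvm_flags_valid()` are hypothetical names for a userspace-side check that mirrors, as an assumption, how the kernel would use PROCESS_VM_SUPPORTED_FLAGS.

```c
#include <assert.h>

/* Values copied from the UAPI header proposed in this patch. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD	(1UL << 0)
#endif
#ifndef PROCESS_VM_NOWAIT
#define PROCESS_VM_NOWAIT	(1UL << 1)
#endif

/* Hypothetical userspace mirror of the kernel-internal supported-flags mask. */
#define PVM_KNOWN_FLAGS		(PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)

/* The plain shifts are ordinary constant expressions; no BIT() macro needed. */
static_assert(PROCESS_VM_PIDFD == 1UL, "bit 0");
static_assert(PROCESS_VM_NOWAIT == 2UL, "bit 1");

/* Return 1 if flags contains only bits known to this sketch, 0 otherwise. */
static int pvm_flags_valid(unsigned long flags)
{
	return (flags & ~PVM_KNOWN_FLAGS) == 0;
}
```

A header that used BIT() instead would only compile in userspace if the program supplied its own definition of that macro.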

> This could return -EBADF or -ESRCH. We should document both in the
> man page. (or decide to always return -ESRCH, dunno)

Agreed with Christian's reply — keeping the actual errno from
pidfd_get_task() is useful information for userspace. Both EBADF
and ESRCH are documented in the suggested man page update text in v4.
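To show why the two errnos are worth keeping apart, here is a minimal sketch of how a caller might react to each. The enum and `pvm_pidfd_error_action()` are hypothetical names, and the mapping assumes that EBADF means the descriptor is not a valid pidfd and ESRCH means the target process no longer exists.

```c
#include <errno.h>

/* Hypothetical caller-side reactions, for illustration only. */
enum pvm_pidfd_action {
	PVM_ACTION_FIX_FD,	/* EBADF: the caller passed a bad descriptor */
	PVM_ACTION_TARGET_GONE,	/* ESRCH: the target exited, stop sampling it */
	PVM_ACTION_OTHER,	/* anything else: handle as a generic failure */
};

/* Map the errno from a failed PROCESS_VM_PIDFD call to a reaction. */
static enum pvm_pidfd_action pvm_pidfd_error_action(int err)
{
	switch (err) {
	case EBADF:
		return PVM_ACTION_FIX_FD;
	case ESRCH:
		return PVM_ACTION_TARGET_GONE;
	default:
		return PVM_ACTION_OTHER;
	}
}
```

A profiler could, for instance, close the pidfd and drop its per-process state on ESRCH, while treating EBADF as a bug in its own bookkeeping.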

Thanks,
Alban

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-05-14 14:35 UTC | newest]

Thread overview: 8+ messages
2026-04-28 12:28 [PATCH v3 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev Alban Crequy
2026-04-28 12:28 ` [PATCH v3 1/2] " Alban Crequy
2026-04-28 20:05   ` David Hildenbrand (Arm)
2026-04-29  6:41     ` Christian Brauner
2026-04-29  6:41     ` Mike Rapoport
2026-05-14  9:15     ` Alban Crequy
2026-05-14 14:34     ` Alban Crequy
2026-04-28 12:28 ` [PATCH v3 2/2] selftests/mm: add tests for process_vm_readv flags Alban Crequy
