From: Usama Arif <usama.arif@linux.dev>
To: Alban Crequy <alban.crequy@gmail.com>
Cc: Usama Arif <usama.arif@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Christian Brauner <brauner@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Alban Crequy <albancrequy@microsoft.com>,
Peter Xu <peterx@redhat.com>, Willy Tarreau <w@1wt.eu>,
linux-kselftest@vger.kernel.org, shuah@kernel.org
Subject: Re: [PATCH v2 1/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev
Date: Thu, 9 Apr 2026 07:22:55 -0700 [thread overview]
Message-ID: <20260409142256.131676-1-usama.arif@linux.dev> (raw)
In-Reply-To: <20260408145436.843538-2-alban.crequy@gmail.com>
On Wed, 8 Apr 2026 16:54:35 +0200 Alban Crequy <alban.crequy@gmail.com> wrote:
> From: Alban Crequy <albancrequy@microsoft.com>
>
> There are two categories of users for process_vm_readv:
>
> 1. Debuggers like GDB or strace.
>
> When a debugger attempts to read the target memory and triggers a
> page fault, the page fault needs to be resolved so that the debugger
> can accurately interpret the memory. A debugger is typically attached
> to a single process.
>
> 2. Profilers like OpenTelemetry eBPF Profiler.
>
> The profiler uses a perf event to get stack traces from all
> processes at 20Hz (20 stack traces to resolve per second). For
> interpreted languages (Ruby, Python, etc.), the profiler uses
> process_vm_readv to get the correct symbols. In this case,
> performance is the most important. It is fine if some stack traces
> cannot be resolved as long as it is not statistically significant.
>
> The current behaviour of process_vm_readv is to resolve page faults in
> the target VM. This is as desired for debuggers, but unwelcome for
> profilers because the page fault resolution could take a lot of time
> depending on the backing filesystem. Additionally, since profilers
> monitor all processes, we don't want a slow page fault resolution for
> one target process slowing down the monitoring for all other target
> processes.
>
> This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose to
> not block on IO if the memory access causes a page fault.
>
> Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
> remote process via PID file descriptor instead of PID. Such a file
> descriptor can be obtained with pidfd_open(2). This is useful to avoid
> the pid number being reused. It is unlikely to happen for debuggers
> because they can monitor the target process termination in other ways
> (ptrace), but can be helpful in some profiling scenarios.
>
> If a given flag is unsupported, the syscall returns the error EINVAL
> without checking the buffers. This gives a way to userspace to detect
> whether the current kernel supports a specific flag:
>
> process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
> -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
> (before this patch)
> -> EFAULT if the kernel supports the flag (after this patch)
>
> Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
> ---
> v2:
> - Expand commit message with use-case motivation (David Hildenbrand)
> - Use unsigned long consistently for pvm_flags parameter (David Hildenbrand)
> - Add PROCESS_VM_SUPPORTED_FLAGS kernel-internal define (David Hildenbrand)
> - Keep (1UL << N) in UAPI header: BIT() is defined in vdso/bits.h
> which is not exported to userspace, so UAPI headers using BIT() would
> break when included from userspace programs (David Hildenbrand)
>
> MAINTAINERS | 1 +
> include/uapi/linux/process_vm.h | 9 +++++++++
> mm/process_vm_access.c | 24 ++++++++++++++++++------
> 3 files changed, 28 insertions(+), 6 deletions(-)
> create mode 100644 include/uapi/linux/process_vm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c3fe46d7c4bc..f7168c5d7acc 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16678,6 +16678,7 @@ F: include/linux/pgtable.h
> F: include/linux/ptdump.h
> F: include/linux/vmpressure.h
> F: include/linux/vmstat.h
> +F: include/uapi/linux/process_vm.h
> F: kernel/fork.c
> F: mm/Kconfig
> F: mm/debug.c
> diff --git a/include/uapi/linux/process_vm.h b/include/uapi/linux/process_vm.h
> new file mode 100644
> index 000000000000..4168e09f3f4e
> --- /dev/null
> +++ b/include/uapi/linux/process_vm.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_PROCESS_VM_H
> +#define _UAPI_LINUX_PROCESS_VM_H
> +
> +/* Flags for process_vm_readv/process_vm_writev */
> +#define PROCESS_VM_PIDFD (1UL << 0)
> +#define PROCESS_VM_NOWAIT (1UL << 1)
> +
> +#endif /* _UAPI_LINUX_PROCESS_VM_H */
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 656d3e88755b..c6a25e9993e1 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -14,6 +14,9 @@
> #include <linux/ptrace.h>
> #include <linux/slab.h>
> #include <linux/syscalls.h>
> +#include <linux/process_vm.h>
> +
> +#define PROCESS_VM_SUPPORTED_FLAGS (PROCESS_VM_PIDFD | PROCESS_VM_NOWAIT)
>
> /**
> * process_vm_rw_pages - read/write pages from task specified
> @@ -68,6 +71,7 @@ static int process_vm_rw_pages(struct page **pages,
> * @mm: mm for task
> * @task: task to read/write from
> * @vm_write: 0 means copy from, 1 means copy to
> + * @pvm_flags: PROCESS_VM_* flags
> * Returns 0 on success or on failure error code
> */
> static int process_vm_rw_single_vec(unsigned long addr,
> @@ -76,7 +80,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
> struct page **process_pages,
> struct mm_struct *mm,
> struct task_struct *task,
> - int vm_write)
> + int vm_write,
> + unsigned long pvm_flags)
> {
> unsigned long pa = addr & PAGE_MASK;
> unsigned long start_offset = addr - pa;
> @@ -91,6 +96,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>
> if (vm_write)
> flags |= FOLL_WRITE;
> + if (pvm_flags & PROCESS_VM_NOWAIT)
> + flags |= FOLL_NOWAIT;
>
> while (!rc && nr_pages && iov_iter_count(iter)) {
> int pinned_pages = min_t(unsigned long, nr_pages, PVM_MAX_USER_PAGES);
> @@ -141,7 +148,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
> * @iter: where to copy to/from locally
> * @rvec: iovec array specifying where to copy to/from in the other process
> * @riovcnt: size of rvec array
> - * @flags: currently unused
> + * @flags: process_vm_readv/writev flags
> * @vm_write: 0 if reading from other process, 1 if writing to other process
> *
> * Returns the number of bytes read/written or error code. May
> @@ -163,6 +170,7 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> unsigned long nr_pages_iov;
> ssize_t iov_len;
> size_t total_len = iov_iter_count(iter);
> + unsigned int f_flags;
>
> /*
> * Work out how many pages of struct pages we're going to need
> @@ -194,7 +202,11 @@ static ssize_t process_vm_rw_core(pid_t pid, struct iov_iter *iter,
> }
>
> /* Get process information */
> - task = find_get_task_by_vpid(pid);
> + if (flags & PROCESS_VM_PIDFD)
> + task = pidfd_get_task(pid, &f_flags);
> + else
> + task = find_get_task_by_vpid(pid);
> +
> if (!task) {
> rc = -ESRCH;
> goto free_proc_pages;
pidfd_get_task() returns ERR_PTR() on failure (e.g. ERR_PTR(-EBADF)),
but the code checks "if (!task)" which only catches NULL. An invalid
pidfd will cause mm_access() and put_task_struct() to dereference an
error pointer, crashing the kernel.
next prev parent reply other threads:[~2026-04-09 14:23 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-08 14:54 [PATCH v2 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev Alban Crequy
2026-04-08 14:54 ` [PATCH v2 1/2] " Alban Crequy
2026-04-09 14:22 ` Usama Arif [this message]
2026-04-23 12:52 ` David Hildenbrand (Arm)
2026-04-23 14:04 ` David Laight
2026-04-23 14:17 ` David Hildenbrand (Arm)
2026-04-08 14:54 ` [PATCH v2 2/2] selftests/mm: add tests for process_vm_readv flags Alban Crequy
2026-04-09 12:38 ` [PATCH v2 0/2] mm/process_vm_access: pidfd and nowait support for process_vm_readv/writev Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260409142256.131676-1-usama.arif@linux.dev \
--to=usama.arif@linux.dev \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alban.crequy@gmail.com \
--cc=albancrequy@microsoft.com \
--cc=brauner@kernel.org \
--cc=david@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=peterx@redhat.com \
--cc=rppt@kernel.org \
--cc=shuah@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=w@1wt.eu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.