All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg KH <gregkh@linuxfoundation.org>
To: Andrii Nakryiko <andrii@kernel.org>
Cc: linux-fsdevel@vger.kernel.org, brauner@kernel.org,
	viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
Date: Sat, 4 May 2024 17:28:36 +0200	[thread overview]
Message-ID: <2024050439-janitor-scoff-be04@gregkh> (raw)
In-Reply-To: <20240504003006.3303334-3-andrii@kernel.org>

On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers. They would
> normally capture stack trace containing absolute memory addresses of
> some functions, and would then use /proc/<pid>/maps file to file
> corresponding backing ELF files, file offsets within them, and then
> continue from there to get yet more information (ELF symbols, DWARF
> information) to get human-readable symbolic information.
> 
> As such, there are both performance and correctness requirement
> involved. This address to VMA information translation has to be done as
> efficiently as possible, but also not miss any VMA (especially in the
> case of loading/unloading shared libraries).
> 
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above 100%.

Is this a new change or has it always been this way?

> First, it's text based, which makes its programmatic use from
> applications and libraries unnecessarily cumbersome and slow due to the
> need to do text parsing to get necessary pieces of information.

slow in what way?  How has it never been noticed before as a problem?

And exact numbers are appreciated please, yes open/read/close seems
slower than open/ioctl/close, but is it really overall an issue in the
real world for anything?

Text apis are good as everyone can handle them, ioctls are harder for
obvious reasons.

> Second, it's main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a small subset of all
> process' VMAs, mainly containing executable text. Yet, library would
> need to parse most or all of the contents to find needed VMAs, as there
> is no way to skip VMAs that are of no use. Efficient library can do the
> linear pass and it is still relatively efficient, but it's definitely an
> overhead that can be avoided, if there was a way to do more targeted
> querying of the relevant VMA information.

I don't understand, is this a bug in the current files?  If so, why not
just fix that up?

And again "efficient" need to be quantified.

> Another problem when writing generic stack trace symbolization library
> is an unfortunate performance-vs-correctness tradeoff that needs to be
> made.

What requirement has caused a "generic stack trace symbolization
library" to be needed at all?  What is the problem you are trying to
solve that is not already solved by existing tools?

> Library has to make a decision to either cache parsed contents of
> /proc/<pid>/maps for service future requests (if application requests to
> symbolize another set of addresses, captured at some later time, which
> is typical for periodic/continuous profiling cases) to avoid higher
> costs of needed to re-parse this file or caching the contents in memory
> to speed up future requests. In the former case, more memory is used for
> the cache and there is a risk of getting stale data if application
> loaded/unloaded shared libraries, or otherwise changed its set of VMAs
> through additiona mmap() calls (and other means of altering memory
> address space). In the latter case, it's the performance hit that comes
> from re-opening the file and re-reading/re-parsing its contents all over
> again.

Again, "performance hit" needs to be justified, it shouldn't be much
overall.

> This patch aims to solve this problem by providing a new API built on
> top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> interface, avoiding the cost and awkwardness of textual representation
> for programmatic use.

Some people find text easier to handle for programmatic use :)

> It's designed to be extensible and
> forward/backward compatible by including user-specified field size and
> using copy_struct_from_user() approach. But, most importantly, it allows
> to do point queries for specific single address, specified by user. And
> this is done efficiently using VMA iterator.

Ok, maybe this is the main issue, you only want one at a time?

> User has a choice to pick either getting VMA that covers provided
> address or -ENOENT if none is found (exact, least surprising, case). Or,
> with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> get either VMA that covers the address (if there is one), or the closest
> next VMA (i.e., VMA with the smallest vm_start > addr). The later allows
> more efficient use, but, given it could be a surprising behavior,
> requires an explicit opt-in.
> 
> Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> sense given it's querying the same set of VMA data. All the permissions
> checks performed on /proc/<pid>/maps opening fit here as well.
> ioctl-based implementation is fetching remembered mm_struct reference,
> but otherwise doesn't interfere with seq_file-based implementation of
> /proc/<pid>/maps textual interface, and so could be used together or
> independently without paying any price for that.
> 
> There is one extra thing that /proc/<pid>/maps doesn't currently
> provide, and that's an ability to fetch ELF build ID, if present. User
> has control over whether this piece of information is requested or not
> by either setting build_id_size field to zero or non-zero maximum buffer
> size they provided through build_id_addr field (which encodes user
> pointer as __u64 field).
> 
> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this,
> requiring applications to open underlying ELF binary through
> /proc/<pid>/map_files/<start>-<end> symlink, which adds an extra
> permissions implications due giving a full access to the binary from
> (potentially) another process, while all application is interested in is
> build ID. Giving an ability to request just build ID doesn't introduce
> any additional security concerns, on top of what /proc/<pid>/maps is
> already concerned with, simplifying the overall logic.
> 
> Kernel already implements build ID fetching, which is used from BPF
> subsystem. We are reusing this code here, but plan a follow up changes
> to make it work better under more relaxed assumption (compared to what
> existing code assumes) of being called from user process context, in
> which page faults are allowed. BPF-specific implementation currently
> bails out if necessary part of ELF file is not paged in, all due to
> extra BPF-specific restrictions (like the need to fetch build ID in
> restrictive contexts such as NMI handler).
> 
> Note also, that fetching VMA name (e.g., backing file path, or special
> hard-coded or user-provided names) is optional just like build ID. If
> user sets vma_name_size to zero, kernel code won't attempt to retrieve
> it, saving resources.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Where is the userspace code that uses this new api you have created?

> ---
>  fs/proc/task_mmu.c      | 165 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  32 ++++++++
>  2 files changed, 197 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8e503a1635b7..cb7b1ff1a144 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,6 +22,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/minmax.h>
>  #include <linux/overflow.h>
> +#include <linux/buildid.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
>  	return do_maps_open(inode, file, &proc_pid_maps_op);
>  }
>  
> +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> +{
> +	struct procfs_procmap_query karg;
> +	struct vma_iterator iter;
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +	const char *name = NULL;
> +	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> +	__u64 usize;
> +	int err;
> +
> +	if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> +		return -EFAULT;
> +	if (usize > PAGE_SIZE)

Nice, where did you document that?  And how is that portable given that
PAGE_SIZE can be different on different systems?

and why aren't you checking the actual structure size instead?  You can
easily run off the end here without knowing it.

> +		return -E2BIG;
> +	if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> +		return -EINVAL;

Ok, so you have two checks?  How can the first one ever fail?


> +	err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> +	if (err)
> +		return err;
> +
> +	if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> +		return -EINVAL;
> +	if (!!karg.vma_name_size != !!karg.vma_name_addr)
> +		return -EINVAL;
> +	if (!!karg.build_id_size != !!karg.build_id_addr)
> +		return -EINVAL;

So you want values to be set, right?

> +
> +	mm = priv->mm;
> +	if (!mm || !mmget_not_zero(mm))
> +		return -ESRCH;

What is this error for?  Where is this documentned?

> +	if (mmap_read_lock_killable(mm)) {
> +		mmput(mm);
> +		return -EINTR;
> +	}
> +
> +	vma_iter_init(&iter, mm, karg.query_addr);
> +	vma = vma_next(&iter);
> +	if (!vma) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +	/* user wants covering VMA, not the closest next one */
> +	if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> +	    vma->vm_start > karg.query_addr) {
> +		err = -ENOENT;
> +		goto out;
> +	}
> +
> +	karg.vma_start = vma->vm_start;
> +	karg.vma_end = vma->vm_end;
> +
> +	if (vma->vm_file) {
> +		const struct inode *inode = file_user_inode(vma->vm_file);
> +
> +		karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> +		karg.dev_major = MAJOR(inode->i_sb->s_dev);
> +		karg.dev_minor = MINOR(inode->i_sb->s_dev);

So the major/minor is that of the file superblock?  Why?

> +		karg.inode = inode->i_ino;

What is userspace going to do with this?

> +	} else {
> +		karg.vma_offset = 0;
> +		karg.dev_major = 0;
> +		karg.dev_minor = 0;
> +		karg.inode = 0;

Why not set everything to 0 up above at the beginning so you never miss
anything, and you don't miss any holes accidentally in the future.

> +	}
> +
> +	karg.vma_flags = 0;
> +	if (vma->vm_flags & VM_READ)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> +	if (vma->vm_flags & VM_WRITE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> +	if (vma->vm_flags & VM_EXEC)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> +	if (vma->vm_flags & VM_MAYSHARE)
> +		karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> +
> +	if (karg.build_id_size) {
> +		__u32 build_id_sz = BUILD_ID_SIZE_MAX;
> +
> +		err = build_id_parse(vma, build_id_buf, &build_id_sz);
> +		if (!err) {
> +			if (karg.build_id_size < build_id_sz) {
> +				err = -ENAMETOOLONG;
> +				goto out;
> +			}
> +			karg.build_id_size = build_id_sz;
> +		}
> +	}
> +
> +	if (karg.vma_name_size) {
> +		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> +		const struct path *path;
> +		const char *name_fmt;
> +		size_t name_sz = 0;
> +
> +		get_vma_name(vma, &path, &name, &name_fmt);
> +
> +		if (path || name_fmt || name) {
> +			name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> +			if (!name_buf) {
> +				err = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +		if (path) {
> +			name = d_path(path, name_buf, name_buf_sz);
> +			if (IS_ERR(name)) {
> +				err = PTR_ERR(name);
> +				goto out;
> +			}
> +			name_sz = name_buf + name_buf_sz - name;
> +		} else if (name || name_fmt) {
> +			name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> +			name = name_buf;
> +		}
> +		if (name_sz > name_buf_sz) {
> +			err = -ENAMETOOLONG;
> +			goto out;
> +		}
> +		karg.vma_name_size = name_sz;
> +	}
> +
> +	/* unlock and put mm_struct before copying data to user */
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +
> +	if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> +					       name, karg.vma_name_size)) {
> +		kfree(name_buf);
> +		return -EFAULT;
> +	}
> +	kfree(name_buf);
> +
> +	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
> +					       build_id_buf, karg.build_id_size))
> +		return -EFAULT;
> +
> +	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
> +		return -EFAULT;
> +
> +	return 0;
> +
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	kfree(name_buf);
> +	return err;
> +}
> +
> +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct seq_file *seq = file->private_data;
> +	struct proc_maps_private *priv = seq->private;
> +
> +	switch (cmd) {
> +	case PROCFS_PROCMAP_QUERY:
> +		return do_procmap_query(priv, (void __user *)arg);
> +	default:
> +		return -ENOIOCTLCMD;
> +	}
> +}
> +
>  const struct file_operations proc_pid_maps_operations = {
>  	.open		= pid_maps_open,
>  	.read		= seq_read,
>  	.llseek		= seq_lseek,
>  	.release	= proc_map_release,
> +	.unlocked_ioctl = procfs_procmap_ioctl,
> +	.compat_ioctl	= procfs_procmap_ioctl,
>  };
>  
>  /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 45e4e64fd664..fe8924a8d916 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -393,4 +393,36 @@ struct pm_scan_arg {
>  	__u64 return_mask;
>  };
>  
> +/* /proc/<pid>/maps ioctl */
> +#define PROCFS_IOCTL_MAGIC 0x9f

Don't you need to document this in the proper place?

> +#define PROCFS_PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> +
> +enum procmap_query_flags {
> +	PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> +};
> +
> +enum procmap_vma_flags {
> +	PROCFS_PROCMAP_VMA_READABLE = 0x01,
> +	PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> +	PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> +	PROCFS_PROCMAP_VMA_SHARED = 0x08,

Are these bits?  If so, please use the bit macro for it to make it
obvious.

> +};
> +
> +struct procfs_procmap_query {
> +	__u64 size;
> +	__u64 query_flags;		/* in */

Does this map to the procmap_vma_flags enum?  if so, please say so.

> +	__u64 query_addr;		/* in */
> +	__u64 vma_start;		/* out */
> +	__u64 vma_end;			/* out */
> +	__u64 vma_flags;		/* out */
> +	__u64 vma_offset;		/* out */
> +	__u64 inode;			/* out */

What is the inode for, you have an inode for the file already, why give
it another one?

> +	__u32 dev_major;		/* out */
> +	__u32 dev_minor;		/* out */

What is major/minor for?

> +	__u32 vma_name_size;		/* in/out */
> +	__u32 build_id_size;		/* in/out */
> +	__u64 vma_name_addr;		/* in */
> +	__u64 build_id_addr;		/* in */

Why not document this all using kerneldoc above the structure?

anyway, I don't like ioctls, but there is a place for them, you just
have to actually justify the use for them and not say "not efficient
enough" as that normally isn't an issue overall.

thanks,

greg k-h

  reply	other threads:[~2024-05-04 15:28 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-04  0:30 [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 1/5] fs/procfs: extract logic for getting VMA name constituents Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Andrii Nakryiko
2024-05-04 15:28   ` Greg KH [this message]
2024-05-04 21:50     ` Andrii Nakryiko
2024-05-06 13:58       ` Arnaldo Carvalho de Melo
2024-05-06 18:05         ` Namhyung Kim
2024-05-06 18:51           ` Andrii Nakryiko
2024-05-06 18:53           ` Arnaldo Carvalho de Melo
2024-05-06 19:16             ` Arnaldo Carvalho de Melo
2024-05-07 21:55               ` Namhyung Kim
2024-05-06 18:41         ` Andrii Nakryiko
2024-05-06 20:35           ` Arnaldo Carvalho de Melo
2024-05-07 16:36             ` Andrii Nakryiko
2024-05-04 23:36   ` kernel test robot
2024-05-07 18:10   ` Liam R. Howlett
2024-05-07 18:52     ` Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 3/5] tools: sync uapi/linux/fs.h header into tools subdir Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 4/5] selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available Andrii Nakryiko
2024-05-04  0:30 ` [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs Andrii Nakryiko
2024-05-04 15:29   ` Greg KH
2024-05-04 21:57     ` Andrii Nakryiko
2024-05-05  5:09       ` Ian Rogers
2024-05-06 18:32         ` Andrii Nakryiko
2024-05-06 18:43           ` Ian Rogers
2024-05-07  5:06             ` Andrii Nakryiko
2024-05-07 17:29               ` Andrii Nakryiko
2024-05-07 22:27                 ` Namhyung Kim
2024-05-07 22:56                   ` Andrii Nakryiko
2024-05-08  0:36                     ` Arnaldo Carvalho de Melo
2024-05-04 15:32   ` Greg KH
2024-05-04 22:13     ` Andrii Nakryiko
2024-05-07 15:48       ` Liam R. Howlett
2024-05-07 16:10         ` Matthew Wilcox
2024-05-07 16:18           ` Liam R. Howlett
2024-05-07 16:27         ` Andrii Nakryiko
2024-05-07 18:06           ` Liam R. Howlett
2024-05-07 19:00             ` Andrii Nakryiko
2024-05-08  1:20               ` Liam R. Howlett
2024-05-04 11:24 ` [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps Christian Brauner
2024-05-04 15:33   ` Greg KH
2024-05-04 21:50     ` Andrii Nakryiko
2024-05-04 21:50   ` Andrii Nakryiko
2024-05-05  5:26 ` Ian Rogers
2024-05-06 18:58   ` Andrii Nakryiko
  -- strict thread matches above, loose matches on Subject: below --
2024-05-04 18:37 [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps Alexey Dobriyan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2024050439-janitor-scoff-be04@gregkh \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=andrii@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brauner@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.