Linux Security Modules development
 help / color / mirror / Atom feed
* Re: [PATCH 01/18] fsinfo: Introduce a non-repeating system-unique superblock ID [ver #21]
From: Miklos Szeredi @ 2020-08-04  9:34 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linus Torvalds, Ian Kent, Miklos Szeredi,
	Christian Brauner, Jann Horn, Darrick J. Wong, Karel Zak,
	Jeff Layton, Linux API, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <159646179405.1784947.10794350637774567265.stgit@warthog.procyon.org.uk>

On Mon, Aug 3, 2020 at 3:37 PM David Howells <dhowells@redhat.com> wrote:
>
> Introduce an (effectively) non-repeating system-unique superblock ID that
> can be used to determine that two objects are in the same superblock
> without needing to worry about the ID changing in the meantime (as is
> possible with device IDs).
>
> The counter could also be used to tag other features, such as mount
> objects.
>
> Signed-off-by: David Howells <dhowells@redhat.com>

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>

^ permalink raw reply

* Re: [PATCH 02/18] fsinfo: Add fsinfo() syscall to query filesystem information [ver #21]
From: Miklos Szeredi @ 2020-08-04 10:16 UTC (permalink / raw)
  To: David Howells
  Cc: viro, linux-api, torvalds, raven, mszeredi, christian, jannh,
	darrick.wong, kzak, jlayton, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646180259.1784947.223853053048725752.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:36:42PM +0100, David Howells wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
> 
> ===============
> NEW SYSTEM CALL
> ===============
> 
> The new system call looks like:
> 
> 	int ret = fsinfo(int dfd,
> 			 const char *pathname,
> 			 const struct fsinfo_params *params,
> 			 size_t params_size,
> 			 void *result_buffer,
> 			 size_t result_buf_size);
> 
> The params parameter optionally points to a block of parameters:
> 
> 	struct fsinfo_params {
> 		__u64	resolve_flags;
> 		__u32	at_flags;
> 		__u32	flags;
> 		__u32	request;
> 		__u32	Nth;
> 		__u32	Mth;

The Mth field seems to be unused in this patchset.  Since the struct is
extensible, I guess there's no point in adding it now.

> 	};
> 
> If params is NULL, the default is that params->request is
> FSINFO_ATTR_STATFS and all the other fields are 0.  params_size indicates
> the size of the parameter struct.  If the parameter block is short compared
> to what the kernel expects, the missing length will be set to 0; if the
> parameter block is longer, an error will be given if the excess is not all
> zeros.
> 
> The object to be queried is specified as follows - part param->flags
> indicates the type of reference:
> 
>  (1) FSINFO_FLAGS_QUERY_PATH - dfd, pathname and at_flags indicate a
>      filesystem object to query.
> 
>      There is no separate system call providing an analogue of lstat() -
>      AT_SYMLINK_NOFOLLOW should be set in at_flags instead.
>      AT_NO_AUTOMOUNT can also be used to an allow automount point to be
>      queried without triggering it.
> 
>      RESOLVE_* flags can also be set in resolve_flags to further restrict
>      the patchwalk.
> 
>  (2) FSINFO_FLAGS_QUERY_FD - dfd indicates a file descriptor pointing to
>      the filesystem object to query.  pathname should be NULL.

This is at_flags = AT_EMPTY_PATH by convention.


> 
>  (3) FSINFO_FLAGS_QUERY_MOUNT - pathname indicates the numeric ID of the
>      mountpoint to query as a string.  dfd is used to constrain which
>      mounts can be accessed.  If dfd is AT_FDCWD, the mount must be within
>      the subtree rooted at chroot, otherwise the mount must be within the
>      subtree rooted at the directory specified by dfd.
> 
>  (4) In the future FSINFO_FLAGS_QUERY_FSCONTEXT will be added - dfd will
>      indicate a context handle fd obtained from fsopen() or fspick(),
>      allowing that to be queried before the target superblock is attached
>      to the filesystem or even created.

Can you describe features that are added by *this* patch?  It's compex enough as
is.

> 
> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	FSINFO_ATTR_STATFS		- statfs-style info
> 	FSINFO_ATTR_IDS			- Filesystem IDs
> 	FSINFO_ATTR_LIMITS		- Filesystem limits
> 	FSINFO_ATTR_SUPPORTS		- Support for statx, ioctl, etc.
> 	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
> 	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
> 	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
> 	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
> 	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO - Information about attr Nth
> 	FSINFO_ATTR_FSINFO_ATTRIBUTES	- List of supported attrs
> 
> Some attributes (such as the servers backing a network filesystem) can have
> multiple values.  These can be enumerated by setting params->Nth and
> params->Mth to 0, 1, ... until ENODATA is returned.
> 
> result_buffer and result_buf_size point to the reply buffer.  The buffer is
> filled up to the specified size, even if this means truncating the reply.
> The size of the full reply is returned, irrespective of the amount data
> that was copied.  In future versions, this will allow extra fields to be
> tacked on to the end of the reply, but anyone not expecting them will only
> get the subset they're expecting.  If either buffer of result_buf_size are
> 0, no copy will take place and the data size will be returned.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
> 
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 
>  arch/arm/tools/syscall.tbl                  |    1 
>  arch/arm64/include/asm/unistd.h             |    2 
>  arch/arm64/include/asm/unistd32.h           |    2 
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
>  fs/Kconfig                                  |    7 
>  fs/Makefile                                 |    1 
>  fs/fsinfo.c                                 |  596 +++++++++++++++++++++++++
>  include/linux/fs.h                          |    4 
>  include/linux/fsinfo.h                      |   74 +++
>  include/linux/syscalls.h                    |    4 
>  include/uapi/asm-generic/unistd.h           |    4 
>  include/uapi/linux/fsinfo.h                 |  189 ++++++++
>  kernel/sys_ni.c                             |    1 
>  samples/vfs/Makefile                        |    2 
>  samples/vfs/test-fsinfo.c                   |  646 +++++++++++++++++++++++++++
>  29 files changed, 1545 insertions(+), 3 deletions(-)
>  create mode 100644 fs/fsinfo.c
>  create mode 100644 include/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/fsinfo.h
>  create mode 100644 samples/vfs/test-fsinfo.c
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index b6cf8403da35..984abd1ac058 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -479,3 +479,4 @@
>  548	common	pidfd_getfd			sys_pidfd_getfd
>  549	common	faccessat2			sys_faccessat2
>  550	common	watch_mount			sys_watch_mount
> +551	common	fsinfo				sys_fsinfo
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 27cc1f53f4a0..bd791f91f5bb 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -453,3 +453,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index b3b2019f8d16..86a9d7b3eabe 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		441
> +#define __NR_compat_syscalls		442
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 4f9cf98cdf0f..bd78eb2c487a 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -887,6 +887,8 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  __SYSCALL(__NR_faccessat2, sys_faccessat2)
>  #define __NR_watch_mount 440
>  __SYSCALL(__NR_watch_mount, sys_watch_mount)
> +#define __NR_fsinfo 441
> +__SYSCALL(__NR_fsinfo, sys_fsinfo)
>  
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index fc6d87903781..09d144487b7d 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -360,3 +360,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index c671aa0e4d25..1bdc26af3c54 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -439,3 +439,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 65cc53f129ef..fb8543122904 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -445,3 +445,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 7f034a239930..b8362bd6bd4a 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -378,3 +378,4 @@
>  438	n32	pidfd_getfd			sys_pidfd_getfd
>  439	n32	faccessat2			sys_faccessat2
>  440	n32	watch_mount			sys_watch_mount
> +441	n32	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index d39b90de3642..60ca4091d378 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -354,3 +354,4 @@
>  438	n64	pidfd_getfd			sys_pidfd_getfd
>  439	n64	faccessat2			sys_faccessat2
>  440	n64	watch_mount			sys_watch_mount
> +441	n64	fsinfo				sys_fsinfo
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 09f426cb45b1..07aea9379ca0 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -427,3 +427,4 @@
>  438	o32	pidfd_getfd			sys_pidfd_getfd
>  439	o32	faccessat2			sys_faccessat2
>  440	o32	watch_mount			sys_watch_mount
> +441	o32	fsinfo				sys_fsinfo
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 52ff3454baa1..f8060767f11a 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -437,3 +437,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 10b7ed3c7a1b..3036bf1336d2 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -529,3 +529,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 86f317bf52df..c0a111fdb3ce 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -442,3 +442,4 @@
>  438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
>  439  common	faccessat2		sys_faccessat2			sys_faccessat2
>  440	common	watch_mount		sys_watch_mount			sys_watch_mount
> +441	common	fsinfo			sys_fsinfo			sys_fsinfo
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 0bb0f0b372c7..03b55c32441f 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -442,3 +442,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 369ab65c1e9a..a0144db9fb8c 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -485,3 +485,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index e760ba92c58d..edf90a2be0b9 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -444,3 +444,4 @@
>  438	i386	pidfd_getfd		sys_pidfd_getfd
>  439	i386	faccessat2		sys_faccessat2
>  440	i386	watch_mount		sys_watch_mount
> +441	i386	fsinfo			sys_fsinfo
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5b58621d4f75..ab0eda639d67 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -361,6 +361,7 @@
>  438	common	pidfd_getfd		sys_pidfd_getfd
>  439	common	faccessat2		sys_faccessat2
>  440	common	watch_mount		sys_watch_mount
> +441	common	fsinfo			sys_fsinfo
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 5b28ee39f70f..979013890caf 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -410,3 +410,4 @@
>  438	common	pidfd_getfd			sys_pidfd_getfd
>  439	common	faccessat2			sys_faccessat2
>  440	common	watch_mount			sys_watch_mount
> +441	common	fsinfo				sys_fsinfo
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 1a55e56d5c54..df76451ab49a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
>  	  Enable this to perform validation of the parameter description for a
>  	  filesystem when it is registered.
>  
> +config FSINFO
> +	bool "Enable the fsinfo() system call"
> +	help
> +	  Enable the file system information querying system call to allow
> +	  comprehensive information to be retrieved about a filesystem,
> +	  superblock or mount object.
> +
>  if BLOCK
>  
>  config FS_IOMAP
> diff --git a/fs/Makefile b/fs/Makefile
> index dd0d87e2ef19..93a7f8047585 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP)		+= coredump.o
>  obj-$(CONFIG_SYSCTL)		+= drop_caches.o
>  
>  obj-$(CONFIG_FHANDLE)		+= fhandle.o
> +obj-$(CONFIG_FSINFO)		+= fsinfo.o
>  obj-y				+= iomap/
>  
>  obj-y				+= quota/
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> new file mode 100644
> index 000000000000..7d9c73e9cbde
> --- /dev/null
> +++ b/fs/fsinfo.c
> @@ -0,0 +1,596 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query.
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#include <linux/syscalls.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/statfs.h>
> +#include <linux/security.h>
> +#include <linux/uaccess.h>
> +#include <linux/fsinfo.h>
> +#include <uapi/linux/mount.h>
> +#include "internal.h"
> +
> +/**
> + * fsinfo_opaque - Store opaque blob as an fsinfo attribute value.
> + * @s: The blob to store (may be NULL)
> + * @ctx: The parameter context
> + * @len: The length of the blob
> + */
> +int fsinfo_opaque(const void *s, struct fsinfo_context *ctx, unsigned int len)
> +{
> +	void *p = ctx->buffer;
> +	int ret = 0;
> +
> +	if (s) {
> +		if (!ctx->want_size_only)
> +			memcpy(p, s, len);
> +		ret = len;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(fsinfo_opaque);
> +
> +/**
> + * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
> + * @s: The string to store (may be NULL)
> + * @ctx: The parameter context
> + */
> +int fsinfo_string(const char *s, struct fsinfo_context *ctx)
> +{
> +	if (!s)
> +		return 1;
> +	return fsinfo_opaque(s, ctx, min_t(size_t, strlen(s) + 1, ctx->buf_size));
> +}
> +EXPORT_SYMBOL(fsinfo_string);
> +
> +/*
> + * Get basic filesystem stats from statfs.
> + */
> +static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_statfs *p = ctx->buffer;
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0)
> +		return ret;
> +
> +	p->f_blocks.lo	= buf.f_blocks;
> +	p->f_bfree.lo	= buf.f_bfree;
> +	p->f_bavail.lo	= buf.f_bavail;
> +	p->f_files.lo	= buf.f_files;
> +	p->f_ffree.lo	= buf.f_ffree;
> +	p->f_favail.lo	= buf.f_ffree;
> +	p->f_bsize	= buf.f_bsize;
> +	p->f_frsize	= buf.f_frsize;
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_ids *p = ctx->buffer;
> +	struct super_block *sb;
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0 && ret != -ENOSYS)
> +		return ret;
> +	if (ret == 0)
> +		memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
> +
> +	sb = path->dentry->d_sb;
> +	p->f_fstype	= sb->s_magic;
> +	p->f_dev_major	= MAJOR(sb->s_dev);
> +	p->f_dev_minor	= MINOR(sb->s_dev);
> +	p->f_sb_id	= sb->s_unique_id;
> +	strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
> +	return sizeof(*p);
> +}
> +
> +int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_limits *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	p->max_file_size.hi	= 0;
> +	p->max_file_size.lo	= sb->s_maxbytes;
> +	p->max_ino.hi		= 0;
> +	p->max_ino.lo		= UINT_MAX;
> +	p->max_hard_links	= sb->s_max_links;
> +	p->max_uid		= UINT_MAX;
> +	p->max_gid		= UINT_MAX;
> +	p->max_projid		= UINT_MAX;
> +	p->max_filename_len	= NAME_MAX;
> +	p->max_symlink_len	= PATH_MAX;
> +	p->max_xattr_name_len	= XATTR_NAME_MAX;
> +	p->max_xattr_body_len	= XATTR_SIZE_MAX;
> +	p->max_dev_major	= 0xffffff;
> +	p->max_dev_minor	= 0xff;
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_limits);
> +
> +int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_supports *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	p->stx_mask = STATX_BASIC_STATS;
> +	if (sb->s_d_op && sb->s_d_op->d_automount)
> +		p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_supports);
> +
> +static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
> +	.atime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.mtime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.ctime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.btime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +};
> +
> +int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_timestamp_info *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +	s8 exponent;
> +
> +	*p = fsinfo_default_timestamp_info;
> +
> +	if (sb->s_time_gran < 1000000000) {
> +		if (sb->s_time_gran < 1000)
> +			exponent = -9;
> +		else if (sb->s_time_gran < 1000000)
> +			exponent = -6;
> +		else
> +			exponent = -3;
> +
> +		p->atime.gran_exponent = exponent;
> +		p->mtime.gran_exponent = exponent;
> +		p->ctime.gran_exponent = exponent;
> +		p->btime.gran_exponent = exponent;
> +	}
> +
> +	return sizeof(*p);
> +}
> +EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
> +
> +static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_volume_uuid *p = ctx->buffer;
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	memcpy(p, &sb->s_uuid, sizeof(*p));
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
> +{
> +	return fsinfo_string(path->dentry->d_sb->s_id, ctx);
> +}
> +
> +static const struct fsinfo_attribute fsinfo_common_attributes[] = {
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		fsinfo_generic_volume_id),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
> +
> +	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	(void *)123UL),
> +	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
> +	{}
> +};
> +
> +/*
> + * Determine an attribute's minimum buffer size and, if the buffer is large
> + * enough, get the attribute value.
> + */
> +static int fsinfo_get_this_attribute(struct path *path,
> +				     struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attr)
> +{
> +	int buf_size;
> +
> +	if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
> +		return -ENODATA;
> +	if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
> +		return -ENODATA;
> +
> +	switch (attr->type) {
> +	case FSINFO_TYPE_VSTRUCT:
> +		ctx->clear_tail = true;
> +		buf_size = attr->size;
> +		break;
> +	case FSINFO_TYPE_STRING:
> +	case FSINFO_TYPE_OPAQUE:
> +	case FSINFO_TYPE_LIST:
> +		buf_size = 4096;
> +		break;
> +	default:
> +		return -ENOPKG;
> +	}
> +
> +	if (ctx->buf_size < buf_size)
> +		return buf_size;
> +
> +	return attr->get(path, ctx);
> +}
> +
> +static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attr)
> +{
> +	__u32 *p = ctx->buffer;
> +	unsigned int i;
> +
> +	if (ctx->usage >= ctx->buf_size ||
> +	    ctx->buf_size - ctx->usage < sizeof(__u32)) {
> +		ctx->usage += sizeof(__u32);
> +		return;
> +	}
> +
> +	for (i = 0; i < ctx->usage / sizeof(__u32); i++)
> +		if (p[i] == attr->attr_id)
> +			return;
> +
> +	p[i] = attr->attr_id;
> +	ctx->usage += sizeof(__u32);
> +}
> +
> +static int fsinfo_list_attributes(struct path *path,
> +				  struct fsinfo_context *ctx,
> +				  const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +
> +	for (a = attributes; a->get; a++)
> +		fsinfo_attributes_insert(ctx, a);
> +	return -EOPNOTSUPP; /* We want to go through all the lists */
> +}
> +
> +static int fsinfo_get_attribute_info(struct path *path,
> +				     struct fsinfo_context *ctx,
> +				     const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +	struct fsinfo_attribute_info *p = ctx->buffer;
> +
> +	if (!ctx->buf_size)
> +		return sizeof(*p);
> +
> +	for (a = attributes; a->get; a++) {
> +		if (a->attr_id == ctx->Nth) {
> +			p->attr_id	= a->attr_id;
> +			p->type		= a->type;
> +			p->flags	= a->flags;
> +			p->size		= a->size;
> +			p->size		= a->size;
> +			return sizeof(*p);
> +		}
> +	}
> +	return -EOPNOTSUPP; /* We want to go through all the lists */
> +}
> +
> +/**
> + * fsinfo_get_attribute - Look up and handle an attribute
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + * @attributes: List of attributes to search.
> + *
> + * Look through a list of attributes for one that matches the requested
> + * attribute then call the handler for it.
> + */
> +int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
> +			 const struct fsinfo_attribute *attributes)
> +{
> +	const struct fsinfo_attribute *a;
> +
> +	switch (ctx->requested_attr) {
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
> +		return fsinfo_get_attribute_info(path, ctx, attributes);
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
> +		return fsinfo_list_attributes(path, ctx, attributes);
> +	default:
> +		for (a = attributes; a->get; a++)
> +			if (a->attr_id == ctx->requested_attr)
> +				return fsinfo_get_this_attribute(path, ctx, a);
> +		return -EOPNOTSUPP;
> +	}
> +}
> +EXPORT_SYMBOL(fsinfo_get_attribute);
> +
> +/**
> + * generic_fsinfo - Handle an fsinfo attribute generically
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + */
> +static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
> +{
> +	int ret;
> +
> +	if (path->dentry->d_sb->s_op->fsinfo) {
> +		ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
> +		if (ret != -EOPNOTSUPP)
> +			return ret;
> +	}
> +	ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
> +	if (ret != -EOPNOTSUPP)
> +		return ret;
> +
> +	switch (ctx->requested_attr) {
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
> +		return -ENODATA;
> +	case FSINFO_ATTR_FSINFO_ATTRIBUTES:
> +		return ctx->usage;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +/**
> + * vfs_fsinfo - Retrieve filesystem information
> + * @path: The object to query
> + * @params: Parameters to define a request and place to store result
> + *
> + * Get an attribute on a filesystem or an object within a filesystem.  The
> + * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
> + * if it's a multi-valued attribute, the particular value is selected by
> + * @ctx->Nth and then @ctx->Mth.
> + *
> + * For common attributes, a value may be fabricated if it is not supported by
> + * the filesystem.
> + *
> + * On success, the size of the attribute's value is returned (0 is a valid
> + * size).  A buffer will have been allocated and will be pointed to by
> + * @ctx->buffer.  The caller must free this with kvfree().
> + *
> + * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
> + * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
> + * doesn't exist for the specified object or -ENODATA if the attribute exists,
> + * but the Nth,Mth value does not exist.  -EMSGSIZE indicates that the value is
> + * unmanageable internally and -ENOPKG indicates other internal failure.
> + *
> + * Errors such as -EIO may also come from attempts to access media or servers
> + * to obtain the requested information if it's not immediately to hand.
> + *
> + * [*] Note that the caller may set @ctx->want_size_only if it only wants the
> + *     size of the value and not the data.  If this is set, a buffer may not be
> + *     allocated under some circumstances.  This is intended for size query by
> + *     userspace.
> + *
> + * [*] Note that @ctx->clear_tail will be returned set if the data should be
> + *     padded out with zeros when writing it to userspace.
> + */
> +static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct dentry *dentry = path->dentry;
> +	int ret;
> +
> +	ret = security_sb_statfs(dentry);
> +	if (ret)
> +		return ret;
> +
> +	/* Call the handler to find out the buffer size required. */
> +	ctx->buf_size = 0;
> +	ret = fsinfo_call(path, ctx);
> +	if (ret < 0 || ctx->want_size_only)
> +		return ret;
> +	ctx->buf_size = ret;
> +
> +	do {
> +		/* Allocate a buffer of the requested size. */
> +		if (ctx->buf_size > INT_MAX)
> +			return -EMSGSIZE;
> +		ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
> +		if (!ctx->buffer)
> +			return -ENOMEM;
> +
> +		ctx->usage = 0;
> +		ctx->skip = 0;
> +		ret = fsinfo_call(path, ctx);
> +		if (IS_ERR_VALUE((long)ret))
> +			return ret;
> +		if ((unsigned int)ret <= ctx->buf_size)
> +			return ret; /* It fitted */
> +
> +		/* We need to resize the buffer */
> +		ctx->buf_size = roundup(ret, PAGE_SIZE);
> +		kvfree(ctx->buffer);
> +		ctx->buffer = NULL;
> +	} while (!signal_pending(current));
> +
> +	return -ERESTARTSYS;
> +}
> +
> +static int vfs_fsinfo_path(int dfd, const char __user *pathname,
> +			   const struct fsinfo_params *up,
> +			   struct fsinfo_context *ctx)
> +{
> +	struct path path;
> +	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
> +	int ret = -EINVAL;
> +
> +	if (up->resolve_flags & ~VALID_RESOLVE_FLAGS)
> +		return -EINVAL;
> +	if (up->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
> +			     AT_EMPTY_PATH))
> +		return -EINVAL;
> +
> +	if (up->resolve_flags & RESOLVE_NO_XDEV)
> +		lookup_flags |= LOOKUP_NO_XDEV;
> +	if (up->resolve_flags & RESOLVE_NO_MAGICLINKS)
> +		lookup_flags |= LOOKUP_NO_MAGICLINKS;
> +	if (up->resolve_flags & RESOLVE_NO_SYMLINKS)
> +		lookup_flags |= LOOKUP_NO_SYMLINKS;
> +	if (up->resolve_flags & RESOLVE_BENEATH)
> +		lookup_flags |= LOOKUP_BENEATH;
> +	if (up->resolve_flags & RESOLVE_IN_ROOT)
> +		lookup_flags |= LOOKUP_IN_ROOT;
> +	if (up->at_flags & AT_SYMLINK_NOFOLLOW)
> +		lookup_flags &= ~LOOKUP_FOLLOW;
> +	if (up->at_flags & AT_NO_AUTOMOUNT)
> +		lookup_flags &= ~LOOKUP_AUTOMOUNT;
> +	if (up->at_flags & AT_EMPTY_PATH)
> +		lookup_flags |= LOOKUP_EMPTY;
> +
> +retry:
> +	ret = user_path_at(dfd, pathname, lookup_flags, &path);
> +	if (ret)
> +		goto out;
> +
> +	ret = vfs_fsinfo(&path, ctx);
> +	path_put(&path);
> +	if (retry_estale(ret, lookup_flags)) {
> +		lookup_flags |= LOOKUP_REVAL;
> +		goto retry;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
> +{
> +	struct fd f = fdget_raw(fd);
> +	int ret = -EBADF;
> +
> +	if (f.file) {
> +		ret = vfs_fsinfo(&f.file->f_path, ctx);
> +		fdput(f);
> +	}
> +	return ret;
> +}
> +
> +/**
> + * sys_fsinfo - System call to get filesystem information
> + * @dfd: Base directory to pathwalk from or fd referring to filesystem.
> + * @pathname: Filesystem to query or NULL.
> + * @params: Parameters to define request (NULL: FSINFO_ATTR_STATFS).
> + * @params_size: Size of parameter buffer.
> + * @result_buffer: Result buffer.
> + * @result_buf_size: Size of result buffer.
> + *
> + * Get information on a filesystem.  The filesystem attribute to be queried is
> + * indicated by @_params->request, and some of the attributes can have multiple
> + * values, indexed by @_params->Nth and @_params->Mth.  If @_params is NULL,
> + * then the 0th fsinfo_attr_statfs attribute is queried.  If an attribute does
> + * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
> + * ENODATA is returned.
> + *
> + * On success, the size of the attribute's value is returned.  If
> + * @result_buf_size is 0 or @result_buffer is NULL, only the size is returned.
> + * If the size of the value is larger than @result_buf_size, it will be
> + * truncated by the copy.  If the size of the value is smaller than
> + * @result_buf_size then the excess buffer space will be cleared.  The full
> + * size of the value will be returned, irrespective of how much data is
> + * actually placed in the buffer.
> + */
> +SYSCALL_DEFINE6(fsinfo,
> +		int, dfd,
> +		const char __user *, pathname,
> +		const struct fsinfo_params __user *, params,
> +		size_t, params_size,
> +		void __user *, result_buffer,
> +		size_t, result_buf_size)
> +{
> +	struct fsinfo_context ctx;
> +	struct fsinfo_params user_params;
> +	unsigned int result_size;
> +	void *r;
> +	int ret;
> +
> +	if ((!params &&  params_size) ||
> +	    ( params && !params_size) ||
> +	    (!result_buffer &&  result_buf_size) ||
> +	    ( result_buffer && !result_buf_size))
> +		return -EINVAL;
> +	if (result_buf_size > UINT_MAX)
> +		return -EOVERFLOW;
> +
> +	memset(&ctx, 0, sizeof(ctx));
> +	ctx.requested_attr	= FSINFO_ATTR_STATFS;
> +	ctx.flags		= FSINFO_FLAGS_QUERY_PATH;
> +	ctx.want_size_only	= (result_buf_size == 0);
> +
> +	if (params) {
> +		ret = copy_struct_from_user(&user_params, sizeof(user_params),
> +					    params, params_size);
> +		if (ret < 0)
> +			return ret;
> +		if (user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
> +			return -EINVAL;
> +		ctx.flags = user_params.flags;
> +		ctx.requested_attr = user_params.request;
> +		ctx.Nth = user_params.Nth;
> +		ctx.Mth = user_params.Mth;
> +	}
> +
> +	switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
> +	case FSINFO_FLAGS_QUERY_PATH:
> +		ret = vfs_fsinfo_path(dfd, pathname, &user_params, &ctx);
> +		break;
> +	case FSINFO_FLAGS_QUERY_FD:
> +		if (pathname)
> +			return -EINVAL;
> +		ret = vfs_fsinfo_fd(dfd, &ctx);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (ret < 0)
> +		goto error;
> +
> +	r = ctx.buffer + ctx.skip;
> +	result_size = min_t(size_t, ret, result_buf_size);
> +	if (result_size > 0 &&
> +	    copy_to_user(result_buffer, r, result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error;
> +	}
> +
> +	/* Clear any part of the buffer that we won't fill if we're putting a
> +	 * struct in there.  Strings, opaque objects and arrays are expected to
> +	 * be variable length.
> +	 */
> +	if (ctx.clear_tail &&
> +	    result_buf_size > result_size &&
> +	    clear_user(result_buffer + result_size,
> +		       result_buf_size - result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error;
> +	}
> +
> +error:
> +	kvfree(ctx.buffer);
> +	return ret;
> +}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 28a29356eace..3284f497de0a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -68,6 +68,7 @@ struct fsverity_info;
>  struct fsverity_operations;
>  struct fs_context;
>  struct fs_parameter_spec;
> +struct fsinfo_context;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1963,6 +1964,9 @@ struct super_operations {
>  	int (*thaw_super) (struct super_block *);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
> +#ifdef CONFIG_FSINFO
> +	int (*fsinfo)(struct path *, struct fsinfo_context *);
> +#endif
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
> new file mode 100644
> index 000000000000..a811d69b02ff
> --- /dev/null
> +++ b/include/linux/fsinfo.h
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#ifndef _LINUX_FSINFO_H
> +#define _LINUX_FSINFO_H
> +
> +#ifdef CONFIG_FSINFO
> +
> +#include <uapi/linux/fsinfo.h>
> +
> +struct path;
> +
> +#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
> +
> +struct fsinfo_context {
> +	__u32		flags;		/* [in] FSINFO_FLAGS_* */
> +	__u32		requested_attr;	/* [in] What is being asking for */
> +	__u32		Nth;		/* [in] Instance of it (some may have multiple) */
> +	__u32		Mth;		/* [in] Subinstance */
> +	bool		want_size_only;	/* [in] Just want to know the size, not the data */
> +	bool		clear_tail;	/* [out] T if tail of buffer should be cleared */
> +	unsigned int	skip;		/* [out] Number of bytes to skip in buffer */
> +	unsigned int	usage;		/* [tmp] Amount of buffer used (if large) */
> +	unsigned int	buf_size;	/* [tmp] Size of ->buffer[] */
> +	void		*buffer;	/* [out] The reply buffer */
> +};
> +
> +/*
> + * A filesystem information attribute definition.
> + */
> +struct fsinfo_attribute {
> +	unsigned int		attr_id;	/* The ID of the attribute */
> +	enum fsinfo_value_type	type:8;		/* The type of the attribute's value(s) */
> +	unsigned int		flags:8;
> +	unsigned int		size:16;	/* - Value size (FSINFO_STRUCT/LIST) */
> +	int (*get)(struct path *path, struct fsinfo_context *params);
> +};
> +
> +#define __FSINFO(A, T, S, G, F) \
> +	{ .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
> +
> +#define _FSINFO(A, T, S, G)	__FSINFO(A, T, S, G, 0)
> +#define _FSINFO_N(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N)
> +#define _FSINFO_NM(A, T, S, G)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
> +
> +#define _FSINFO_VSTRUCT(A,S,G)	  _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +#define _FSINFO_VSTRUCT_N(A,S,G)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
> +
> +#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
> +#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
> +#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
> +#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G)


The opaque type seems to be unused in this patchset.  It's definitely not
somehting we want without a good reason, so if that reason arrises, then let's
please discuss then.

> +#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
> +#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
> +
> +extern int fsinfo_opaque(const void *, struct fsinfo_context *, unsigned int);
> +extern int fsinfo_string(const char *, struct fsinfo_context *);
> +extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
> +extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
> +				const struct fsinfo_attribute *);
> +
> +#endif /* CONFIG_FSINFO */
> +
> +#endif /* _LINUX_FSINFO_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 88d03fd627ab..e31ad49af4c3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -47,6 +47,7 @@ struct stat64;
>  struct statfs;
>  struct statfs64;
>  struct statx;
> +struct fsinfo_params;
>  struct __sysctl_args;
>  struct sysinfo;
>  struct timespec;
> @@ -1007,6 +1008,9 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
>  asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
>  asmlinkage long sys_watch_mount(int dfd, const char __user *path,
>  				unsigned int at_flags, int watch_fd, int watch_id);
> +asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
> +			   const struct fsinfo_params __user *params, size_t params_size,
> +			   void __user *result_buffer, size_t result_buf_size);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index fcdca8c7d30a..9e38f611ab56 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
>  __SYSCALL(__NR_faccessat2, sys_faccessat2)
>  #define __NR_watch_mount 440
>  __SYSCALL(__NR_watch_mount, sys_watch_mount)
> +#define __NR_fsinfo 441
> +__SYSCALL(__NR_fsinfo, sys_fsinfo)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 441
> +#define __NR_syscalls 442
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> new file mode 100644
> index 000000000000..65892239ba86
> --- /dev/null
> +++ b/include/uapi/linux/fsinfo.h
> @@ -0,0 +1,189 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* fsinfo() definitions.
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#ifndef _UAPI_LINUX_FSINFO_H
> +#define _UAPI_LINUX_FSINFO_H
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +#include <linux/openat2.h>
> +
> +/*
> + * The filesystem attributes that can be requested.  Note that some attributes
> + * may have multiple instances which can be switched in the parameter block.
> + */
> +#define FSINFO_ATTR_STATFS		0x00	/* statfs()-style state */
> +#define FSINFO_ATTR_IDS			0x01	/* Filesystem IDs */
> +#define FSINFO_ATTR_LIMITS		0x02	/* Filesystem limits */
> +#define FSINFO_ATTR_SUPPORTS		0x03	/* What's supported in statx, iocflags, ... */
> +#define FSINFO_ATTR_TIMESTAMP_INFO	0x04	/* Inode timestamp info */
> +#define FSINFO_ATTR_VOLUME_ID		0x05	/* Volume ID (string) */
> +#define FSINFO_ATTR_VOLUME_UUID		0x06	/* Volume UUID (LE uuid) */
> +#define FSINFO_ATTR_VOLUME_NAME		0x07	/* Volume name (string) */
> +
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */


I think it would make sense to move the actual attributes to a separate patch
and leave this just being the infrastructure.

> +
> +/*
> + * Optional fsinfo() parameter structure.
> + *
> + * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
> + * desired.
> + */
> +struct fsinfo_params {
> +	__u64	resolve_flags;	/* RESOLVE_* flags */
> +	__u32	at_flags;	/* AT_* flags */
> +	__u32	flags;		/* Flags controlling fsinfo() specifically */
> +#define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
> +#define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
> +#define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
> +	__u32	request;	/* ID of requested attribute */
> +	__u32	Nth;		/* Instance of it (some may have multiple) */
> +	__u32	Mth;		/* Subinstance of Nth instance */
> +};
> +
> +enum fsinfo_value_type {
> +	FSINFO_TYPE_VSTRUCT	= 0,	/* Version-lengthed struct (up to 4096 bytes) */
> +	FSINFO_TYPE_STRING	= 1,	/* NUL-term var-length string (up to 4095 chars) */
> +	FSINFO_TYPE_OPAQUE	= 2,	/* Opaque blob (unlimited size) */
> +	FSINFO_TYPE_LIST	= 3,	/* List of ints/structs (unlimited size) */
> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
> + *
> + * This gives information about the attributes supported by fsinfo for the
> + * given path.
> + */
> +struct fsinfo_attribute_info {
> +	unsigned int		attr_id;	/* The ID of the attribute */
> +	enum fsinfo_value_type	type;		/* The type of the attribute's value(s) */
> +	unsigned int		flags;
> +#define FSINFO_FLAGS_N		0x01		/* - Attr has a set of values */
> +#define FSINFO_FLAGS_NM		0x02		/* - Attr has a set of sets of values */
> +	unsigned int		size;		/* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
> +};
> +
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
> +#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
> +
> +struct fsinfo_u128 {
> +#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +	__u64	hi;
> +	__u64	lo;
> +#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
> +	__u64	lo;
> +	__u64	hi;
> +#endif

Shouldn't this belong in <linux/types.h>?

> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_STATFS).
> + * - This gives extended filesystem information.
> + */
> +struct fsinfo_statfs {
> +	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
> +	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
> +	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
> +	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
> +	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
> +	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */


Is there a reason these are 128 wide fields?  Are we approaching the limits of
64bits?

> +	__u64	f_bsize;		/* Optimal block size */
> +	__u64	f_frsize;		/* Fragment size */
> +};
> +
> +#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_IDS).
> + *
> + * List of basic identifiers as is normally found in statfs().
> + */
> +struct fsinfo_ids {
> +	char	f_fs_name[15 + 1];	/* Filesystem name */
> +	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
> +	__u64	f_sb_id;		/* Internal superblock ID for sbnotify()/mntnotify() */
> +	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
> +	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
> +	__u32	f_dev_minor;
> +	__u32	__padding[1];
> +};
> +
> +#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
> + *
> + * List of supported filesystem limits.
> + */
> +struct fsinfo_limits {
> +	struct fsinfo_u128 max_file_size;	/* Maximum file size */
> +	struct fsinfo_u128 max_ino;		/* Maximum inode number */

Again, what's the reason.  AFACT we are not yet worried about overflowing 64
bits.  Future proofing is good, but there has to be some rules and reasons
behind the decisions.

BTW, having all-string attributes (which I have advocated in the past) would
avoid having to worry about field widths.

> +	__u64	max_uid;			/* Maximum UID supported */
> +	__u64	max_gid;			/* Maximum GID supported */
> +	__u64	max_projid;			/* Maximum project ID supported */
> +	__u64	max_hard_links;			/* Maximum number of hard links on a file */
> +	__u64	max_xattr_body_len;		/* Maximum xattr content length */
> +	__u32	max_xattr_name_len;		/* Maximum xattr name length */
> +	__u32	max_filename_len;		/* Maximum filename length */
> +	__u32	max_symlink_len;		/* Maximum symlink content length */
> +	__u32	max_dev_major;			/* Maximum device major representable */
> +	__u32	max_dev_minor;			/* Maximum device minor representable */
> +	__u32	__padding[1];
> +};
> +
> +#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
> + *
> + * What's supported in various masks, such as statx() attribute and mask bits
> + * and IOC flags.
> + */
> +struct fsinfo_supports {
> +	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
> +	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
> +	__u32	fs_ioc_getflags;	/* What FS_IOC_GETFLAGS may return */
> +	__u32	fs_ioc_setflags_set;	/* What FS_IOC_SETFLAGS may set */
> +	__u32	fs_ioc_setflags_clear;	/* What FS_IOC_SETFLAGS may clear */
> +	__u32	fs_ioc_fsgetxattr_xflags; /* What FS_IOC_FSGETXATTR[A] may return in fsx_xflags */
> +	__u32	fs_ioc_fssetxattr_xflags_set; /* What FS_IOC_FSSETXATTR may set in fsx_xflags */
> +	__u32	fs_ioc_fssetxattr_xflags_clear; /* What FS_IOC_FSSETXATTR may set in fsx_xflags */
> +	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
> +};
> +
> +#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
> +
> +struct fsinfo_timestamp_one {
> +	__s64	minimum;	/* Minimum timestamp value in seconds */
> +	__s64	maximum;	/* Maximum timestamp value in seconds */
> +	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
> +	__s8	gran_exponent;
> +	__u8	__padding[5];
> +};
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
> + */
> +struct fsinfo_timestamp_info {
> +	struct fsinfo_timestamp_one	atime;	/* Access time */
> +	struct fsinfo_timestamp_one	mtime;	/* Modification time */
> +	struct fsinfo_timestamp_one	ctime;	/* Change time */
> +	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
> +};
> +
> +#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
> +
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
> + */
> +struct fsinfo_volume_uuid {
> +	__u8	uuid[16];
> +};
> +
> +#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
> +
> +#endif /* _UAPI_LINUX_FSINFO_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 3e1c5c9d2efe..f72a9e4ddc9a 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
>  COND_SYSCALL(io_uring_setup);
>  COND_SYSCALL(io_uring_enter);
>  COND_SYSCALL(io_uring_register);
> +COND_SYSCALL(fsinfo);
>  
>  /* fs/xattr.c */
>  
> diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
> index 00b6824f9237..d63af5106fc2 100644
> --- a/samples/vfs/Makefile
> +++ b/samples/vfs/Makefile
> @@ -1,5 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -userprogs := test-fsmount test-statx
> +userprogs := test-fsinfo test-fsmount test-statx
>  always-y := $(userprogs)
>  
>  userccflags += -I usr/include
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> new file mode 100644
> index 000000000000..934b25399ffe
> --- /dev/null
> +++ b/samples/vfs/test-fsinfo.c
> @@ -0,0 +1,646 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Test the fsinfo() system call
> + *
> + * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#define _GNU_SOURCE
> +#define _ATFILE_SOURCE
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <math.h>
> +#include <fcntl.h>
> +#include <sys/syscall.h>
> +#include <linux/fsinfo.h>
> +#include <linux/socket.h>
> +#include <sys/stat.h>
> +#include <arpa/inet.h>
> +
> +#ifndef __NR_fsinfo
> +#define __NR_fsinfo -1
> +#endif
> +
> +static bool debug = 0;
> +static bool list_last;
> +
> +static __attribute__((unused))
> +ssize_t fsinfo(int dfd, const char *filename,
> +	       struct fsinfo_params *params, size_t params_size,
> +	       void *result_buffer, size_t result_buf_size)
> +{
> +	return syscall(__NR_fsinfo, dfd, filename,
> +		       params, params_size,
> +		       result_buffer, result_buf_size);
> +}
> +
> +struct fsinfo_attribute {
> +	unsigned int		attr_id;
> +	enum fsinfo_value_type	type;
> +	unsigned int		size;
> +	const char		*name;
> +	void (*dump)(void *reply, unsigned int size);
> +};
> +
> +static const struct fsinfo_attribute fsinfo_attributes[];
> +
> +static ssize_t get_fsinfo(const char *, const char *, struct fsinfo_params *, void **);
> +
> +static void dump_hex(FILE *f, unsigned char *data, int from, int to)
> +{
> +	unsigned offset, col = 0;
> +	bool print_offset = true;
> +
> +	for (offset = from; offset < to; offset++) {
> +		if (print_offset) {
> +			fprintf(f, "%04x: ", offset);
> +			print_offset = 0;
> +		}
> +		fprintf(f, "%02x", data[offset]);
> +		col++;
> +		if ((col & 3) == 0) {
> +			if ((col & 15) == 0) {
> +				fprintf(f, "\n");
> +				print_offset = 1;
> +			} else {
> +				fprintf(f, " ");
> +			}
> +		}
> +	}
> +
> +	if (!print_offset)
> +		fprintf(f, "\n");
> +}
> +
> +static void dump_attribute_info(void *reply, unsigned int size)
> +{
> +	struct fsinfo_attribute_info *attr_info = reply;
> +	const struct fsinfo_attribute *attr;
> +	char type[32], val_size[32];
> +
> +	switch (attr_info->type) {
> +	case FSINFO_TYPE_VSTRUCT:	strcpy(type, "V-STRUCT");	break;
> +	case FSINFO_TYPE_STRING:	strcpy(type, "STRING");		break;
> +	case FSINFO_TYPE_OPAQUE:	strcpy(type, "OPAQUE");		break;
> +	case FSINFO_TYPE_LIST:		strcpy(type, "LIST");		break;
> +	default:
> +		sprintf(type, "type-%x", attr_info->type);
> +		break;
> +	}
> +
> +	if (attr_info->flags & FSINFO_FLAGS_N)
> +		strcat(type, " x N");
> +	else if (attr_info->flags & FSINFO_FLAGS_NM)
> +		strcat(type, " x NM");
> +
> +	for (attr = fsinfo_attributes; attr->name; attr++)
> +		if (attr->attr_id == attr_info->attr_id)
> +			break;
> +
> +	if (attr_info->size)
> +		sprintf(val_size, "%u", attr_info->size);
> +	else
> +		strcpy(val_size, "-");
> +
> +	printf("%8x %-12s %08x %5s %s\n",
> +	       attr_info->attr_id,
> +	       type,
> +	       attr_info->flags,
> +	       val_size,
> +	       attr->name ? attr->name : "");
> +}
> +
> +static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
> +{
> +	struct fsinfo_statfs *f = reply;
> +
> +	printf("\n");
> +	printf("\tblocks       : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_blocks.lo,
> +	       (unsigned long long)f->f_bfree.lo,
> +	       (unsigned long long)f->f_bavail.lo);
> +
> +	printf("\tfiles        : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_files.lo,
> +	       (unsigned long long)f->f_ffree.lo,
> +	       (unsigned long long)f->f_favail.lo);
> +	printf("\tbsize        : %llu\n",
> +	       (unsigned long long)f->f_bsize);
> +	printf("\tfrsize       : %llu\n",
> +	       (unsigned long long)f->f_frsize);
> +}
> +
> +static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
> +{
> +	struct fsinfo_ids *f = reply;
> +
> +	printf("\n");
> +	printf("\tdev          : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
> +	printf("\tfs           : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
> +	printf("\tfsid         : %llx\n", (unsigned long long)f->f_fsid);
> +	printf("\tsbid         : %llx\n", (unsigned long long)f->f_sb_id);
> +}
> +
> +static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
> +{
> +	struct fsinfo_limits *f = reply;
> +
> +	printf("\n");
> +	printf("\tmax file size: %llx%016llx\n",
> +	       (unsigned long long)f->max_file_size.hi,
> +	       (unsigned long long)f->max_file_size.lo);
> +	printf("\tmax ino      : %llx%016llx\n",
> +	       (unsigned long long)f->max_ino.hi,
> +	       (unsigned long long)f->max_ino.lo);
> +	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
> +	       (unsigned long long)f->max_uid,
> +	       (unsigned long long)f->max_gid,
> +	       (unsigned long long)f->max_projid);
> +	printf("\tmax dev      : maj=%x min=%x\n",
> +	       f->max_dev_major, f->max_dev_minor);
> +	printf("\tmax links    : %llx\n",
> +	       (unsigned long long)f->max_hard_links);
> +	printf("\tmax xattr    : n=%x b=%llx\n",
> +	       f->max_xattr_name_len,
> +	       (unsigned long long)f->max_xattr_body_len);
> +	printf("\tmax len      : file=%x sym=%x\n",
> +	       f->max_filename_len, f->max_symlink_len);
> +}
> +
> +static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
> +{
> +	struct fsinfo_supports *f = reply;
> +
> +	printf("\n");
> +	printf("\tstx_attr     : %llx\n", (unsigned long long)f->stx_attributes);
> +	printf("\tstx_mask     : %x\n", f->stx_mask);
> +	printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
> +	       f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
> +	printf("\tfs_ioc_*xattr: fsx_xflags: get=%x set=%x clr=%x\n",
> +	       f->fs_ioc_fsgetxattr_xflags,
> +	       f->fs_ioc_fssetxattr_xflags_set,
> +	       f->fs_ioc_fssetxattr_xflags_clear);
> +	printf("\twin_fattrs   : %x\n", f->win_file_attrs);
> +}
> +
> +static void print_time(struct fsinfo_timestamp_one *t, char stamp)
> +{
> +	printf("\t%ctime       : gran=%uE%d range=%llx-%llx\n",
> +	       stamp,
> +	       t->gran_mantissa, t->gran_exponent,
> +	       (long long)t->minimum, (long long)t->maximum);
> +}
> +
> +static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
> +{
> +	struct fsinfo_timestamp_info *f = reply;
> +
> +	printf("\n");
> +	print_time(&f->atime, 'a');
> +	print_time(&f->mtime, 'm');
> +	print_time(&f->ctime, 'c');
> +	print_time(&f->btime, 'b');
> +}
> +
> +static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
> +{
> +	struct fsinfo_volume_uuid *f = reply;
> +
> +	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
> +	       "-%02x%02x%02x%02x%02x%02x\n",
> +	       f->uuid[ 0], f->uuid[ 1],
> +	       f->uuid[ 2], f->uuid[ 3],
> +	       f->uuid[ 4], f->uuid[ 5],
> +	       f->uuid[ 6], f->uuid[ 7],
> +	       f->uuid[ 8], f->uuid[ 9],
> +	       f->uuid[10], f->uuid[11],
> +	       f->uuid[12], f->uuid[13],
> +	       f->uuid[14], f->uuid[15]);
> +}
> +
> +static void dump_string(void *reply, unsigned int size)
> +{
> +	char *s = reply, *p;
> +	bool nl = false, last_nl = false;
> +
> +	p = s;
> +	if (size >= 4096) {
> +		size = 4096;
> +		p[4092] = '.';
> +		p[4093] = '.';
> +		p[4094] = '.';
> +		p[4095] = 0;
> +	} else {
> +		p[size] = 0;
> +	}
> +
> +	for (p = s; *p; p++) {
> +		if (*p == '\n') {
> +			last_nl = nl = true;
> +			continue;
> +		}
> +		last_nl = false;
> +		if (!isprint(*p) && *p != '\t')
> +			*p = '?';
> +	}
> +
> +	if (nl)
> +		putchar('\n');
> +	printf("%s", s);
> +	if (!last_nl)
> +		putchar('\n');
> +}
> +
> +#define dump_fsinfo_meta_attribute_info		(void *)0x123
> +#define dump_fsinfo_meta_attributes		(void *)0x123
> +
> +/*
> + *
> + */
> +#define __FSINFO(A, T, S, G, F, N)					\
> +	{ .attr_id = A, .type = T, .size = S, .name = N, .dump = dump_##G }
> +
> +#define _FSINFO(A,T,S,G,N)	__FSINFO(A, T, S, G, 0, N)
> +#define _FSINFO_N(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_N, N)
> +#define _FSINFO_NM(A,T,S,G,N)	__FSINFO(A, T, S, G, FSINFO_FLAGS_NM, N)
> +
> +#define _FSINFO_VSTRUCT(A,S,G,N)    _FSINFO   (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
> +#define _FSINFO_VSTRUCT_N(A,S,G,N)  _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
> +#define _FSINFO_VSTRUCT_NM(A,S,G,N) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
> +
> +#define FSINFO_VSTRUCT(A,G)	_FSINFO_VSTRUCT   (A, A##__STRUCT, G, #A)
> +#define FSINFO_VSTRUCT_N(A,G)	_FSINFO_VSTRUCT_N (A, A##__STRUCT, G, #A)
> +#define FSINFO_VSTRUCT_NM(A,G)	_FSINFO_VSTRUCT_NM(A, A##__STRUCT, G, #A)
> +#define FSINFO_STRING(A,G)	_FSINFO   (A, FSINFO_TYPE_STRING, 0, G, #A)
> +#define FSINFO_STRING_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_STRING, 0, G, #A)
> +#define FSINFO_STRING_NM(A,G)	_FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G, #A)
> +#define FSINFO_OPAQUE(A,G)	_FSINFO   (A, FSINFO_TYPE_OPAQUE, 0, G, #A)
> +#define FSINFO_LIST(A,G)	_FSINFO   (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
> +#define FSINFO_LIST_N(A,G)	_FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
> +
> +static const struct fsinfo_attribute fsinfo_attributes[] = {
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_STATFS,		fsinfo_generic_statfs),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_IDS,		fsinfo_generic_ids),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_LIMITS,		fsinfo_generic_limits),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_SUPPORTS,		fsinfo_generic_supports),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_TIMESTAMP_INFO,	fsinfo_generic_timestamp_info),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_ID,		string),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_VOLUME_UUID,	fsinfo_generic_volume_uuid),
> +	FSINFO_STRING	(FSINFO_ATTR_VOLUME_NAME,	string),
> +	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
> +	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
> +	{}
> +};
> +
> +static __attribute__((noreturn))
> +void bad_value(const char *what,
> +	       struct fsinfo_params *params,
> +	       const struct fsinfo_attribute *attr,
> +	       const struct fsinfo_attribute_info *attr_info,
> +	       void *reply, unsigned int size)
> +{
> +	printf("\n");
> +	fprintf(stderr, "%s %s{%u}{%u} t=%x f=%x s=%x\n",
> +		what, attr->name, params->Nth, params->Mth,
> +		attr_info->type, attr_info->flags, attr_info->size);
> +	fprintf(stderr, "size=%u\n", size);
> +	dump_hex(stderr, reply, 0, size);
> +	exit(1);
> +}
> +
> +static void dump_value(unsigned int attr_id,
> +		       const struct fsinfo_attribute *attr,
> +		       const struct fsinfo_attribute_info *attr_info,
> +		       void *reply, unsigned int size)
> +{
> +	if (!attr || !attr->dump) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
> +		printf("<short data %u/%u>\n", size, attr->size);
> +		return;
> +	}
> +
> +	attr->dump(reply, size);
> +}
> +
> +static void dump_list(unsigned int attr_id,
> +		      const struct fsinfo_attribute *attr,
> +		      const struct fsinfo_attribute_info *attr_info,
> +		      void *reply, unsigned int size)
> +{
> +	size_t elem_size = attr_info->size;
> +	unsigned int ix = 0;
> +
> +	printf("\n");
> +	if (!attr || !attr->dump) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
> +		printf("<short data %u/%u>\n", size, attr->size);
> +		return;
> +	}
> +
> +	list_last = false;
> +	while (size >= elem_size) {
> +		printf("\t[%02x] ", ix);
> +		if (size == elem_size)
> +			list_last = true;
> +		attr->dump(reply, size);
> +		reply += elem_size;
> +		size -= elem_size;
> +		ix++;
> +	}
> +}
> +
> +/*
> + * Call fsinfo, expanding the buffer as necessary.
> + */
> +static ssize_t get_fsinfo(const char *file, const char *name,
> +			  struct fsinfo_params *params, void **_r)
> +{
> +	ssize_t ret;
> +	size_t buf_size = 4096;
> +	void *r;
> +
> +	for (;;) {
> +		r = malloc(buf_size);
> +		if (!r) {
> +			perror("malloc");
> +			exit(1);
> +		}
> +		memset(r, 0xbd, buf_size);
> +
> +		errno = 0;
> +		ret = fsinfo(AT_FDCWD, file, params, sizeof(*params), r, buf_size - 1);
> +		if (ret == -1)
> +			goto error;
> +
> +		if (ret <= buf_size - 1)
> +			break;
> +		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
> +	}
> +
> +	if (debug)
> +		printf("fsinfo(%s,%s,%u,%u) = %zd\n",
> +		       file, name, params->Nth, params->Mth, ret);
> +
> +	((char *)r)[ret] = 0;
> +	*_r = r;
> +	return ret;
> +
> +error:
> +	*_r = NULL;
> +	free(r);
> +	if (debug)
> +		printf("fsinfo(%s,%s,%u,%u) = %m\n",
> +		       file, name, params->Nth, params->Mth);
> +	return ret;
> +}
> +
> +/*
> + * Try one subinstance of an attribute.
> + */
> +static int try_one(const char *file, struct fsinfo_params *params,
> +		   const struct fsinfo_attribute_info *attr_info, bool raw)
> +{
> +	const struct fsinfo_attribute *attr;
> +	const char *name;
> +	size_t size = 4096;
> +	char namebuf[32];
> +	void *r;
> +
> +	for (attr = fsinfo_attributes; attr->name; attr++) {
> +		if (attr->attr_id == params->request) {
> +			name = attr->name;
> +			if (strncmp(name, "fsinfo_generic_", 15) == 0)
> +				name += 15;
> +			goto found;
> +		}
> +	}
> +
> +	sprintf(namebuf, "<unknown-%x>", params->request);
> +	name = namebuf;
> +	attr = NULL;
> +
> +found:
> +	size = get_fsinfo(file, name, params, &r);
> +
> +	if (size == -1) {
> +		if (errno == ENODATA) {
> +			if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
> +			    params->Nth == 0 && params->Mth == 0)
> +				bad_value("Unexpected ENODATA",
> +					  params, attr, attr_info, r, size);
> +			free(r);
> +			return (params->Mth == 0) ? 2 : 1;
> +		}
> +		if (errno == EOPNOTSUPP) {
> +			if (params->Nth > 0 || params->Mth > 0)
> +				bad_value("Should return ENODATA",
> +					  params, attr, attr_info, r, size);
> +			//printf("\e[33m%s\e[m: <not supported>\n",
> +			//       fsinfo_attr_names[attr]);
> +			free(r);
> +			return 2;
> +		}
> +		perror(file);
> +		exit(1);
> +	}
> +
> +	if (raw) {
> +		if (size > 4096)
> +			size = 4096;
> +		dump_hex(stdout, r, 0, size);
> +		free(r);
> +		return 0;
> +	}
> +
> +	switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
> +	case 0:
> +		printf("\e[33m%s\e[m: ", name);
> +		break;
> +	case FSINFO_FLAGS_N:
> +		printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
> +		break;
> +	case FSINFO_FLAGS_NM:
> +		printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
> +		break;
> +	}
> +
> +	switch (attr_info->type) {
> +	case FSINFO_TYPE_STRING:
> +		if (size == 0 || ((char *)r)[size - 1] != 0)
> +			bad_value("Unterminated string",
> +				  params, attr, attr_info, r, size);
> +	case FSINFO_TYPE_VSTRUCT:
> +	case FSINFO_TYPE_OPAQUE:
> +		dump_value(params->request, attr, attr_info, r, size);
> +		free(r);
> +		return 0;
> +
> +	case FSINFO_TYPE_LIST:
> +		dump_list(params->request, attr, attr_info, r, size);
> +		free(r);
> +		return 0;
> +
> +	default:
> +		bad_value("Fishy type", params, attr, attr_info, r, size);
> +	}
> +}
> +
> +static int cmp_u32(const void *a, const void *b)
> +{
> +	return *(const int *)a - *(const int *)b;
> +}
> +
> +/*
> + *
> + */
> +int main(int argc, char **argv)
> +{
> +	struct fsinfo_attribute_info attr_info;
> +	struct fsinfo_params params = {
> +		.at_flags	= AT_SYMLINK_NOFOLLOW,
> +		.flags		= FSINFO_FLAGS_QUERY_PATH,
> +	};
> +	unsigned int *attrs, ret, nr, i;
> +	bool meta = false;
> +	int raw = 0, opt, Nth, Mth;
> +
> +	while ((opt = getopt(argc, argv, "Madlr"))) {
> +		switch (opt) {
> +		case 'M':
> +			meta = true;
> +			continue;
> +		case 'a':
> +			params.at_flags |= AT_NO_AUTOMOUNT;
> +			params.flags = FSINFO_FLAGS_QUERY_PATH;
> +			continue;
> +		case 'd':
> +			debug = true;
> +			continue;
> +		case 'l':
> +			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
> +			params.flags = FSINFO_FLAGS_QUERY_PATH;
> +			continue;
> +		case 'r':
> +			raw = 1;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	argc -= optind;
> +	argv += optind;
> +
> +	if (argc != 1) {
> +		printf("Format: test-fsinfo [-Madlr] <path>\n");
> +		exit(2);
> +	}
> +
> +	/* Retrieve a list of supported attribute IDs */
> +	params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
> +	params.Nth = 0;
> +	params.Mth = 0;
> +	ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
> +	if (ret == -1) {
> +		fprintf(stderr, "Unable to get attribute list: %m\n");
> +		exit(1);
> +	}
> +
> +	if (ret % sizeof(attrs[0])) {
> +		fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
> +		exit(2);
> +	}
> +
> +	nr = ret / sizeof(attrs[0]);
> +	qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
> +
> +	if (meta) {
> +		printf("ATTR ID  TYPE         FLAGS    SIZE  NAME\n");
> +		printf("======== ============ ======== ===== =========\n");
> +		for (i = 0; i < nr; i++) {
> +			params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
> +			params.Nth = attrs[i];
> +			params.Mth = 0;
> +			ret = fsinfo(AT_FDCWD, argv[0],
> +				     &params, sizeof(params),
> +				     &attr_info, sizeof(attr_info));
> +			if (ret == -1) {
> +				fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
> +				exit(1);
> +			}
> +
> +			dump_attribute_info(&attr_info, ret);
> +		}
> +		exit(0);
> +	}
> +
> +	for (i = 0; i < nr; i++) {
> +		params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
> +		params.Nth = attrs[i];
> +		params.Mth = 0;
> +		ret = fsinfo(AT_FDCWD, argv[0],
> +			     &params, sizeof(params),
> +			     &attr_info, sizeof(attr_info));
> +		if (ret == -1) {
> +			fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
> +			exit(1);
> +		}
> +
> +		if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
> +		    attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
> +			continue;
> +
> +		if (attrs[i] != attr_info.attr_id) {
> +			fprintf(stderr, "ID for %03x returned %03x\n",
> +				attrs[i], attr_info.attr_id);
> +			break;
> +		}
> +		Nth = 0;
> +		do {
> +			Mth = 0;
> +			do {
> +				params.request = attrs[i];
> +				params.Nth = Nth;
> +				params.Mth = Mth;
> +
> +				switch (try_one(argv[0], &params, &attr_info, raw)) {
> +				case 0:
> +					continue;
> +				case 1:
> +					goto done_M;
> +				case 2:
> +					goto done_N;
> +				}
> +			} while (++Mth < 100);
> +
> +		done_M:
> +			if (Mth >= 100) {
> +				fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
> +				break;
> +			}
> +
> +		} while (++Nth < 100);
> +
> +	done_N:
> +		if (Nth >= 100) {
> +			fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> 
> 

^ permalink raw reply

* Re: [PATCH 05/18] fsinfo: Allow fsinfo() to look up a mount object by ID [ver #21]
From: Miklos Szeredi @ 2020-08-04 10:33 UTC (permalink / raw)
  To: David Howells
  Cc: viro, torvalds, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646182832.1784947.10440560370840683639.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:37:08PM +0100, David Howells wrote:
> Allow the fsinfo() syscall to look up a mount object by ID rather than by
> pathname.  This is necessary as there can be multiple mounts stacked up at
> the same pathname and there's no way to look through them otherwise.
> 
> This is done by passing FSINFO_FLAGS_QUERY_MOUNT to fsinfo() in the
> parameters and then passing the mount ID as a string to fsinfo() in place
> of the filename:
> 
> 	struct fsinfo_params params = {
> 		.flags	 = FSINFO_FLAGS_QUERY_MOUNT,
> 		.request = FSINFO_ATTR_IDS,
> 	};
> 
> 	ret = fsinfo(AT_FDCWD, "21", &params, buffer, sizeof(buffer));
> 
> The caller is only permitted to query a mount object if the root directory
> of that mount connects directly to the current chroot if dfd == AT_FDCWD[*]
> or the directory specified by dfd otherwise.  Note that this is not
> available to the pathwalk of any other syscall.
> 
> [*] This needs to be something other than AT_FDCWD, perhaps AT_FDROOT.
> 
> [!] This probably needs an LSM hook.
> 
> [!] This might want to check the permissions on all the intervening dirs -
>     but it would have to do that under RCU conditions.
> 
> [!] This might want to check a CAP_* flag.

Was this reviewed by security folks?

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/fsinfo.c                 |   53 +++++++++++++++++++
>  fs/internal.h               |    1 
>  fs/namespace.c              |  117 ++++++++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/fsinfo.h |    1 
>  samples/vfs/test-fsinfo.c   |    7 ++-
>  5 files changed, 175 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index aef7a736e8fc..8ccbcddb4f16 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -563,6 +563,56 @@ static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
>  	return ret;
>  }
>  
> +/*
> + * Look up the root of a mount object.  This allows access to mount objects
> + * (and their attached superblocks) that can't be retrieved by path because
> + * they're entirely covered.
> + *
> + * We only permit access to a mount that has a direct path between either the
> + * dentry pointed to by dfd or to our chroot (if dfd is AT_FDCWD).
> + */
> +static int vfs_fsinfo_mount(int dfd, const char __user *filename,
> +			    struct fsinfo_context *ctx)
> +{
> +	struct path path;
> +	struct fd f = {};
> +	char *name;
> +	unsigned long mnt_id;
> +	int ret;
> +
> +	if (!filename)
> +		return -EINVAL;
> +
> +	name = strndup_user(filename, 32);
> +	if (IS_ERR(name))
> +		return PTR_ERR(name);
> +	ret = kstrtoul(name, 0, &mnt_id);
> +	if (ret < 0)
> +		goto out_name;
> +	if (mnt_id > INT_MAX)
> +		goto out_name;
> +
> +	if (dfd != AT_FDCWD) {
> +		ret = -EBADF;
> +		f = fdget_raw(dfd);
> +		if (!f.file)
> +			goto out_name;
> +	}
> +
> +	ret = lookup_mount_object(f.file ? &f.file->f_path : NULL,
> +				  mnt_id, &path);
> +	if (ret < 0)
> +		goto out_fd;
> +
> +	ret = vfs_fsinfo(&path, ctx);
> +	path_put(&path);
> +out_fd:
> +	fdput(f);
> +out_name:
> +	kfree(name);
> +	return ret;
> +}
> +
>  /**
>   * sys_fsinfo - System call to get filesystem information
>   * @dfd: Base directory to pathwalk from or fd referring to filesystem.
> @@ -636,6 +686,9 @@ SYSCALL_DEFINE6(fsinfo,
>  			return -EINVAL;
>  		ret = vfs_fsinfo_fd(dfd, &ctx);
>  		break;
> +	case FSINFO_FLAGS_QUERY_MOUNT:
> +		ret = vfs_fsinfo_mount(dfd, pathname, &ctx);
> +		break;
>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/fs/internal.h b/fs/internal.h
> index 0b57da498f06..84bbb743a5ac 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -89,6 +89,7 @@ extern int __mnt_want_write_file(struct file *);
>  extern void __mnt_drop_write_file(struct file *);
>  
>  extern void dissolve_on_fput(struct vfsmount *);
> +extern int lookup_mount_object(struct path *, unsigned int, struct path *);
>  extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
>  
>  /*
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ead8d1a16610..b2b9920ffd3c 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -64,7 +64,7 @@ static int __init set_mphash_entries(char *str)
>  __setup("mphash_entries=", set_mphash_entries);
>  
>  static u64 event;
> -static DEFINE_IDA(mnt_id_ida);
> +static DEFINE_IDR(mnt_id_ida);
>  static DEFINE_IDA(mnt_group_ida);
>  
>  static struct hlist_head *mount_hashtable __read_mostly;
> @@ -105,17 +105,27 @@ static inline struct hlist_head *mp_hash(struct dentry *dentry)
>  
>  static int mnt_alloc_id(struct mount *mnt)
>  {
> -	int res = ida_alloc(&mnt_id_ida, GFP_KERNEL);
> +	int res;
>  
> +	/* Allocate an ID, but don't set the pointer back to the mount until
> +	 * later, as once we do that, we have to follow RCU protocols to get
> +	 * rid of the mount struct.
> +	 */
> +	res = idr_alloc(&mnt_id_ida, NULL, 0, INT_MAX, GFP_KERNEL);

This needs to be a separate patch.

>  	if (res < 0)
>  		return res;
>  	mnt->mnt_id = res;
>  	return 0;
>  }
>  
> +static void mnt_publish_id(struct mount *mnt)
> +{
> +	idr_replace(&mnt_id_ida, mnt, mnt->mnt_id);
> +}
> +
>  static void mnt_free_id(struct mount *mnt)
>  {
> -	ida_free(&mnt_id_ida, mnt->mnt_id);
> +	idr_remove(&mnt_id_ida, mnt->mnt_id);
>  }
>  
>  /*
> @@ -975,6 +985,7 @@ struct vfsmount *vfs_create_mount(struct fs_context *fc)
>  	lock_mount_hash();
>  	list_add_tail(&mnt->mnt_instance, &mnt->mnt.mnt_sb->s_mounts);
>  	unlock_mount_hash();
> +	mnt_publish_id(mnt);
>  	return &mnt->mnt;
>  }
>  EXPORT_SYMBOL(vfs_create_mount);
> @@ -1068,6 +1079,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
>  	lock_mount_hash();
>  	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
>  	unlock_mount_hash();
> +	mnt_publish_id(mnt);
>  
>  	if ((flag & CL_SLAVE) ||
>  	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
> @@ -4151,4 +4163,103 @@ int fsinfo_generic_mount_source(struct path *path, struct fsinfo_context *ctx)
>  	return m.count + 1;
>  }
>  
> +/*
> + * See if one path point connects directly to another by ancestral relationship
> + * across mountpoints.  Must call with the RCU read lock held.
> + */
> +static bool are_paths_connected(struct path *ancestor, struct path *to_check)
> +{
> +	struct mount *mnt, *parent;
> +	struct path cursor;
> +	unsigned seq;
> +	bool connected;
> +
> +	seq = 0;
> +restart:
> +	cursor = *to_check;
> +
> +	read_seqbegin_or_lock(&rename_lock, &seq);
> +	while (cursor.mnt != ancestor->mnt) {
> +		mnt = real_mount(cursor.mnt);
> +		parent = READ_ONCE(mnt->mnt_parent);
> +		if (mnt == parent)
> +			goto failed;
> +		cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
> +		cursor.mnt = &parent->mnt;
> +	}
> +
> +	while (cursor.dentry != ancestor->dentry) {
> +		if (cursor.dentry == cursor.mnt->mnt_root ||
> +		    IS_ROOT(cursor.dentry))
> +			goto failed;
> +		cursor.dentry = READ_ONCE(cursor.dentry->d_parent);
> +	}
> +
> +	connected = true;
> +out:
> +	done_seqretry(&rename_lock, seq);
> +	return connected;
> +
> +failed:
> +	if (need_seqretry(&rename_lock, seq)) {
> +		seq = 1;
> +		goto restart;
> +	}
> +	connected = false;
> +	goto out;
> +}
> +
> +/**
> + * lookup_mount_object - Look up a vfsmount object by ID
> + * @root: The mount root must connect backwards to this point (or chroot if NULL).
> + * @id: The ID of the mountpoint.
> + * @_mntpt: Where to return the resulting mountpoint path.
> + *
> + * Look up the root of the mount with the corresponding ID.  This is only
> + * permitted if that mount connects directly to the specified root/chroot.
> + */
> +int lookup_mount_object(struct path *root, unsigned int mnt_id, struct path *_mntpt)
> +{
> +	struct mount *mnt;
> +	struct path stop, mntpt = {};
> +	int ret = -EPERM;
> +
> +	if (!root)
> +		get_fs_root(current->fs, &stop);
> +	else
> +		stop = *root;
> +
> +	rcu_read_lock();
> +	lock_mount_hash();
> +	mnt = idr_find(&mnt_id_ida, mnt_id);
> +	if (!mnt)
> +		goto out_unlock_mh;
> +	if (mnt->mnt.mnt_flags & (MNT_SYNC_UMOUNT | MNT_UMOUNT | MNT_DOOMED))
> +		goto out_unlock_mh;
> +	if (mnt_get_count(mnt) == 0)
> +		goto out_unlock_mh;
> +	mnt_add_count(mnt, 1);
> +	mntpt.mnt = &mnt->mnt;
> +	mntpt.dentry = dget(mnt->mnt.mnt_root);
> +	unlock_mount_hash();
> +
> +	if (are_paths_connected(&stop, &mntpt)) {
> +		*_mntpt = mntpt;
> +		mntpt.mnt = NULL;
> +		mntpt.dentry = NULL;
> +		ret = 0;
> +	}
> +
> +out_unlock:
> +	rcu_read_unlock();
> +	if (!root)
> +		path_put(&stop);
> +	path_put(&mntpt);
> +	return ret;
> +
> +out_unlock_mh:
> +	unlock_mount_hash();
> +	goto out_unlock;
> +}
> +
>  #endif /* CONFIG_FSINFO */
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> index a27e92b68266..d24e47762a07 100644
> --- a/include/uapi/linux/fsinfo.h
> +++ b/include/uapi/linux/fsinfo.h
> @@ -44,6 +44,7 @@ struct fsinfo_params {
>  #define FSINFO_FLAGS_QUERY_MASK	0x0007 /* What object should fsinfo() query? */
>  #define FSINFO_FLAGS_QUERY_PATH	0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
>  #define FSINFO_FLAGS_QUERY_FD	0x0001 /* - fd specified by dirfd */
> +#define FSINFO_FLAGS_QUERY_MOUNT 0x0002	/* - mount object (path=>mount_id, dirfd=>subtree) */
>  	__u32	request;	/* ID of requested attribute */
>  	__u32	Nth;		/* Instance of it (some may have multiple) */
>  	__u32	Mth;		/* Subinstance of Nth instance */
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> index 634f30b7e67f..dfa44bba8bbd 100644
> --- a/samples/vfs/test-fsinfo.c
> +++ b/samples/vfs/test-fsinfo.c
> @@ -593,7 +593,7 @@ int main(int argc, char **argv)
>  	bool meta = false;
>  	int raw = 0, opt, Nth, Mth;
>  
> -	while ((opt = getopt(argc, argv, "Madlr"))) {
> +	while ((opt = getopt(argc, argv, "Madmlr"))) {
>  		switch (opt) {
>  		case 'M':
>  			meta = true;
> @@ -609,6 +609,10 @@ int main(int argc, char **argv)
>  			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
>  			params.flags = FSINFO_FLAGS_QUERY_PATH;
>  			continue;
> +		case 'm':
> +			params.resolve_flags = 0;
> +			params.flags = FSINFO_FLAGS_QUERY_MOUNT;
> +			continue;
>  		case 'r':
>  			raw = 1;
>  			continue;
> @@ -621,6 +625,7 @@ int main(int argc, char **argv)
>  
>  	if (argc != 1) {
>  		printf("Format: test-fsinfo [-Madlr] <path>\n");
> +		printf("Format: test-fsinfo [-Mdr] -m <mnt_id>\n");
>  		exit(2);
>  	}
>  
> 
> 

^ permalink raw reply

* Re: [PATCH 06/18] fsinfo: Add a uniquifier ID to struct mount [ver #21]
From: Miklos Szeredi @ 2020-08-04 10:41 UTC (permalink / raw)
  To: David Howells
  Cc: viro, torvalds, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646183662.1784947.5709738540440380373.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:37:16PM +0100, David Howells wrote:
> Add a uniquifier ID to struct mount that is effectively unique over the
> kernel lifetime to deal around mnt_id values being reused.  This can then
> be exported through fsinfo() to allow detection of replacement mounts that
> happen to end up with the same mount ID.
> 
> The normal mount handle is still used for referring to a particular mount.
> 
> The mount notification is then changed to convey these unique mount IDs
> rather than the mount handle.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/mount.h        |    3 +++
>  fs/mount_notify.c |    4 ++--
>  fs/namespace.c    |    3 +++
>  3 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index 85456a5f5a3a..1037781be055 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -79,6 +79,9 @@ struct mount {
>  	int mnt_expiry_mark;		/* true if marked for expiry */
>  	struct hlist_head mnt_pins;
>  	struct hlist_head mnt_stuck_children;
> +#ifdef CONFIG_FSINFO
> +	u64	mnt_unique_id;		/* ID unique over lifetime of kernel */
> +#endif

Not sure if it's worth making conditional.

>  #ifdef CONFIG_MOUNT_NOTIFICATIONS
>  	struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
>  #endif
> diff --git a/fs/mount_notify.c b/fs/mount_notify.c
> index 44f570e4cebe..d8ba66ed5f77 100644
> --- a/fs/mount_notify.c
> +++ b/fs/mount_notify.c
> @@ -90,7 +90,7 @@ void notify_mount(struct mount *trigger,
>  	n.watch.type	= WATCH_TYPE_MOUNT_NOTIFY;
>  	n.watch.subtype	= subtype;
>  	n.watch.info	= info_flags | watch_sizeof(n);
> -	n.triggered_on	= trigger->mnt_id;
> +	n.triggered_on	= trigger->mnt_unique_id;
>  
>  	switch (subtype) {
>  	case NOTIFY_MOUNT_EXPIRY:
> @@ -102,7 +102,7 @@ void notify_mount(struct mount *trigger,
>  	case NOTIFY_MOUNT_UNMOUNT:
>  	case NOTIFY_MOUNT_MOVE_FROM:
>  	case NOTIFY_MOUNT_MOVE_TO:
> -		n.auxiliary_mount	= aux->mnt_id;
> +		n.auxiliary_mount = aux->mnt_unique_id;

Hmm, so we now have two ID's:

 - one can be used to look up the mount
 - one is guaranteed to be unique

With this change the mount cannot be looked up with FSINFO_FLAGS_QUERY_MOUNT,
right?

Should we be merging the two ID's into a single one which has both properties?

>  		break;
>  
>  	default:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index b2b9920ffd3c..1db8a64cd76f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -115,6 +115,9 @@ static int mnt_alloc_id(struct mount *mnt)
>  	if (res < 0)
>  		return res;
>  	mnt->mnt_id = res;
> +#ifdef CONFIG_FSINFO
> +	mnt->mnt_unique_id = atomic64_inc_return(&vfs_unique_counter);
> +#endif
>  	return 0;
>  }
>  
> 
> 

^ permalink raw reply

* Re: [PATCH 02/18] fsinfo: Add fsinfo() syscall to query filesystem information [ver #21]
From: David Howells @ 2020-08-04 11:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, viro, linux-api, torvalds, raven, mszeredi, christian,
	jannh, darrick.wong, kzak, jlayton, linux-fsdevel,
	linux-security-module, linux-kernel
In-Reply-To: <20200804101659.GA32719@miu.piliscsaba.redhat.com>

Miklos Szeredi <miklos@szeredi.hu> wrote:

> > 		__u32	Mth;
> 
> The Mth field seems to be unused in this patchset.  Since the struct is
> extensible, I guess there's no point in adding it now.

Yeah - I was using it to index through the server address lists for network
filesystems (ie. the Mth address of the Nth server), but I've dropped the nfs
patch and made afs return an array of addresses for the Nth server since the
address list can get reordered.

Ordinarily, I'd just take it out, but I don't want to cause the patchset to
get dropped for yet another merge cycle :-/

> > +#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100	/* Information about attr N (for path) */
> > +#define FSINFO_ATTR_FSINFO_ATTRIBUTES	0x101	/* List of supported attrs (for path) */
> 
> I think it would make sense to move the actual attributes to a separate patch
> and leave this just being the infrastructure.

Maybe.  If there are no attributes, then it makes it a bit hard to test.

> > +struct fsinfo_u128 {
> ...
> 
> Shouldn't this belong in <linux/types.h>?

Maybe.  Ideally, I'd use a proper C type rather than a struct.

> Is there a reason these are 128 wide fields?  Are we approaching the limits of
> 64bits?

Dave Chinner was talking at LSF a couple of years ago, IIRC, about looking
beyond the 16 Exa limit in XFS.  I've occasionally talked to people who have
multi-Peta data sets in AFS or whatever they were using, streamed from science
experiments, so the limit isn't necessarily all *that* far off.

> > +struct fsinfo_limits {
> > +	struct fsinfo_u128 max_file_size;	/* Maximum file size */
> > +	struct fsinfo_u128 max_ino;		/* Maximum inode number */
> 
> Again, what's the reason.  AFACT we are not yet worried about overflowing 64
> bits.  Future proofing is good, but there has to be some rules and reasons
> behind the decisions.

This is cheap to do.  This information is expected to be static for the
lifetime a superblock and, for most filesystems, of the running kernel, so
simply copying it with memcpy() from rodata is going to suffice most of the
time.

But don't worry - 640K is sufficient for everyone ;-)

David


^ permalink raw reply

* Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]
From: Ian Kent @ 2020-08-04 11:38 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Linus Torvalds, Al Viro, Casey Schaufler, Stephen Smalley,
	Nicolas Dichtel, Christian Brauner, andres, Jeff Layton, dray,
	Karel Zak, keyrings, Linux API, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <CAJfpeguvLMCw1H8+DPsfZE_k0sEiRtA17pD9HjnceSsAvqqAZw@mail.gmail.com>

On Mon, 2020-08-03 at 11:29 +0200, Miklos Szeredi wrote:
> On Thu, Jul 23, 2020 at 12:48 PM David Howells <dhowells@redhat.com>
> wrote:
> 
> > > >                 __u32   topology_changes;
> > > >                 __u32   attr_changes;
> > > >                 __u32   aux_topology_changes;
> > > 
> > > Being 32bit this introduces wraparound effects.  Is that really
> > > worth it?
> > 
> > You'd have to make 2 billion changes without whoever's monitoring
> > getting a
> > chance to update their counters.  But maybe it's not worth it
> > putting them
> > here.  If you'd prefer, I can make the counters all 64-bit and just
> > retrieve
> > them with fsinfo().
> 
> Yes, I think that would be preferable.

I think this is the source of the recommendation for removing the
change counters from the notification message, correct?

While it looks like I may not need those counters for systemd message
buffer overflow handling myself I think removing them from the
notification message isn't a sensible thing to do.

If you need to detect missing messages, perhaps due to message buffer
overflow, then you need change counters that are relevant to the
notification message itself. That's so the next time you get a message
for that object you can be sure that change counter comparisons you
you make relate to object notifications you have processed.

Yes, I know it isn't quite that simple, but tallying up what you have
processed in the current batch of messages (or in multiple batches of
messages if more than one read has been possible) to perform the check
is a user space responsibility. And it simply can't be done if the
counters consistency is in question which it would be if you need to
perform another system call to get it.

It's way more useful to have these in the notification than obtainable
via fsinfo() IMHO.

> 
> > > >         n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true
> > > > indicates that
> > > >         the notifcation was generated by an event (eg. SETATTR)
> > > > that was
> > > >         applied recursively.  The notification is only
> > > > generated for the
> > > >         object that initially triggered it.
> > > 
> > > Unused in this patchset.  Please don't add things to the API
> > > which are not
> > > used.
> > 
> > Christian Brauner has patches for mount_setattr() that will need to
> > use this.
> 
> Fine, then that patch can add the flag.
> 
> Thanks,
> Miklos


^ permalink raw reply

* Re: [PATCH 06/18] fsinfo: Add a uniquifier ID to struct mount [ver #21]
From: Ian Kent @ 2020-08-04 12:32 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: viro, torvalds, mszeredi, christian, jannh, darrick.wong, kzak,
	jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <20200804104108.GC32719@miu.piliscsaba.redhat.com>

On Tue, 2020-08-04 at 12:41 +0200, Miklos Szeredi wrote:
> On Mon, Aug 03, 2020 at 02:37:16PM +0100, David Howells wrote:
> > Add a uniquifier ID to struct mount that is effectively unique over
> > the
> > kernel lifetime to deal around mnt_id values being reused.  This
> > can then
> > be exported through fsinfo() to allow detection of replacement
> > mounts that
> > happen to end up with the same mount ID.
> > 
> > The normal mount handle is still used for referring to a particular
> > mount.
> > 
> > The mount notification is then changed to convey these unique mount
> > IDs
> > rather than the mount handle.
> > 
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> > 
> >  fs/mount.h        |    3 +++
> >  fs/mount_notify.c |    4 ++--
> >  fs/namespace.c    |    3 +++
> >  3 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/mount.h b/fs/mount.h
> > index 85456a5f5a3a..1037781be055 100644
> > --- a/fs/mount.h
> > +++ b/fs/mount.h
> > @@ -79,6 +79,9 @@ struct mount {
> >  	int mnt_expiry_mark;		/* true if marked for
> > expiry */
> >  	struct hlist_head mnt_pins;
> >  	struct hlist_head mnt_stuck_children;
> > +#ifdef CONFIG_FSINFO
> > +	u64	mnt_unique_id;		/* ID unique over lifetime of
> > kernel */
> > +#endif
> 
> Not sure if it's worth making conditional.
> 
> >  #ifdef CONFIG_MOUNT_NOTIFICATIONS
> >  	struct watch_list *mnt_watchers; /* Watches on dentries within
> > this mount */
> >  #endif
> > diff --git a/fs/mount_notify.c b/fs/mount_notify.c
> > index 44f570e4cebe..d8ba66ed5f77 100644
> > --- a/fs/mount_notify.c
> > +++ b/fs/mount_notify.c
> > @@ -90,7 +90,7 @@ void notify_mount(struct mount *trigger,
> >  	n.watch.type	= WATCH_TYPE_MOUNT_NOTIFY;
> >  	n.watch.subtype	= subtype;
> >  	n.watch.info	= info_flags | watch_sizeof(n);
> > -	n.triggered_on	= trigger->mnt_id;
> > +	n.triggered_on	= trigger->mnt_unique_id;
> >  
> >  	switch (subtype) {
> >  	case NOTIFY_MOUNT_EXPIRY:
> > @@ -102,7 +102,7 @@ void notify_mount(struct mount *trigger,
> >  	case NOTIFY_MOUNT_UNMOUNT:
> >  	case NOTIFY_MOUNT_MOVE_FROM:
> >  	case NOTIFY_MOUNT_MOVE_TO:
> > -		n.auxiliary_mount	= aux->mnt_id;
> > +		n.auxiliary_mount = aux->mnt_unique_id;
> 
> Hmm, so we now have two ID's:
> 
>  - one can be used to look up the mount
>  - one is guaranteed to be unique
> 
> With this change the mount cannot be looked up with
> FSINFO_FLAGS_QUERY_MOUNT,
> right?
> 
> Should we be merging the two ID's into a single one which has both
> properties?

I'd been thinking we would probably need to change to 64 bit ids
for a while now and I thought that was what was going to happen.

We'll need to change libmount and current code but better early
on than later.

Ian

> 
> >  		break;
> >  
> >  	default:
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index b2b9920ffd3c..1db8a64cd76f 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -115,6 +115,9 @@ static int mnt_alloc_id(struct mount *mnt)
> >  	if (res < 0)
> >  		return res;
> >  	mnt->mnt_id = res;
> > +#ifdef CONFIG_FSINFO
> > +	mnt->mnt_unique_id = atomic64_inc_return(&vfs_unique_counter);
> > +#endif
> >  	return 0;
> >  }
> >  
> > 
> > 


^ permalink raw reply

* Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]
From: Miklos Szeredi @ 2020-08-04 13:19 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, Linus Torvalds, Al Viro, Casey Schaufler,
	Stephen Smalley, Nicolas Dichtel, Christian Brauner, andres,
	Jeff Layton, dray, Karel Zak, keyrings, Linux API, linux-fsdevel,
	LSM, linux-kernel
In-Reply-To: <43c061d26ddef2aa3ca1ac726da7db9ab461e7be.camel@themaw.net>

On Tue, Aug 4, 2020 at 1:39 PM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-08-03 at 11:29 +0200, Miklos Szeredi wrote:
> > On Thu, Jul 23, 2020 at 12:48 PM David Howells <dhowells@redhat.com>
> > wrote:
> >
> > > > >                 __u32   topology_changes;
> > > > >                 __u32   attr_changes;
> > > > >                 __u32   aux_topology_changes;
> > > >
> > > > Being 32bit this introduces wraparound effects.  Is that really
> > > > worth it?
> > >
> > > You'd have to make 2 billion changes without whoever's monitoring
> > > getting a
> > > chance to update their counters.  But maybe it's not worth it
> > > putting them
> > > here.  If you'd prefer, I can make the counters all 64-bit and just
> > > retrieve
> > > them with fsinfo().
> >
> > Yes, I think that would be preferable.
>
> I think this is the source of the recommendation for removing the
> change counters from the notification message, correct?
>
> While it looks like I may not need those counters for systemd message
> buffer overflow handling myself I think removing them from the
> notification message isn't a sensible thing to do.
>
> If you need to detect missing messages, perhaps due to message buffer
> overflow, then you need change counters that are relevant to the
> notification message itself. That's so the next time you get a message
> for that object you can be sure that change counter comparisons you
> you make relate to object notifications you have processed.

I don't quite get it.  Change notification is just that: a
notification.   You need to know what object that notification relates
to, to be able to retrieve the up to date attributes of said object.

What happens if you get a change counter N in the notification
message, then get a change counter N + 1 in the attribute retrieval?
You know that another change happened, and you haven't yet processed
the notification yet.  So when the notification with N + 1 comes in,
you can optimize away the attribute retrieve.

Nice optimization, but it's optimizing a race condition, and I don't
think that's warranted.  I don't see any other use for the change
counter in the notification message.


> Yes, I know it isn't quite that simple, but tallying up what you have
> processed in the current batch of messages (or in multiple batches of
> messages if more than one read has been possible) to perform the check
> is a user space responsibility. And it simply can't be done if the
> counters consistency is in question which it would be if you need to
> perform another system call to get it.
>
> It's way more useful to have these in the notification than obtainable
> via fsinfo() IMHO.

What is it useful for?

If the notification itself would contain the list of updated
attributes and their new values, then yes, this would make sense.  If
the notification just tells us that the object was modified, but not
the modifications themselves, then I don't see how the change counter
in itself could add any information (other than optimizing the race
condition above).

Thanks,
Miklos

Thanks,



>
> >
> > > > >         n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true
> > > > > indicates that
> > > > >         the notifcation was generated by an event (eg. SETATTR)
> > > > > that was
> > > > >         applied recursively.  The notification is only
> > > > > generated for the
> > > > >         object that initially triggered it.
> > > >
> > > > Unused in this patchset.  Please don't add things to the API
> > > > which are not
> > > > used.
> > >
> > > Christian Brauner has patches for mount_setattr() that will need to
> > > use this.
> >
> > Fine, then that patch can add the flag.
> >
> > Thanks,
> > Miklos
>

^ permalink raw reply

* Re: [PATCH 08/18] fsinfo: Allow mount topology and propagation info to be retrieved [ver #21]
From: Miklos Szeredi @ 2020-08-04 13:38 UTC (permalink / raw)
  To: David Howells
  Cc: viro, torvalds, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646185371.1784947.14555585307218856883.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:37:33PM +0100, David Howells wrote:
> Add a couple of attributes to allow information about the mount topology
> and propagation to be retrieved:
> 
>  (1) FSINFO_ATTR_MOUNT_TOPOLOGY.
> 
>      Information about a mount's parentage in the mount topology tree and
>      its propagation attributes.
> 
>      This has to be collected with the VFS namespace lock held, so it's
>      separate from FSINFO_ATTR_MOUNT_INFO.  The topology change counter
>      that a subsequent patch will export can be used to work out from the
>      cheaper _INFO attribute as to whether the more expensive _TOPOLOGY
>      attribute needs requerying.
> 
>      MOUNT_PROPAGATION_* flags are added to linux/mount.h for UAPI
>      consumption.  At some point a mount_setattr() system call needs to be
>      added.
> 
>  (2) FSINFO_ATTR_MOUNT_CHILDREN.
> 
>      Information about a mount's children in the mount topology tree.
> 
>      This is formatted as an array of structures, one for each child and
>      capped with one for the argument mount (checked after listing all the
>      children).  Each element contains the static IDs of the respective
>      mount object along with a sum of its change attributes.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/fsinfo.c                 |    2 +
>  fs/internal.h               |    2 +
>  fs/namespace.c              |   94 +++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fsinfo.h |   27 ++++++++++++
>  include/uapi/linux/mount.h  |   13 +++++-
>  samples/vfs/test-fsinfo.c   |   55 +++++++++++++++++++++++++
>  6 files changed, 192 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index f276857709ee..0540cce89555 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -291,9 +291,11 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
>  	FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
>  
>  	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_TOPOLOGY,	fsinfo_generic_mount_topology),
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_PATH,	fsinfo_generic_seq_read),
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT_FULL,	fsinfo_generic_mount_point_full),
> +	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
>  	{}
>  };
>  
> diff --git a/fs/internal.h b/fs/internal.h
> index a56008b7f3ec..cb5edcc7125a 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -98,8 +98,10 @@ extern void dissolve_on_fput(struct vfsmount *);
>  extern int lookup_mount_object(struct path *, unsigned int, struct path *);
>  extern int fsinfo_generic_mount_source(struct path *, struct fsinfo_context *);
>  extern int fsinfo_generic_mount_info(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_topology(struct path *, struct fsinfo_context *);
>  extern int fsinfo_generic_mount_point(struct path *, struct fsinfo_context *);
>  extern int fsinfo_generic_mount_point_full(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_context *);
>  
>  /*
>   * fs_struct.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c196af35d39d..b5c2a3b4f96d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4303,6 +4303,54 @@ int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
>  	return sizeof(*p);
>  }
>  
> +/*
> + * Retrieve information about the topology at the nominated mount and
> + * its propogation attributes.
> + */
> +int fsinfo_generic_mount_topology(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct fsinfo_mount_topology *p = ctx->buffer;
> +	struct mount *m;
> +	struct path root;
> +
> +	get_fs_root(current->fs, &root);
> +
> +	namespace_lock();
> +
> +	m = real_mount(path->mnt);
> +
> +	p->parent_id = m->mnt_parent->mnt_id;
> +
> +	if (path->mnt == root.mnt) {
> +		p->parent_id = m->mnt_id;
> +	} else {
> +		rcu_read_lock();
> +		if (!are_paths_connected(&root, path))
> +			p->parent_id = m->mnt_id;
> +		rcu_read_unlock();
> +	}
> +
> +	if (IS_MNT_SHARED(m)) {
> +		p->shared_group_id = m->mnt_group_id;
> +		p->propagation_type |= MOUNT_PROPAGATION_SHARED;
> +	} else if (IS_MNT_SLAVE(m)) {
> +		int source = m->mnt_master->mnt_group_id;
> +		int from = get_dominating_id(m, &root);
> +		p->dependent_source_id = source;
> +		if (from && from != source)
> +			p->dependent_clone_of_id = from;
> +		p->propagation_type |= MOUNT_PROPAGATION_DEPENDENT;
> +	} else if (IS_MNT_UNBINDABLE(m)) {
> +		p->propagation_type |= MOUNT_PROPAGATION_UNBINDABLE;
> +	} else {
> +		p->propagation_type |= MOUNT_PROPAGATION_PRIVATE;
> +	}
> +
> +	namespace_unlock();
> +	path_put(&root);
> +	return sizeof(*p);
> +}
> +
>  /*
>   * Return the path of this mount relative to its parent and clipped to
>   * the current chroot.
> @@ -4379,4 +4427,50 @@ int fsinfo_generic_mount_point_full(struct path *path, struct fsinfo_context *ct
>  	return (ctx->buffer + ctx->buf_size) - p;
>  }
>  
> +/*
> + * Store a mount record into the fsinfo buffer.
> + */
> +static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p,
> +			       bool is_root)
> +{
> +	struct fsinfo_mount_child record = {};
> +	unsigned int usage = ctx->usage;
> +
> +	if (ctx->usage >= INT_MAX)
> +		return;
> +	ctx->usage = usage + sizeof(record);
> +	if (!ctx->buffer || ctx->usage > ctx->buf_size)
> +		return;
> +
> +	record.mnt_unique_id	= p->mnt_unique_id;
> +	record.mnt_id		= p->mnt_id;
> +	record.parent_id	= is_root ? p->mnt_id : p->mnt_parent->mnt_id;
> +	memcpy(ctx->buffer + usage, &record, sizeof(record));
> +}
> +
> +/*
> + * Return information about the submounts relative to path.
> + */
> +int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct mount *m, *child;
> +
> +	m = real_mount(path->mnt);
> +
> +	read_seqlock_excl(&mount_lock);
> +
> +	list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> +		if (child->mnt_parent != m)
> +			continue;
> +		fsinfo_store_mount(ctx, child, false);
> +	}
> +
> +	/* End the list with a copy of the parameter mount's details so that
> +	 * userspace can quickly check for changes.
> +	 */
> +	fsinfo_store_mount(ctx, m, true);
> +	read_sequnlock_excl(&mount_lock);
> +	return ctx->usage;
> +}
> +
>  #endif /* CONFIG_FSINFO */
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> index 15ef161905cd..f0a352b7028e 100644
> --- a/include/uapi/linux/fsinfo.h
> +++ b/include/uapi/linux/fsinfo.h
> @@ -35,6 +35,8 @@
>  #define FSINFO_ATTR_MOUNT_PATH		0x201	/* Bind mount/superblock path (string) */
>  #define FSINFO_ATTR_MOUNT_POINT		0x202	/* Relative path of mount in parent (string) */
>  #define FSINFO_ATTR_MOUNT_POINT_FULL	0x203	/* Absolute path of mount (string) */
> +#define FSINFO_ATTR_MOUNT_TOPOLOGY	0x204	/* Mount object topology */
> +#define FSINFO_ATTR_MOUNT_CHILDREN	0x205	/* Children of this mount (list) */
>  
>  /*
>   * Optional fsinfo() parameter structure.
> @@ -102,6 +104,31 @@ struct fsinfo_mount_info {
>  
>  #define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
>  
> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_MOUNT_TOPOLOGY).
> + */
> +struct fsinfo_mount_topology {
> +	__u32	parent_id;		/* Parent mount identifier */

Again, which mount ID does this refer to?  I think we want this to be *the*
mount id that's both unique and can be looked up and that is 64 bits wide.


> +	__u32	shared_group_id;	/* Shared: mount group ID */
> +	__u32	dependent_source_id;	/* Dependent: source mount group ID */
> +	__u32	dependent_clone_of_id;	/* Dependent: ID of mount this was cloned from */

Another set of ID's that are currently 32bit *internally* but that doesn't mean
they will always be 32 bit.

And that last one (apart from "slave" being obfuscated) is simply incorrect.  It
has nothing to do with cloning.  It's the "ID of the closest peer group in the
propagation chain that has a representative mount in the current root".

> +	__u32	propagation_type;	/* MOUNT_PROPAGATION_* type */
> +};
> +
> +#define FSINFO_ATTR_MOUNT_TOPOLOGY__STRUCT struct fsinfo_mount_topology
> +
> +/*
> + * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
> + * - An extra element is placed on the end representing the parent mount.
> + */
> +struct fsinfo_mount_child {
> +	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
> +	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
> +	__u32	parent_id;		/* Parent mount identifier */


Again, which ID do we want for this and parent?  Preferably one which is 64bit.
As it is we are operating with 96bit mount ID's, which is excessive.

> +};
> +
> +#define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
> +
>  /*
>   * Information struct for fsinfo(FSINFO_ATTR_STATFS).
>   * - This gives extended filesystem information.
> diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
> index 96a0240f23fe..9ac8bb708843 100644
> --- a/include/uapi/linux/mount.h
> +++ b/include/uapi/linux/mount.h
> @@ -105,7 +105,7 @@ enum fsconfig_command {
>  #define FSMOUNT_CLOEXEC		0x00000001
>  
>  /*
> - * Mount attributes.
> + * Mount object attributes (these are separate to filesystem attributes).
>   */
>  #define MOUNT_ATTR_RDONLY	0x00000001 /* Mount read-only */
>  #define MOUNT_ATTR_NOSUID	0x00000002 /* Ignore suid and sgid bits */
> @@ -117,4 +117,15 @@ enum fsconfig_command {
>  #define MOUNT_ATTR_STRICTATIME	0x00000020 /* - Always perform atime updates */
>  #define MOUNT_ATTR_NODIRATIME	0x00000080 /* Do not update directory access times */
>  
> +/*
> + * Mount object propagation type.
> + */
> +enum propagation_type {
> +	/* 0 is left unallocated to mean "no change" in mount_setattr()  */
> +	MOUNT_PROPAGATION_UNBINDABLE	= 1, /* Make unbindable. */
> +	MOUNT_PROPAGATION_PRIVATE	= 2, /* Do not receive or send mount events. */
> +	MOUNT_PROPAGATION_DEPENDENT	= 3, /* Only receive mount events. */
> +	MOUNT_PROPAGATION_SHARED	= 4, /* Send and receive mount events. */
> +};
> +
>  #endif /* _UAPI_LINUX_MOUNT_H */
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> index f3bebb7318d9..b7290ea8eb55 100644
> --- a/samples/vfs/test-fsinfo.c
> +++ b/samples/vfs/test-fsinfo.c
> @@ -21,6 +21,7 @@
>  #include <sys/syscall.h>
>  #include <linux/fsinfo.h>
>  #include <linux/socket.h>
> +#include <linux/mount.h>
>  #include <sys/stat.h>
>  #include <arpa/inet.h>
>  
> @@ -305,6 +306,58 @@ static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
>  	printf("\tattr    : %x\n", r->attr);
>  }
>  
> +static void dump_fsinfo_generic_mount_topology(void *reply, unsigned int size)
> +{
> +	struct fsinfo_mount_topology *r = reply;
> +
> +	printf("\n");
> +	printf("\tparent  : %x\n", r->parent_id);
> +
> +	switch (r->propagation_type) {
> +	case MOUNT_PROPAGATION_UNBINDABLE:
> +		printf("\tpropag  : unbindable\n");
> +		break;
> +	case MOUNT_PROPAGATION_PRIVATE:
> +		printf("\tpropag  : private\n");
> +		break;
> +	case MOUNT_PROPAGATION_DEPENDENT:
> +		printf("\tpropag  : dependent source=%x clone_of=%x\n",
> +		       r->dependent_source_id, r->dependent_clone_of_id);
> +		break;
> +	case MOUNT_PROPAGATION_SHARED:
> +		printf("\tpropag  : shared group=%x\n", r->shared_group_id);
> +		break;
> +	default:
> +		printf("\tpropag  : unknown type %x\n", r->propagation_type);
> +		break;
> +	}
> +
> +}
> +
> +static void dump_fsinfo_generic_mount_children(void *reply, unsigned int size)
> +{
> +	struct fsinfo_mount_child *r = reply;
> +	ssize_t mplen;
> +	char path[32], *mp;
> +
> +	struct fsinfo_params params = {
> +		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
> +		.request	= FSINFO_ATTR_MOUNT_POINT,
> +	};
> +
> +	if (!list_last) {
> +		sprintf(path, "%u", r->mnt_id);
> +		mplen = get_fsinfo(path, "FSINFO_ATTR_MOUNT_POINT", &params, (void **)&mp);
> +		if (mplen < 0)
> +			mp = "-";
> +	} else {
> +		mp = "<this>";
> +	}
> +
> +	printf("%8x %16llx %s\n",
> +	       r->mnt_id, (unsigned long long)r->mnt_unique_id, mp);
> +}
> +
>  static void dump_string(void *reply, unsigned int size)
>  {
>  	char *s = reply, *p;
> @@ -383,9 +436,11 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
>  	FSINFO_LIST	(FSINFO_ATTR_FSINFO_ATTRIBUTES,	fsinfo_meta_attributes),
>  
>  	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_INFO,	fsinfo_generic_mount_info),
> +	FSINFO_VSTRUCT	(FSINFO_ATTR_MOUNT_TOPOLOGY,	fsinfo_generic_mount_topology),
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_PATH,	string),
>  	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	string),
>  	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT_FULL,	string),
> +	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
>  	{}
>  };
>  
> 
> 

^ permalink raw reply

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Mark Rutland @ 2020-08-04 13:55 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
In-Reply-To: <86625441-80f3-2909-2f56-e18e2b60957d@linux.microsoft.com>

On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
> On 7/31/20 1:31 PM, Mark Rutland wrote:
> > On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> >> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> >>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> >>> <madvenka@linux.microsoft.com> wrote:
>> >> When the kernel generates the code for a trampoline, it can hard code data values
> >> in the generated code itself so it does not need PC-relative data referencing.
> >>
> >> And, for ISAs that do support the large offset, we do have to implement and
> >> maintain the code page stuff for different ISAs for each application and library
> >> if we did not use trampfd.
> > Trampoline code is architecture specific today, so I don't see that as a
> > major issue. Common structural bits can probably be shared even if the
> > specifid machine code cannot.
> 
> True. But an implementor may prefer a standard mechanism provided by
> the kernel so all of his architectures can be supported easily with less
> effort.
> 
> If you look at the libffi reference patch I have included, the architecture
> specific changes to use trampfd just involve a single C function call to
> a common code function.

Sure but in addition to that each architecture backend had to define a
set of arguments to that. I view the C function is analagous to the
"common structural bits".

I appreciate that your patch is small today (and architectures seem to
largely align on what they need), but I don't think it's necessarily
true that things will remain so simple as architecture are extended and
their calling conventions evolve, and I also don't think it's clear that
this will work for more complex cases elsewhere.

[...]

> >> With the user level trampoline table approach, the data part of the trampoline table
> >> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> >> target PC can be altered to some arbitrary location. Trampfd implements an
> >> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
> >> all ABI handlers used in closures for each architecture. This read-only array
> >> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> >>
> >> To generalize, we can implement security rules/features if the trampoline
> >> object is in the kernel.
> > I don't follow this argument. If it's possible to statically define that
> > in the kernel, it's also possible to do that in userspace without any
> > new kernel support.
> It is not statically defined in the kernel.
> 
> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
> ABI handlers:
> 
>     ffi_closure_unix64_sse
>     ffi_closure_unix64
>     ffi_closure_win64
> 
> I could create an "Allowed PCs" context like this:
> 
> struct my_allowed_pcs {
>     struct trampfd_values    pcs;
>     __u64                             pc_values[3];
> };
> 
> const struct my_allowed_pcs    my_allowed_pcs = {
>     { 3, 0 },
>     (uintptr_t) ffi_closure_unix64_sse,
>     (uintptr_t) ffi_closure_unix64,
>     (uintptr_t) ffi_closure_win64,
> };
> 
> I have created a read-only array of allowed ABI handlers that closures use.
> 
> When I set up the context for a closure trampoline, I could do this:
> 
>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>    
> This copies the array into the trampoline object in the kernel.
> When the register context is set for the trampoline, the kernel checks
> the PC register value against allowed PCs.
> 
> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
> permitted target PCs enforced by the kernel are the ABI handlers.

Sorry, when I said "statically define" meant when you knew legitimate
targets ahead of time when you create the trampoline (i.e. whether you
could enumerate those and know they would not change dynamically).

My point was that you can achieve the same in userspace if the
trampoline and array of legitimate targets are in read-only memory,
without having to trap to the kernel.

I think the key point here is that an adversary must be prevented from
altering a trampoline and any associated metadata, and I think that
there are ways of achieving that without having to trap into the kernel,
and without the kernel having to be intimately aware of the calling
conventions used in userspace.

[...]

> >> Trampfd is a framework that can be used to implement multiple things. May be,
> >> a few of those things can also be implemented in user land itself. But I think having
> >> just one mechanism to execute dynamic code objects is preferable to having
> >> multiple mechanisms not standardized across all applications.
> > In abstract, having a common interface sounds nice, but in practice
> > elements of this are always architecture-specific (e.g. interactiosn
> > with HW CFI), and that common interface can result in more pain as it
> > doesn't fit naturally into the context that ISAs were designed for (e.g. 
> > where control-flow instructions are extended with new semantics).
> 
> In the case of trampfd, the code generation is indeed architecture
> specific. But that is in the kernel. The application is not affected by it.

As an ABI detail, applications are *definitely* affected by this, and it
is wrong to suggest they are not even if you don't have a specific case
in mind today. As this forms a contract between userspace and the kernel
it's overly simplistic to say that it's the kernel's problem

For example, in the case of BTI on arm64, what should the trampoline
set PSTATE.BTYPE to? Different use-cases *will* want different values,
and not necessarily the value of PSTATE at the instant the call to the
trampoline was made. In the case of libffi specifically using the
original value of PSTATE.BTYPE probably is sound, but other code
sequences may need to restrict/broaden or entirely change that.

> Again, referring to the libffi reference patch, I have defined wrapper
> functions for trampfd in common code. The architecture specific code
> in libffi only calls the set_context function defined in common code.
> Even this is required only because register names are specific to each
> architecture and the target PC (to the ABI handler) is specific to
> each architecture-ABI combo.
> 
> > It also meass that you can't share the rough approach across OSs which
> > do not implement an identical mechanism, so for code abstracting by ISA
> > first, then by platform/ABI, there isn't much saving.
> 
> Why can you not share the same approach across OSes? In fact,
> I have tried to design it so that other OSes can use the same
> mechanism.

Sure, but where they *don't*, you must fall back to the existing
purely-userspace mechanisms, and so a codebase now has the burden of
maintaining two distinct mechanisms.

Whereas if there's a way of doing this in userspace with (stronger)
enforcement of memory permissions the trampoline code can be common for
when this is present or absent, which is much easier for a codebase rto
maintain, and could make use of weaker existing mechanisms to improve
the situation on systems without the new functionality.

Thanks,
Mark.

^ permalink raw reply

* Re: [PATCH 10/18] fsinfo: Provide notification overrun handling support [ver #21]
From: Miklos Szeredi @ 2020-08-04 13:56 UTC (permalink / raw)
  To: David Howells
  Cc: viro, torvalds, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646187082.1784947.4293611877413578847.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> Provide support for the handling of an overrun in a watch queue.  In the
> event that an overrun occurs, the watcher needs to be able to find out what
> it was that they missed.  To this end, previous patches added event
> counters to struct mount.

So this is optimizing the buffer overrun case?

Shoun't we just make sure that the likelyhood of overruns is low and if it
happens, just reinitialize everthing from scratch (shouldn't be *that*
expensive).

Trying to find out what was missed seems like just adding complexity for no good
reason.


^ permalink raw reply

* Re: [PATCH 15/18] fsinfo: Add an attribute that lists all the visible mounts in a namespace [ver #21]
From: Miklos Szeredi @ 2020-08-04 14:05 UTC (permalink / raw)
  To: David Howells
  Cc: viro, torvalds, raven, mszeredi, christian, jannh, darrick.wong,
	kzak, jlayton, linux-api, linux-fsdevel, linux-security-module,
	linux-kernel
In-Reply-To: <159646191446.1784947.11228235431863356055.stgit@warthog.procyon.org.uk>

On Mon, Aug 03, 2020 at 02:38:34PM +0100, David Howells wrote:
> Add a filesystem attribute that exports a list of all the visible mounts in
> a namespace, given the caller's chroot setting.  The returned list is an
> array of:
> 
> 	struct fsinfo_mount_child {
> 		__u64	mnt_unique_id;
> 		__u32	mnt_id;
> 		__u32	parent_id;
> 		__u32	mnt_notify_sum;
> 		__u32	sb_notify_sum;
> 	};
> 
> where each element contains a once-in-a-system-lifetime unique ID, the
> mount ID (which may get reused), the parent mount ID and sums of the
> notification/change counters for the mount and its superblock.

The change counters are currently conditional on CONFIG_MOUNT_NOTIFICATIONS.
Is this is intentional?

> 
> This works with a read lock on the namespace_sem, but ideally would do it
> under the RCU read lock only.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  fs/fsinfo.c                 |    1 +
>  fs/internal.h               |    1 +
>  fs/namespace.c              |   37 +++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fsinfo.h |    4 ++++
>  samples/vfs/test-fsinfo.c   |   22 ++++++++++++++++++++++
>  5 files changed, 65 insertions(+)
> 
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> index 0540cce89555..f230124ffdf5 100644
> --- a/fs/fsinfo.c
> +++ b/fs/fsinfo.c
> @@ -296,6 +296,7 @@ static const struct fsinfo_attribute fsinfo_common_attributes[] = {
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT,	fsinfo_generic_mount_point),
>  	FSINFO_STRING	(FSINFO_ATTR_MOUNT_POINT_FULL,	fsinfo_generic_mount_point_full),
>  	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
> +	FSINFO_LIST	(FSINFO_ATTR_MOUNT_ALL,		fsinfo_generic_mount_all),
>  	{}
>  };
>  
> diff --git a/fs/internal.h b/fs/internal.h
> index cb5edcc7125a..267b4aaf0271 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -102,6 +102,7 @@ extern int fsinfo_generic_mount_topology(struct path *, struct fsinfo_context *)
>  extern int fsinfo_generic_mount_point(struct path *, struct fsinfo_context *);
>  extern int fsinfo_generic_mount_point_full(struct path *, struct fsinfo_context *);
>  extern int fsinfo_generic_mount_children(struct path *, struct fsinfo_context *);
> +extern int fsinfo_generic_mount_all(struct path *, struct fsinfo_context *);
>  
>  /*
>   * fs_struct.c
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 122c12f9512b..1f2e06507244 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -4494,4 +4494,41 @@ int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
>  	return ctx->usage;
>  }
>  
> +/*
> + * Return information about all the mounts in the namespace referenced by the
> + * path.
> + */
> +int fsinfo_generic_mount_all(struct path *path, struct fsinfo_context *ctx)
> +{
> +	struct mnt_namespace *ns;
> +	struct mount *m, *p;
> +	struct path chroot;
> +	bool allow;
> +
> +	m = real_mount(path->mnt);
> +	ns = m->mnt_ns;
> +
> +	get_fs_root(current->fs, &chroot);
> +	rcu_read_lock();
> +	allow = are_paths_connected(&chroot, path) || capable(CAP_SYS_ADMIN);
> +	rcu_read_unlock();
> +	path_put(&chroot);
> +	if (!allow)
> +		return -EPERM;
> +
> +	down_read(&namespace_sem);
> +
> +	list_for_each_entry(p, &ns->list, mnt_list) {

This is missing locking and check added by commit 9f6c61f96f2d ("proc/mounts:
add cursor").

> +		struct path mnt_root;
> +
> +		mnt_root.mnt	= &p->mnt;
> +		mnt_root.dentry	= p->mnt.mnt_root;
> +		if (are_paths_connected(path, &mnt_root))
> +			fsinfo_store_mount(ctx, p, p == m);
> +	}
> +
> +	up_read(&namespace_sem);
> +	return ctx->usage;
> +}
> +
>  #endif /* CONFIG_FSINFO */
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> index 81329de6905e..e40192d98648 100644
> --- a/include/uapi/linux/fsinfo.h
> +++ b/include/uapi/linux/fsinfo.h
> @@ -37,6 +37,7 @@
>  #define FSINFO_ATTR_MOUNT_POINT_FULL	0x203	/* Absolute path of mount (string) */
>  #define FSINFO_ATTR_MOUNT_TOPOLOGY	0x204	/* Mount object topology */
>  #define FSINFO_ATTR_MOUNT_CHILDREN	0x205	/* Children of this mount (list) */
> +#define FSINFO_ATTR_MOUNT_ALL		0x206	/* List all mounts in a namespace (list) */
>  
>  #define FSINFO_ATTR_AFS_CELL_NAME	0x300	/* AFS cell name (string) */
>  #define FSINFO_ATTR_AFS_SERVER_NAME	0x301	/* Name of the Nth server (string) */
> @@ -128,6 +129,8 @@ struct fsinfo_mount_topology {
>  /*
>   * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
>   * - An extra element is placed on the end representing the parent mount.
> + *
> + * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_ALL).
>   */
>  struct fsinfo_mount_child {
>  	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
> @@ -139,6 +142,7 @@ struct fsinfo_mount_child {
>  };
>  
>  #define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
> +#define FSINFO_ATTR_MOUNT_ALL__STRUCT struct fsinfo_mount_child
>  
>  /*
>   * Information struct for fsinfo(FSINFO_ATTR_STATFS).
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> index 374825ab85b0..596fa5e71762 100644
> --- a/samples/vfs/test-fsinfo.c
> +++ b/samples/vfs/test-fsinfo.c
> @@ -365,6 +365,27 @@ static void dump_fsinfo_generic_mount_children(void *reply, unsigned int size)
>  	       (unsigned long long)r->mnt_notify_sum, mp);
>  }
>  
> +static void dump_fsinfo_generic_mount_all(void *reply, unsigned int size)
> +{
> +	struct fsinfo_mount_child *r = reply;
> +	ssize_t mplen;
> +	char path[32], *mp;
> +
> +	struct fsinfo_params params = {
> +		.flags		= FSINFO_FLAGS_QUERY_MOUNT,
> +		.request	= FSINFO_ATTR_MOUNT_POINT_FULL,
> +	};
> +
> +	sprintf(path, "%u", r->mnt_id);
> +	mplen = get_fsinfo(path, "FSINFO_ATTR_MOUNT_POINT_FULL", &params, (void **)&mp);
> +	if (mplen < 0)
> +		mp = "-";
> +
> +	printf("%5x %5x %12llx %10llu %s\n",
> +	       r->mnt_id, r->parent_id, (unsigned long long)r->mnt_unique_id,
> +	       r->mnt_notify_sum, mp);
> +}
> +
>  static void dump_afs_fsinfo_server_address(void *reply, unsigned int size)
>  {
>  	struct fsinfo_afs_server_address *f = reply;
> @@ -492,6 +513,7 @@ static const struct fsinfo_attribute fsinfo_attributes[] = {
>  	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT,	string),
>  	FSINFO_STRING_N	(FSINFO_ATTR_MOUNT_POINT_FULL,	string),
>  	FSINFO_LIST	(FSINFO_ATTR_MOUNT_CHILDREN,	fsinfo_generic_mount_children),
> +	FSINFO_LIST	(FSINFO_ATTR_MOUNT_ALL,		fsinfo_generic_mount_all),
>  
>  	FSINFO_STRING	(FSINFO_ATTR_AFS_CELL_NAME,	string),
>  	FSINFO_STRING	(FSINFO_ATTR_AFS_SERVER_NAME,	string),
> 
> 

^ permalink raw reply

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Mark Rutland @ 2020-08-04 14:30 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
In-Reply-To: <6236adf7-4bed-534e-0956-fddab4fd96b6@linux.microsoft.com>

On Mon, Aug 03, 2020 at 11:57:57AM -0500, Madhavan T. Venkataraman wrote:
> Responses inline..
> 
> On 7/31/20 1:09 PM, Mark Rutland wrote:
> > Hi,
> >
> > On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >> Trampoline code is placed either in a data page or in a stack page. In
> >> order to execute a trampoline, the page it resides in needs to be mapped
> >> with execute permissions. Writable pages with execute permissions provide
> >> an attack surface for hackers. Attackers can use this to inject malicious
> >> code, modify existing code or do other harm.
> > For the purpose of below, IIUC this assumes the adversary has an
> > arbitrary write.
> >
> >> To mitigate this, LSMs such as SELinux may not allow pages to have both
> >> write and execute permissions. This prevents trampolines from executing
> >> and blocks applications that use trampolines. To allow genuine applications
> >> to run, exceptions have to be made for them (by setting execmem, etc).
> >> In this case, the attack surface is just the pages of such applications.
> >>
> >> An application that is not allowed to have writable executable pages
> >> may try to load trampoline code into a file and map the file with execute
> >> permissions. In this case, the attack surface is just the buffer that
> >> contains trampoline code. However, a successful exploit may provide the
> >> hacker with means to load his own code in a file, map it and execute it.
> > It's not clear to me what power the adversary is assumed to have here,
> > and consequently it's not clear to me how the proposal mitigates this.
> >
> > For example, if the attack can control the arguments to syscalls, and
> > has an arbitrary write as above, what prevents them from creating a
> > trampfd of their own?
> 
> That is the point. If a process is allowed to have pages that are both
> writable and executable, a hacker can exploit some vulnerability such
> as buffer overflow to write his own code into a page and somehow
> contrive to execute that.

I understood that, and that was not my question.

> So, the context is - if security settings in a system disallow a page to have
> both write and execute permissions, how do you allow the execution of
> genuine trampolines that are runtime generated and placed in a data
> page or a stack page?

There are options today, e.g.

a) If the restriction is only per-alias, you can have distinct aliases
   where one is writable and another is executable, and you can make it
   hard to find the relationship between the two.

b) If the restriction is only temporal, you can write instructions into
   an RW- buffer, transition the buffer to R--, verify the buffer
   contents, then transition it to --X.

c) You can have two processes A and B where A generates instrucitons into
   a buffer that (only) B can execute (where B may be restricted from
   making syscalls like write, mprotect, etc).

If (as this series appears to) you assume that an adversary can't
control the arguments trampfd_create() and any such call is legitimate,
then something like (b) is not weaker and can be much more general
without many of the potential ABI or performance problems of trying to
fiddle with precedure call details in the kernel.

If that's not an assumption, then I'm missing how you expect to
determine that a trampfd_create() call is legitimate, and why that could
not be applied to other calls.

[...]

> Could not agree with you more.
> >
> > [...]
> >
> >> Trampoline File Descriptor (trampfd)
> >> --------------------------
> >>
> >> I am proposing a kernel API using anonymous file descriptors that
> >> can be used to create and execute trampolines with the help of the
> >> kernel. In this solution also, the kernel does the work of the trampoline.
> > What's the rationale for the kernel emulating the trampoline here?
> >
> > In ther case of EMUTRAMP this was necessary to work with existing
> > application binaries and kernel ABIs which placed instructions onto the
> > stack, and the stack needed to remain RW for other reasons. That
> > restriction doesn't apply here.
> 
> In addition to the stack, EMUTRAMP also allows the emulation
> of the same well-known trampolines placed in a non-stack data page.
> For instance, libffi closures embed a trampoline in a closure structure.
> That gets executed when the caller of libffi invokes it.
> 
> The goal of EMUTRAMP is to allow safe trampolines to execute when
> security settings disallow their execution. Mainly, it permits applications
> that use libffi to run. A lot of applications use libffi.
> 
> They chose the emulation method so that no changes need to be made
> to application code to use them. But the EMUTRAMP implementors note
> in their description that the real solution to the problem is a kernel
> API that is backed by a safe code generator.
> 
> trampd is an attempt to define such an API. This is just a starting point.
> I realize that we need to have a lot of discussion to refine the approach.
> 
> > Assuming trampfd creation is somehow authenticated, the code could be
> > placed in a r-x page (which the kernel could refuse to add write
> > permission), in order to prevent modification. If that's sufficient,
> > it's not much of a leap to allow userspace to generate the code.
> 
> IIUC, you are suggesting that the user hands the kernel a code fragment
> and requests it to be placed in an r-x page, correct? However, the
> kernel cannot trust any code given to it by the user. Nor can it scan any
> piece of code and reliably decide if it is safe or not.

Per that same logic the kernel cannot trust trampfd creation calls to be
legitimate as the adversary could mess with the arguments. It doesn't
matter if the kernel's codegen is trustworthy if it's potentially driven
by an adversary.

> So, the problem of executing dynamic code when security settings are
> restrictive cannot be solved in userland. The only option I can think of is
> to have the kernel provide support for dynamic code. It must have one
> or more safe, trusted code generation components and an API to use
> the components.
> 
> My goal is to introduce an API and start off by supporting simple, regular
> trampolines that are widely used. Then, evolve the feature over a period
> of time to include other forms of dynamic code such as JIT code.

I think that you're making a leap to this approach without sufficient
justification that it actually solves the problem, and I believe that
there will be ABI issues with this approach which can be sidestepped by
other potential approaches.

Taking a step back, I think it's necessary to better describe the
problem and constraints that you believe apply before attempting to
justify any potential solution.

[...]

> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> >>
> >> In this case, the attack surface is the context buffer. A hacker may
> >> attack an application with a vulnerability and may be able to modify the
> >> context buffer. So, when the register or stack context is set for
> >> a trampoline, the values may have been tampered with. From an attack
> >> surface perspective, this is similar to Trampoline Emulation. But
> >> with trampfd, user code can retrieve a trampoline's context from the
> >> kernel and add defensive checks to see if the context has been
> >> tampered with.
> > Can you elaborate on this: what sort of checks would be applied, and
> > how?
> 
> So, an application that uses trampfd would do the following steps:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4. Invoke the trampoline using the address
> 
> Let us say that the application has a vulnerability such as buffer overflow
> that allows a hacker to modify the data that is used to do step 2.
> 
> Potentially, a hacker could modify the following things:
>     - register values specified in the register context
>     - values specified in the stack context
>     - the target PC specified in the register context
> 
> When the trampoline is invoked in step 4, the kernel will gain control,
> load the registers, push stuff on the stack and transfer control to the target
> PC. Whatever the hacker had modified in step 2 will take effect in step 4.
> His values will get loaded and his PC is the one that will get control.
> 
> A paranoid application could add a step to this sequence. So, the steps
> would be:
> 
> 1. Create a trampoline by calling trampfd_create()
> 2. Set the register and/or stack contexts for the trampoline.
> 3. mmap() the trampoline to get an address
> 4a. Retrieve the register and stack context for the trampoline from the
>       kernel and check if anything has been altered. If yes, abort.
> 4b. Invoke the trampoline using the address

As above, you can also do this when using mprotect today, transitioning
the buffer RWX -> R-- -> R-X. If you're worried about subsequent
modification via an alias, a sealed memfd would work assuming that can
be mapped R-X.

This approach is applicable to trampfd, but it isn't a specific benefit
of trampfd.

[...] 

> >> - In the future, if the kernel can be enhanced to use a safe code
> >>   generation component, that code can be placed in the trampoline mapping
> >>   pages. Then, the trampoline invocation does not have to incur a trip
> >>   into the kernel.
> >>
> >> - Also, if the kernel can be enhanced to use a safe code generation
> >>   component, other forms of dynamic code such as JIT code can be
> >>   addressed by the trampfd framework.
> > I don't see why it's necessary for the kernel to generate code at all.
> > If the trampfd creation requests can be trusted, what prevents trusting
> > a sealed set of instructions generated in userspace?
> 
> Let us consider a system in which:
>     - a process is not permitted to have pages with both write and execute
>     - a process is not permitted to map any file as executable unless it
>       is properly signed. In other words, cryptographically verified.
> 
> Then, the process cannot execute any code that is runtime generated.
> That includes trampolines. Only trampoline code that is part of program
> text at build time would be permitted to execute.
> 
> In this scenario, trampfd requests are coming from signed code. So, they
> are trusted by the kernel. But trampoline code could be dynamically generated.
> The kernel will not trust it.

I think this a very hand-wavy argument, as it suggests that generated
code is not trusted, but what is effectively a generated bytecode is.

If certain codegen can be trusted, then we can add mechanisms to permit
the results of this to be mapped r-x. If that is not possible, then the
same argument says that trampfd requests cannot be trusted.

Thanks,
Mark.

^ permalink raw reply

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: David Laight @ 2020-08-04 14:33 UTC (permalink / raw)
  To: 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
In-Reply-To: <20200804135558.GA7440@C02TD0UTHF1T.local>

> > If you look at the libffi reference patch I have included, the architecture
> > specific changes to use trampfd just involve a single C function call to
> > a common code function.

No idea what libffi is, but it must surely be simpler to
rewrite it to avoid nested function definitions.

Or find a book from the 1960s on how to do recursive
calls and nested functions in FORTRAN-IV.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply

* Re: [GIT PULL] Filesystem Information
From: Miklos Szeredi @ 2020-08-04 14:36 UTC (permalink / raw)
  To: Ian Kent
  Cc: David Howells, Linus Torvalds, Al Viro, Karel Zak, Jeff Layton,
	Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <ac1f5e3406abc0af4cd08d818fe920a202a67586.camel@themaw.net>

On Tue, Aug 4, 2020 at 4:15 AM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-08-03 at 18:42 +0200, Miklos Szeredi wrote:
> > On Mon, Aug 3, 2020 at 5:50 PM David Howells <dhowells@redhat.com>
> > wrote:
> > >
> > > Hi Linus,
> > >
> > > Here's a set of patches that adds a system call, fsinfo(), that
> > > allows
> > > information about the VFS, mount topology, superblock and files to
> > > be
> > > retrieved.
> > >
> > > The patchset is based on top of the mount notifications patchset so
> > > that
> > > the mount notification mechanism can be hooked to provide event
> > > counters
> > > that can be retrieved with fsinfo(), thereby making it a lot faster
> > > to work
> > > out which mounts have changed.
> > >
> > > Note that there was a last minute change requested by Miklós: the
> > > event
> > > counter bits got moved from the mount notification patchset to this
> > > one.
> > > The counters got made atomic_long_t inside the kernel and __u64 in
> > > the
> > > UAPI.  The aggregate changes can be assessed by comparing pre-
> > > change tag,
> > > fsinfo-core-20200724 to the requested pull tag.
> > >
> > > Karel Zak has created preliminary patches that add support to
> > > libmount[*]
> > > and Ian Kent has started working on making systemd use these and
> > > mount
> > > notifications[**].
> >
> > So why are you asking to pull at this stage?
> >
> > Has anyone done a review of the patchset?
>
> I have been working with the patch set as it has evolved for quite a
> while now.
>
> I've been reading the kernel code quite a bit and forwarded questions
> and minor changes to David as they arose.
>
> As for a review, not specifically, but while the series implements a
> rather large change it's surprisingly straight forward to read.
>
> In the time I have been working with it I haven't noticed any problems
> except for those few minor things that I reported to David early on (in
> some cases accompanied by simple patches).
>
> And more recently (obviously) I've been working with the mount
> notifications changes and, from a readability POV, I find it's the
> same as the fsinfo() code.
>
> >
> > I think it's obvious that this API needs more work.  The integration
> > work done by Ian is a good direction, but it's not quite the full
> > validation and review that a complex new API needs.
>
> Maybe but the system call is fundamental to making notifications useful
> and, as I say, after working with it for quite a while I don't fell
> there's missing features (that David hasn't added along the way) and
> have found it provides what's needed for what I'm doing (for mount
> notifications at least).

Apart from the various issues related to the various mount ID's and
their sizes, my general comment is (and was always): why are we adding
a multiplexer for retrieval of mostly unrelated binary structures?

<linux/fsinfo.h> is 345 lines.  This is not a simple and clean API.

A simple and clean replacement API would be:

int get_mount_attribute(int dfd, const char *path, const char
*attr_name, char *value_buf, size_t buf_size, int flags);

No header file needed with dubiously sized binary values.

The only argument was performance, but apart from purely synthetic
microbenchmarks that hasn't been proven to be an issue.

And notice how similar the above interface is to getxattr(), or the
proposed readfile().  Where has the "everything is  a file" philosophy
gone?

I think we already lost that with the xattr API, that should have been
done in a way that fits this philosophy.  But given that we  have "/"
as the only special purpose char in filenames, and even repetitions
are allowed, it's hard to think of a good way to do that.  Pity.

Still I think it would be nice to have a general purpose attribute
retrieval API instead of the multiplicity of binary ioctls, xattrs,
etc.

Is that totally crazy?  Nobody missing the beauty in recently introduced APIs?

Thanks,
Miklos

^ permalink raw reply

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: David Laight @ 2020-08-04 14:44 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
In-Reply-To: <c898918d18f34fd5b004cd1549b6a99e@AcuMS.aculab.com>

> > > If you look at the libffi reference patch I have included, the architecture
> > > specific changes to use trampfd just involve a single C function call to
> > > a common code function.
> 
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.
> 
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.

FWIW it is probably as simple as:
1) Put all the 'variables' the nested function accesses into a struct.
2) Add a field for the address of the 'nested' function.
3) Pass the address of the structure down instead of the
   address of the function.

If you aren't in control of the call sites then add the
structure to a linked list on a thread-local variable.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-08-04 14:48 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
In-Reply-To: <c898918d18f34fd5b004cd1549b6a99e@AcuMS.aculab.com>



On 8/4/20 9:33 AM, David Laight wrote:
>>> If you look at the libffi reference patch I have included, the architecture
>>> specific changes to use trampfd just involve a single C function call to
>>> a common code function.
> No idea what libffi is, but it must surely be simpler to
> rewrite it to avoid nested function definitions.

Sorry if I wasn't clear.

libffi is a separate use case and GCC nested functions is a separate one.
libffi is not used to solve the nested function stuff.

For nested functions, GCC generates trampoline code and arranges to
place it on the stack and execute it.

I agree with your other points about nested function implementation.

Madhavan
> Or find a book from the 1960s on how to do recursive
> calls and nested functions in FORTRAN-IV.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)


^ permalink raw reply

* Re: [PATCH v5 3/4] LSM: Define SELinux function to measure state and policy
From: Stephen Smalley @ 2020-08-04 15:20 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: Mimi Zohar, Casey Schaufler, Tyler Hicks, sashal, James Morris,
	linux-integrity, SElinux list, LSM List, linux-kernel
In-Reply-To: <3e766eed-7a0b-afca-6139-ac43dea053d7@linux.microsoft.com>

On 8/3/20 6:08 PM, Lakshmi Ramasubramanian wrote:

> On 8/3/20 2:07 PM, Stephen Smalley wrote:
>
>>>>> [   68.870715] irq event stamp: 23486085
>>>>> [   68.870715] hardirqs last  enabled at (23486085):
>>>>> [<ffffffffaa419406>] _raw_spin_unlock_irqrestore+0x46/0x60
>>>>> [   68.870715] hardirqs last disabled at (23486084):
>>>>> [<ffffffffaa419443>] _raw_spin_lock_irqsave+0x23/0x90
>>>>> [   68.870715] softirqs last  enabled at (23486074):
>>>>> [<ffffffffaa8004f3>] __do_softirq+0x4f3/0x662
>>>>> [   68.870715] softirqs last disabled at (23486067):
>>>>> [<ffffffffaa601072>] asm_call_on_stack+0x12/0x20
>>>>> [   68.870715] ---[ end trace fb02740ff6f4d0cd ]---
>>>>
>>>> I think one issue here is that systemd loads SELinux policy first, 
>>>> then IMA policy, so it doesn't know whether it needs to measure 
>>>> SELinux policy on first policy load, and another issue is that the 
>>>> policy is too large to just queue the policy data itself this way 
>>>> (or you need to use an allocator that can handle larger sizes).
>>>>
>>>
>>> The problem seems to be that a lock is held when the IMA hook to 
>>> measure the LSM state is called. So memory allocation is not 
>>> allowed, but the hook is doing an allocation. I'll address this - 
>>> thanks for catching it.
>>>
>>> I have the following CONFIGs enabled, but I still don't see the 
>>> above issue on my machine.
>>>
>> The warning has to do with the memory allocation order being above 
>> the max order supported for kmalloc.  I think the problem is that 
>> ima_alloc_data_entry() is using kmemdup() to duplicate a payload of 
>> arbitrary size.  Policies on e.g. Fedora can be quite large, so you 
>> can't assume they can be allocated with kmalloc and friends.
>>
>
> Thanks for clarifying. Yes ima_alloc_entry() does use kmemdup to save 
> the given buffer (to be measured) until IMA loads custom policy.
>
> On my machine the SELinux policy size is about 2MB.
>
> Perhaps vmalloc would be better than using kmalloc? If there are 
> better options for such large buffer allocation, please let me know.

kvmalloc() can be used to select whichever one is most appropriate.



^ permalink raw reply

* Re: [PATCH v5 3/4] LSM: Define SELinux function to measure state and policy
From: Stephen Smalley @ 2020-08-04 15:29 UTC (permalink / raw)
  To: Lakshmi Ramasubramanian
  Cc: Mimi Zohar, Casey Schaufler, Tyler Hicks, sashal, James Morris,
	linux-integrity, SElinux list, LSM List, linux-kernel
In-Reply-To: <0fa0b1f3-6226-c307-0f11-8b3a881a070e@gmail.com>

On 8/4/20 11:20 AM, Stephen Smalley wrote:

> On 8/3/20 6:08 PM, Lakshmi Ramasubramanian wrote:
>
>> On 8/3/20 2:07 PM, Stephen Smalley wrote:
>>
>>>>>> [   68.870715] irq event stamp: 23486085
>>>>>> [   68.870715] hardirqs last  enabled at (23486085):
>>>>>> [<ffffffffaa419406>] _raw_spin_unlock_irqrestore+0x46/0x60
>>>>>> [   68.870715] hardirqs last disabled at (23486084):
>>>>>> [<ffffffffaa419443>] _raw_spin_lock_irqsave+0x23/0x90
>>>>>> [   68.870715] softirqs last  enabled at (23486074):
>>>>>> [<ffffffffaa8004f3>] __do_softirq+0x4f3/0x662
>>>>>> [   68.870715] softirqs last disabled at (23486067):
>>>>>> [<ffffffffaa601072>] asm_call_on_stack+0x12/0x20
>>>>>> [   68.870715] ---[ end trace fb02740ff6f4d0cd ]---
>>>>>
>>>>> I think one issue here is that systemd loads SELinux policy first, 
>>>>> then IMA policy, so it doesn't know whether it needs to measure 
>>>>> SELinux policy on first policy load, and another issue is that the 
>>>>> policy is too large to just queue the policy data itself this way 
>>>>> (or you need to use an allocator that can handle larger sizes).
>>>>>
>>>>
>>>> The problem seems to be that a lock is held when the IMA hook to 
>>>> measure the LSM state is called. So memory allocation is not 
>>>> allowed, but the hook is doing an allocation. I'll address this - 
>>>> thanks for catching it.
>>>>
>>>> I have the following CONFIGs enabled, but I still don't see the 
>>>> above issue on my machine.
>>>>
>>> The warning has to do with the memory allocation order being above 
>>> the max order supported for kmalloc.  I think the problem is that 
>>> ima_alloc_data_entry() is using kmemdup() to duplicate a payload of 
>>> arbitrary size.  Policies on e.g. Fedora can be quite large, so you 
>>> can't assume they can be allocated with kmalloc and friends.
>>>
>>
>> Thanks for clarifying. Yes ima_alloc_entry() does use kmemdup to save 
>> the given buffer (to be measured) until IMA loads custom policy.
>>
>> On my machine the SELinux policy size is about 2MB.
>>
>> Perhaps vmalloc would be better than using kmalloc? If there are 
>> better options for such large buffer allocation, please let me know.
>
> kvmalloc() can be used to select whichever one is most appropriate.

Other option would be for ima to compute and save the hash(es) of the 
payload and not the payload itself for later use.  I guess you won't 
know at that point which hash algorithm is desired?



^ permalink raw reply

* Re: [PATCH 00/18] VFS: Filesystem information [ver #21]
From: James Bottomley @ 2020-08-04 15:39 UTC (permalink / raw)
  To: David Howells, viro
  Cc: Theodore Ts'o, Andreas Dilger, Eric Biggers, Jeff Layton,
	linux-ext4, Carlos Maiolino, Darrick J. Wong, linux-api, torvalds,
	raven, mszeredi, christian, jannh, kzak, jlayton, linux-fsdevel,
	linux-security-module, linux-kernel
In-Reply-To: <159646178122.1784947.11705396571718464082.stgit@warthog.procyon.org.uk>

On Mon, 2020-08-03 at 14:36 +0100, David Howells wrote:
> Here's a set of patches that adds a system call, fsinfo(), that
> allows information about the VFS, mount topology, superblock and
> files to be retrieved.
> 
> The patchset is based on top of the notifications patchset and allows
> event counters implemented in the latter to be retrieved to allow
> overruns to be efficiently managed.

Could I repeat the question I asked about six months back that never
got answered:

https://lore.kernel.org/linux-api/1582316494.3376.45.camel@HansenPartnership.com/

It sort of petered out into a long winding thread about why not use
sysfs instead, which really doesn't look like a good idea to me.

I'll repeat the information for those who want to quote it easily on
reply without having to use a web interface:

---
Could I make a suggestion about how this should be done in a way that
doesn't actually require the fsinfo syscall at all: it could just be
done with fsconfig.  The idea is based on something I've wanted to do
for configfd but couldn't because otherwise it wouldn't substitute for
fsconfig, but Christian made me think it was actually essential to the
ability of the seccomp and other verifier tools in the critique of
configfd and I belive the same critique applies here.

Instead of making fsconfig functionally configure ... as in you pass
the attribute name, type and parameters down into the fs specific
handler and the handler does a string match and then verifies the
parameters and then acts on them, make it table configured, so what
each fstype does is register a table of attributes which can be got and
optionally set (with each attribute having a get and optional set
function).  We'd have multiple tables per fstype, so the generic VFS
can register a table of attributes it understands for every fstype
(things like name, uuid and the like) and then each fs type would
register a table of fs specific attributes following the same pattern. 
The system would examine the fs specific table before the generic one,
allowing overrides.  fsconfig would have the ability to both get and
set attributes, permitting retrieval as well as setting (which is how I
get rid of the fsinfo syscall), we'd have a global parameter, which
would retrieve the entire table by name and type so the whole thing is
introspectable because the upper layer knows a-priori all the
attributes which can be set for a given fs type and what type they are
(so we can make more of the parsing generic).  Any attribute which
doesn't have a set routine would be read only and all attributes would
have to have a get routine meaning everything is queryable.

I think I know how to code this up in a way that would be fully
transparent to the existing syscalls.
---

James




^ permalink raw reply

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-08-04 15:46 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML
In-Reply-To: <20200804135558.GA7440@C02TD0UTHF1T.local>

Hey Mark,

I am working on putting together an improved definition of trampfd per
Andy's comment. I will try to address your comments in that improved
definition. Once I send that out, I will respond to your emails as well.

Thanks.

Madhavan

On 8/4/20 8:55 AM, Mark Rutland wrote:
> On Mon, Aug 03, 2020 at 12:58:04PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/31/20 1:31 PM, Mark Rutland wrote:
>>> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>>>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>>>> <madvenka@linux.microsoft.com> wrote:
>>>>> When the kernel generates the code for a trampoline, it can hard code data values
>>>> in the generated code itself so it does not need PC-relative data referencing.
>>>>
>>>> And, for ISAs that do support the large offset, we do have to implement and
>>>> maintain the code page stuff for different ISAs for each application and library
>>>> if we did not use trampfd.
>>> Trampoline code is architecture specific today, so I don't see that as a
>>> major issue. Common structural bits can probably be shared even if the
>>> specifid machine code cannot.
>> True. But an implementor may prefer a standard mechanism provided by
>> the kernel so all of his architectures can be supported easily with less
>> effort.
>>
>> If you look at the libffi reference patch I have included, the architecture
>> specific changes to use trampfd just involve a single C function call to
>> a common code function.
> Sure but in addition to that each architecture backend had to define a
> set of arguments to that. I view the C function is analagous to the
> "common structural bits".
>
> I appreciate that your patch is small today (and architectures seem to
> largely align on what they need), but I don't think it's necessarily
> true that things will remain so simple as architecture are extended and
> their calling conventions evolve, and I also don't think it's clear that
> this will work for more complex cases elsewhere.
>
> [...]
>
>>>> With the user level trampoline table approach, the data part of the trampoline table
>>>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>>>> target PC can be altered to some arbitrary location. Trampfd implements an
>>>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
>>>> all ABI handlers used in closures for each architecture. This read-only array
>>>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>>>
>>>> To generalize, we can implement security rules/features if the trampoline
>>>> object is in the kernel.
>>> I don't follow this argument. If it's possible to statically define that
>>> in the kernel, it's also possible to do that in userspace without any
>>> new kernel support.
>> It is not statically defined in the kernel.
>>
>> Let us take the libffi example. In the 64-bit X86 arch code, there are 3
>> ABI handlers:
>>
>>     ffi_closure_unix64_sse
>>     ffi_closure_unix64
>>     ffi_closure_win64
>>
>> I could create an "Allowed PCs" context like this:
>>
>> struct my_allowed_pcs {
>>     struct trampfd_values    pcs;
>>     __u64                             pc_values[3];
>> };
>>
>> const struct my_allowed_pcs    my_allowed_pcs = {
>>     { 3, 0 },
>>     (uintptr_t) ffi_closure_unix64_sse,
>>     (uintptr_t) ffi_closure_unix64,
>>     (uintptr_t) ffi_closure_win64,
>> };
>>
>> I have created a read-only array of allowed ABI handlers that closures use.
>>
>> When I set up the context for a closure trampoline, I could do this:
>>
>>     pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
>>    
>> This copies the array into the trampoline object in the kernel.
>> When the register context is set for the trampoline, the kernel checks
>> the PC register value against allowed PCs.
>>
>> Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
>> permitted target PCs enforced by the kernel are the ABI handlers.
> Sorry, when I said "statically define" meant when you knew legitimate
> targets ahead of time when you create the trampoline (i.e. whether you
> could enumerate those and know they would not change dynamically).
>
> My point was that you can achieve the same in userspace if the
> trampoline and array of legitimate targets are in read-only memory,
> without having to trap to the kernel.
>
> I think the key point here is that an adversary must be prevented from
> altering a trampoline and any associated metadata, and I think that
> there are ways of achieving that without having to trap into the kernel,
> and without the kernel having to be intimately aware of the calling
> conventions used in userspace.
>
> [...]
>
>>>> Trampfd is a framework that can be used to implement multiple things. May be,
>>>> a few of those things can also be implemented in user land itself. But I think having
>>>> just one mechanism to execute dynamic code objects is preferable to having
>>>> multiple mechanisms not standardized across all applications.
>>> In abstract, having a common interface sounds nice, but in practice
>>> elements of this are always architecture-specific (e.g. interactiosn
>>> with HW CFI), and that common interface can result in more pain as it
>>> doesn't fit naturally into the context that ISAs were designed for (e.g. 
>>> where control-flow instructions are extended with new semantics).
>> In the case of trampfd, the code generation is indeed architecture
>> specific. But that is in the kernel. The application is not affected by it.
> As an ABI detail, applications are *definitely* affected by this, and it
> is wrong to suggest they are not even if you don't have a specific case
> in mind today. As this forms a contract between userspace and the kernel
> it's overly simplistic to say that it's the kernel's problem
>
> For example, in the case of BTI on arm64, what should the trampoline
> set PSTATE.BTYPE to? Different use-cases *will* want different values,
> and not necessarily the value of PSTATE at the instant the call to the
> trampoline was made. In the case of libffi specifically using the
> original value of PSTATE.BTYPE probably is sound, but other code
> sequences may need to restrict/broaden or entirely change that.
>
>> Again, referring to the libffi reference patch, I have defined wrapper
>> functions for trampfd in common code. The architecture specific code
>> in libffi only calls the set_context function defined in common code.
>> Even this is required only because register names are specific to each
>> architecture and the target PC (to the ABI handler) is specific to
>> each architecture-ABI combo.
>>
>>> It also meass that you can't share the rough approach across OSs which
>>> do not implement an identical mechanism, so for code abstracting by ISA
>>> first, then by platform/ABI, there isn't much saving.
>> Why can you not share the same approach across OSes? In fact,
>> I have tried to design it so that other OSes can use the same
>> mechanism.
> Sure, but where they *don't*, you must fall back to the existing
> purely-userspace mechanisms, and so a codebase now has the burden of
> maintaining two distinct mechanisms.
>
> Whereas if there's a way of doing this in userspace with (stronger)
> enforcement of memory permissions the trampoline code can be common for
> when this is present or absent, which is much easier for a codebase rto
> maintain, and could make use of weaker existing mechanisms to improve
> the situation on systems without the new functionality.
>
> Thanks,
> Mark.


^ permalink raw reply

* Re: [PATCH v5 3/4] LSM: Define SELinux function to measure state and policy
From: Lakshmi Ramasubramanian @ 2020-08-04 15:57 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Mimi Zohar, Casey Schaufler, Tyler Hicks, sashal, James Morris,
	linux-integrity, SElinux list, LSM List, linux-kernel
In-Reply-To: <32da0a4a-252a-67d8-5dc8-173959f6ddb4@gmail.com>

On 8/4/20 8:29 AM, Stephen Smalley wrote:

>>> Perhaps vmalloc would be better than using kmalloc? If there are 
>>> better options for such large buffer allocation, please let me know.
>>
>> kvmalloc() can be used to select whichever one is most appropriate.
> 
> Other option would be for ima to compute and save the hash(es) of the 
> payload and not the payload itself for later use.  I guess you won't 
> know at that point which hash algorithm is desired?
> 

I think IMA hash algorithm would be known at that point, but IMA policy 
is not loaded yet (which is why I need to queue up the buffer and 
process when policy is loaded).

I tried vmalloc and tested it with upto 16MB buffer (just made up a 
SELinux policy buffer of size 16MB) - that works fine.

I will try kvmalloc().

Also, I fixed the issue with LSM data not measured when using the IMA 
policy you had. Good catch.

Will post the updated patches today.

thanks,
  -lakshmi

^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: Deven Bowers @ 2020-08-04 16:07 UTC (permalink / raw)
  To: James Bottomley, Pavel Machek, Sasha Levin
  Cc: snitzer, zohar, dm-devel, tyhicks, agk, paul, mdsakib, jmorris,
	nramas, serge, pasha.tatashin, jannh, linux-block, viro, axboe,
	corbet, linux-kernel, eparis, linux-security-module, linux-audit,
	linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <1596386606.4087.20.camel@HansenPartnership.com>



On 8/2/2020 9:43 AM, James Bottomley wrote:
> On Sun, 2020-08-02 at 16:31 +0200, Pavel Machek wrote:
>> On Sun 2020-08-02 10:03:00, Sasha Levin wrote:
>>> On Sun, Aug 02, 2020 at 01:55:45PM +0200, Pavel Machek wrote:
>>>> Hi!
>>>>
>>>>> IPE is a Linux Security Module which allows for a configurable
>>>>> policy to enforce integrity requirements on the whole system.
>>>>> It attempts to solve the issue of Code Integrity: that any code
>>>>> being executed (or files being read), are identical to the
>>>>> version that was built by a trusted source.
>>>>
>>>> How is that different from security/integrity/ima?
>>>
>>> Maybe if you would have read the cover letter all the way down to
>>> the 5th paragraph which explains how IPE is different from IMA we
>>> could avoided this mail exchange...
>>
>> "
>> IPE differs from other LSMs which provide integrity checking (for
>> instance,
>> IMA), as it has no dependency on the filesystem metadata itself. The
>> attributes that IPE checks are deterministic properties that exist
>> solely
>> in the kernel. Additionally, IPE provides no additional mechanisms of
>> verifying these files (e.g. IMA Signatures) - all of the attributes
>> of
>> verifying files are existing features within the kernel, such as
>> dm-verity
>> or fsverity.
>> "
>>
>> That is not really helpful.

Perhaps I can explain (and re-word this paragraph) a bit better.

As James indicates, IPE does try to close the gap of the IMA limitation
with xattr. I honestly wasn’t familiar with the appended signatures,
which seems fine.

Regardless, this isn’t the larger benefit that IPE provides. The
larger benefit of this is how IPE separates _mechanisms_ (properties)
to enforce integrity requirements, from _policy_. The LSM provides
policy, while things like dm-verity provide mechanism.

So to speak, IPE acts as the glue for other mechanisms to leverage a
customizable, system-wide policy to enforce. While this initial
patchset only onboards dm-verity, there’s also potential for MAC labels,
fs-verity, authenticated BTRFS, dm-integrity, etc. IPE leverages
existing systems in the kernel, while IMA uses its own.

Another difference is the general coverage. IMA has some difficulties
in covering mprotect[1], IPE doesn’t (the MAP_ANONYMOUS indicated by
Jann in that thread would be denied as the file struct would be null,
with IPE’s current set of supported mechanisms. mprotect would continue
to function as expected if you change to PROT_EXEC).

> Perhaps the big question is: If we used the existing IMA appended
> signature for detached signatures (effectively becoming the
> "properties" referred to in the cover letter) and hooked IMA into
> device mapper using additional policy terms, would that satisfy all the
> requirements this new LSM has?

Well, Mimi, what do you think? Should we integrate all the features of
IPE into IMA, or do you think they are sufficiently different in
architecture that it would be worth it to keep the code base in separate
LSMs?


[1] 
https://lore.kernel.org/linux-integrity/1588688204.5157.5.camel@linux.ibm.com/


^ permalink raw reply

* [GIT PULL] Smack patches for v5.9
From: Casey Schaufler @ 2020-08-04 17:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Security Module list, LKML, Casey Schaufler
In-Reply-To: <8ce85723-5656-0ee8-67a7-35597d9df0dd.ref@schaufler-ca.com>

Hello Linus

Here are three minor fixes to Smack for the v5.9 release.
All were found by automated checkers and have straight forward
resolution.

--
The following changes since commit b3a9e3b9622ae10064826dccb4f7a52bd88c7407:

  Linux 5.8-rc1 (2020-06-14 12:45:04 -0700)

are available in the Git repository at:

  https://github.com/cschaufler/smack-next smack-for-5.9

for you to fetch changes up to 42a2df3e829f3c5562090391b33714b2e2e5ad4a:

  Smack: prevent underflow in smk_set_cipso() (2020-07-27 13:35:12 -0700)

----------------------------------------------------------------
Dan Carpenter (2):
      Smack: fix another vsscanf out of bounds
      Smack: prevent underflow in smk_set_cipso()

Eric Biggers (1):
      Smack: fix use-after-free in smk_write_relabel_self()

 security/smack/smackfs.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)


^ permalink raw reply

* Re: [PATCH 00/18] VFS: Filesystem information [ver #21]
From: Miklos Szeredi @ 2020-08-04 19:18 UTC (permalink / raw)
  To: James Bottomley
  Cc: David Howells, Al Viro, Theodore Ts'o, Andreas Dilger,
	Eric Biggers, Jeff Layton, linux-ext4, Carlos Maiolino,
	Darrick J. Wong, Linux API, Linus Torvalds, Ian Kent,
	Miklos Szeredi, Christian Brauner, Jann Horn, Karel Zak,
	Jeff Layton, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <1596555579.10158.23.camel@HansenPartnership.com>

On Tue, Aug 4, 2020 at 5:40 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Mon, 2020-08-03 at 14:36 +0100, David Howells wrote:
> > Here's a set of patches that adds a system call, fsinfo(), that
> > allows information about the VFS, mount topology, superblock and
> > files to be retrieved.
> >
> > The patchset is based on top of the notifications patchset and allows
> > event counters implemented in the latter to be retrieved to allow
> > overruns to be efficiently managed.
>
> Could I repeat the question I asked about six months back that never
> got answered:
>
> https://lore.kernel.org/linux-api/1582316494.3376.45.camel@HansenPartnership.com/
>
> It sort of petered out into a long winding thread about why not use
> sysfs instead, which really doesn't look like a good idea to me.
>
> I'll repeat the information for those who want to quote it easily on
> reply without having to use a web interface:
>
> ---
> Could I make a suggestion about how this should be done in a way that
> doesn't actually require the fsinfo syscall at all: it could just be
> done with fsconfig.  The idea is based on something I've wanted to do
> for configfd but couldn't because otherwise it wouldn't substitute for
> fsconfig, but Christian made me think it was actually essential to the
> ability of the seccomp and other verifier tools in the critique of
> configfd and I belive the same critique applies here.
>
> Instead of making fsconfig functionally configure ... as in you pass
> the attribute name, type and parameters down into the fs specific
> handler and the handler does a string match and then verifies the
> parameters and then acts on them, make it table configured, so what
> each fstype does is register a table of attributes which can be got and
> optionally set (with each attribute having a get and optional set
> function).  We'd have multiple tables per fstype, so the generic VFS
> can register a table of attributes it understands for every fstype
> (things like name, uuid and the like) and then each fs type would
> register a table of fs specific attributes following the same pattern.
> The system would examine the fs specific table before the generic one,
> allowing overrides.  fsconfig would have the ability to both get and
> set attributes, permitting retrieval as well as setting (which is how I
> get rid of the fsinfo syscall), we'd have a global parameter, which
> would retrieve the entire table by name and type so the whole thing is
> introspectable because the upper layer knows a-priori all the
> attributes which can be set for a given fs type and what type they are
> (so we can make more of the parsing generic).  Any attribute which
> doesn't have a set routine would be read only and all attributes would
> have to have a get routine meaning everything is queryable.

fsconfig(2) takes an fd referring to an fs_context, that in turn
refers to a super_block.

So using fsconfig() for retrieving super_block attributes would be
fine (modulo value being const, and lack of buffer size).

But what about mount attributes?

I don't buy the argument that an API needs to be designed around the
requirements of seccomp and the like.  It should be the other way
round.  In that, I think your configfd idea was fine, and would answer
the above question.

Thanks,
Miklos

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox