Linux userland API discussions

* Re: [PATCH 01/13] vfs: verify param type in vfs_parse_sb_flag()
From: Miklos Szeredi @ 2019-07-01  8:45 UTC
  To: David Howells
  Cc: Al Viro, Miklos Szeredi, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

Hi David,

Ping?  Have you had a chance to look at this series?

Thanks,
Miklos

On Wed, Jun 19, 2019 at 2:30 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> vfs_parse_sb_flag() accepted any kind of param with a matching key, not
> just a flag.  This is wrong, only allow flag type and return -EINVAL
> otherwise.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fs_context.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 103643c68e3f..e56310fd8c75 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -81,30 +81,29 @@ static const char *const forbidden_sb_flag[] = {
>  /*
>   * Check for a common mount option that manipulates s_flags.
>   */
> -static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
> +static int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
>  {
> -       unsigned int token;
> +       const char *key = param->key;
> +       unsigned int set, clear;
>         unsigned int i;
>
>         for (i = 0; i < ARRAY_SIZE(forbidden_sb_flag); i++)
>                 if (strcmp(key, forbidden_sb_flag[i]) == 0)
>                         return -EINVAL;
>
> -       token = lookup_constant(common_set_sb_flag, key, 0);
> -       if (token) {
> -               fc->sb_flags |= token;
> -               fc->sb_flags_mask |= token;
> -               return 0;
> -       }
> +       set = lookup_constant(common_set_sb_flag, key, 0);
> +       clear = lookup_constant(common_clear_sb_flag, key, 0);
> +       if (!set && !clear)
> +               return -ENOPARAM;
>
> -       token = lookup_constant(common_clear_sb_flag, key, 0);
> -       if (token) {
> -               fc->sb_flags &= ~token;
> -               fc->sb_flags_mask |= token;
> -               return 0;
> -       }
> +       if (param->type != fs_value_is_flag)
> +               return invalf(fc, "%s: Unexpected value for '%s'",
> +                             fc->fs_type->name, param->key);
>
> -       return -ENOPARAM;
> +       fc->sb_flags |= set;
> +       fc->sb_flags &= ~clear;
> +       fc->sb_flags_mask |= set | clear;
> +       return 0;
>  }
>
>  /**
> @@ -130,7 +129,7 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
>         if (!param->key)
>                 return invalf(fc, "Unnamed parameter\n");
>
> -       ret = vfs_parse_sb_flag(fc, param->key);
> +       ret = vfs_parse_sb_flag(fc, param);
>         if (ret != -ENOPARAM)
>                 return ret;
>
> --
> 2.21.0
>
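
For readers following along, the reworked lookup in the quoted patch can be
sketched in userspace C roughly as below.  This is only an illustration under
stated assumptions: every sketch_ name, the two-entry tables and the flag
values are made up for the example, and SKETCH_ENOPARAM merely stands in for
the kernel-internal ENOPARAM.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the real values live in kernel headers. */
#define SKETCH_ENOPARAM		519
#define SKETCH_SB_RDONLY	0x01u
#define SKETCH_SB_SYNCHRONOUS	0x10u

enum sketch_value_type { SKETCH_IS_FLAG, SKETCH_IS_STRING };

struct sketch_param {
	const char *key;
	enum sketch_value_type type;
};

struct sketch_ctx {
	unsigned int sb_flags;
	unsigned int sb_flags_mask;
};

struct sketch_constant {
	const char *name;
	unsigned int value;
};

/* Keys that set a flag, and keys that clear the same flag. */
static const struct sketch_constant sketch_set_flags[] = {
	{ "ro",   SKETCH_SB_RDONLY },
	{ "sync", SKETCH_SB_SYNCHRONOUS },
};

static const struct sketch_constant sketch_clear_flags[] = {
	{ "rw",    SKETCH_SB_RDONLY },
	{ "async", SKETCH_SB_SYNCHRONOUS },
};

/* Return the value for key, or 0 if key is not in the table. */
static unsigned int sketch_lookup(const struct sketch_constant *tbl,
				  size_t n, const char *key)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].name, key) == 0)
			return tbl[i].value;
	return 0;
}

static int sketch_parse_sb_flag(struct sketch_ctx *ctx,
				const struct sketch_param *p)
{
	unsigned int set = sketch_lookup(sketch_set_flags, 2, p->key);
	unsigned int clear = sketch_lookup(sketch_clear_flags, 2, p->key);

	if (!set && !clear)
		return -SKETCH_ENOPARAM;	/* not a common sb flag */
	if (p->type != SKETCH_IS_FLAG)
		return -EINVAL;			/* e.g. "ro=foo" */

	ctx->sb_flags |= set;
	ctx->sb_flags &= ~clear;
	ctx->sb_flags_mask |= set | clear;
	return 0;
}
```

The point of the ordering is that an unknown key is still passed on to the
filesystem, while a known key with a value attached is rejected outright.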

* Re: [PATCH 2/6] Adjust watch_queue documentation to mention mount and superblock watches. [ver #5]
From: David Howells @ 2019-07-01  8:52 UTC
  To: Randy Dunlap
  Cc: dhowells, viro, Casey Schaufler, Stephen Smalley,
	Greg Kroah-Hartman, nicolas.dichtel, raven, Christian Brauner,
	keyrings, linux-usb, linux-security-module, linux-fsdevel,
	linux-api, linux-block, linux-kernel
In-Reply-To: <7a288c2c-11a1-87df-9550-b247d6ce3010@infradead.org>

Randy Dunlap <rdunlap@infradead.org> wrote:

> I'm having a little trouble parsing that sentence.
> Could you clarify it or maybe rewrite/modify it?
> Thanks.

How about:

  * ``info_filter`` and ``info_mask`` act as a filter on the info field of the
    notification record.  The notification is only written into the buffer if::

	(watch.info & info_mask) == info_filter

    This could be used, for example, to ignore events that are not exactly on
    the watched point in a mount tree by specifying NOTIFY_MOUNT_IN_SUBTREE
    must not be set, e.g.::

	{
		.type = WATCH_TYPE_MOUNT_NOTIFY,
		.info_filter = 0,
		.info_mask = NOTIFY_MOUNT_IN_SUBTREE,
		.subtype_filter = ...,
	}

    as an event would be only permissible with this filter if::

    	(watch.info & NOTIFY_MOUNT_IN_SUBTREE) == 0
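
The filter test described above could be sketched in plain C as follows.  The
struct, the function name and the bit value chosen for the subtree flag are
assumptions made for the illustration and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value, for illustration only. */
#define SKETCH_NOTIFY_MOUNT_IN_SUBTREE 0x0200u

struct sketch_filter {
	uint32_t info_filter;	/* required value of the masked bits */
	uint32_t info_mask;	/* which bits of info are examined */
};

/* A record is written only if its masked info equals info_filter. */
static bool sketch_filter_passes(const struct sketch_filter *f, uint32_t info)
{
	return (info & f->info_mask) == f->info_filter;
}
```

With info_filter = 0 and info_mask set to the subtree bit, a record passes
only when that bit is clear; all other bits of info are ignored.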

David

* Re: [PATCH v3 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Michal Hocko @ 2019-07-01 10:22 UTC
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, oleksandr, hdanton, lizeb, Dave Hansen,
	Kirill A . Shutemov
In-Reply-To: <20190701073848.GB136163@google.com>

On Mon 01-07-19 16:38:48, Minchan Kim wrote:
> 
> Hi Folks,
> 
> Do you have any comments? I think this has been pending long
> enough. If there are no further comments, I would like to ask for a merge.

This is definitely on my todo list for this week, but please be patient.
It's been _one_ work day since you posted this last version, so I do not
think this has been stalled for too long. Sure, the current version is
probably not much different from the previous one, but I haven't had a
chance to review it in depth yet. All the code duplication doesn't make
that much easier, but I understand your reasoning that sharing more code
is not really straightforward.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 01/11] vfs: syscall: Add fsinfo() to query filesystem information [ver #15]
From: Christian Brauner @ 2019-07-01 10:40 UTC
  To: David Howells
  Cc: viro, raven, mszeredi, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <156173662509.14042.3867242748127323502.stgit@warthog.procyon.org.uk>

On Fri, Jun 28, 2019 at 04:43:45PM +0100, David Howells wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
> 
> ===============
> NEW SYSTEM CALL
> ===============
> 
> The new system call looks like:
> 
> 	int ret = fsinfo(int dfd,
> 			 const char *filename,
> 			 const struct fsinfo_params *params,
> 			 void *buffer,
> 			 size_t buf_size);
> 
> The params parameter optionally points to a block of parameters:
> 
> 	struct fsinfo_params {
> 		__u32	at_flags;
> 		__u32	request;
> 		__u32	Nth;
> 		__u32	Mth;
> 		__u64	__reserved[3];
> 	};
> 
> If params is NULL, it is assumed params->request should be
> fsinfo_attr_statfs, params->Nth should be 0, params->Mth should be 0 and
> params->at_flags should be 0.
> 
> If params is given, all of params->__reserved[] must be 0.
> 
> dfd, filename and params->at_flags indicate the file to query.  There is no
> equivalent of lstat() as that can be emulated with fsinfo() by setting
> AT_SYMLINK_NOFOLLOW in params->at_flags.  There is also no equivalent of
> fstat() as that can be emulated by passing a NULL filename to fsinfo() with
> the fd of interest in dfd.  AT_NO_AUTOMOUNT can also be used to allow an
> automount point to be queried without triggering it.
> 
> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	FSINFO_ATTR_STATFS		- statfs-style info
> 	FSINFO_ATTR_FSINFO		- Information about fsinfo()
> 	FSINFO_ATTR_IDS			- Filesystem IDs
> 	FSINFO_ATTR_LIMITS		- Filesystem limits
> 	FSINFO_ATTR_SUPPORTS		- What's supported in statx(), IOC flags
> 	FSINFO_ATTR_CAPABILITIES	- Filesystem capabilities
> 	FSINFO_ATTR_TIMESTAMP_INFO	- Inode timestamp info
> 	FSINFO_ATTR_VOLUME_ID		- Volume ID (string)
> 	FSINFO_ATTR_VOLUME_UUID		- Volume UUID
> 	FSINFO_ATTR_VOLUME_NAME		- Volume name (string)
> 	FSINFO_ATTR_NAME_ENCODING	- Filename encoding (string)
> 	FSINFO_ATTR_NAME_CODEPAGE	- Filename codepage (string)
> 
> Some attributes (such as the servers backing a network filesystem) can have
> multiple values.  These can be enumerated by setting params->Nth and
> params->Mth to 0, 1, ... until ENODATA is returned.
> 
> buffer and buf_size point to the reply buffer.  The buffer is filled up to
> the specified size, even if this means truncating the reply.  The full size
> of the reply is returned.  In future versions, this will allow extra fields
> to be tacked on to the end of the reply, but anyone not expecting them will
> only get the subset they're expecting.  If either buffer or buf_size is 0,
> no copy will take place and the data size will be returned.
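
The reply-buffer rule in the paragraph above (copy at most buf_size bytes,
always return the full size) can be sketched in userspace as below.  The
function name is hypothetical and this is not the kernel code; it only
mirrors the semantics the description states.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy as much of the reply as fits and return the full reply size, so the
 * caller can detect truncation by comparing the result with buf_size.
 */
static size_t sketch_copy_reply(void *buffer, size_t buf_size,
				const void *reply, size_t reply_size)
{
	if (buffer && buf_size) {
		size_t n = reply_size < buf_size ? reply_size : buf_size;

		memcpy(buffer, reply, n);
	}
	return reply_size;
}
```

A caller that passes a NULL buffer or a zero size gets the required size back
and can allocate accordingly before asking again.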
> 
> At the moment, this will only work on x86_64 and i386 as it requires the
> system call to be wired up.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |    1 
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  fs/Kconfig                             |    7 
>  fs/Makefile                            |    1 
>  fs/fsinfo.c                            |  545 ++++++++++++++++++++++++++++++++
>  include/linux/fs.h                     |    5 
>  include/linux/fsinfo.h                 |   65 ++++
>  include/linux/syscalls.h               |    4 
>  include/uapi/asm-generic/unistd.h      |    4 
>  include/uapi/linux/fsinfo.h            |  219 +++++++++++++
>  kernel/sys_ni.c                        |    1 
>  samples/vfs/Makefile                   |    4 
>  samples/vfs/test-fsinfo.c              |  551 ++++++++++++++++++++++++++++++++
>  13 files changed, 1407 insertions(+), 1 deletion(-)
>  create mode 100644 fs/fsinfo.c
>  create mode 100644 include/linux/fsinfo.h
>  create mode 100644 include/uapi/linux/fsinfo.h
>  create mode 100644 samples/vfs/test-fsinfo.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ad968b7bac72..03decae51513 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
>  432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
>  433	i386	fspick			sys_fspick			__ia32_sys_fspick
> +434	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index b4e6f9e6204a..ea63df9a1020 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  431	common	fsconfig		__x64_sys_fsconfig
>  432	common	fsmount			__x64_sys_fsmount
>  433	common	fspick			__x64_sys_fspick
> +434	common	fsinfo			__x64_sys_fsinfo
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/Kconfig b/fs/Kconfig
> index cbbffc8b9ef5..9e7d2f2c0111 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
>  	  Enable this to perform validation of the parameter description for a
>  	  filesystem when it is registered.
>  
> +config FSINFO

Hm, any reason why we would hide this syscall behind a config option?

> +	bool "Enable the fsinfo() system call"
> +	help
> +	  Enable the file system information querying system call to allow
> +	  comprehensive information to be retrieved about a filesystem,
> +	  superblock or mount object.
> +
>  if BLOCK
>  
>  config FS_IOMAP
> diff --git a/fs/Makefile b/fs/Makefile
> index c9aea23aba56..26eaeae4b9a1 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -53,6 +53,7 @@ obj-$(CONFIG_SYSCTL)		+= drop_caches.o
>  
>  obj-$(CONFIG_FHANDLE)		+= fhandle.o
>  obj-$(CONFIG_FS_IOMAP)		+= iomap.o
> +obj-$(CONFIG_FSINFO)		+= fsinfo.o
>  
>  obj-y				+= quota/
>  
> diff --git a/fs/fsinfo.c b/fs/fsinfo.c
> new file mode 100644
> index 000000000000..09e743b16235
> --- /dev/null
> +++ b/fs/fsinfo.c
> @@ -0,0 +1,545 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query.
> + *
> + * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#include <linux/syscalls.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/statfs.h>
> +#include <linux/security.h>
> +#include <linux/uaccess.h>
> +#include <linux/fsinfo.h>
> +#include <uapi/linux/mount.h>
> +#include "internal.h"
> +
> +static u32 calc_mount_attrs(u32 mnt_flags)
> +{
> +	u32 attrs = 0;
> +
> +	if (mnt_flags & MNT_READONLY)
> +		attrs |= MOUNT_ATTR_RDONLY;
> +	if (mnt_flags & MNT_NOSUID)
> +		attrs |= MOUNT_ATTR_NOSUID;
> +	if (mnt_flags & MNT_NODEV)
> +		attrs |= MOUNT_ATTR_NODEV;
> +	if (mnt_flags & MNT_NOEXEC)
> +		attrs |= MOUNT_ATTR_NOEXEC;
> +	if (mnt_flags & MNT_NODIRATIME)
> +		attrs |= MOUNT_ATTR_NODIRATIME;
> +
> +	if (mnt_flags & MNT_NOATIME)
> +		attrs |= MOUNT_ATTR_NOATIME;
> +	else if (mnt_flags & MNT_RELATIME)
> +		attrs |= MOUNT_ATTR_RELATIME;
> +	else
> +		attrs |= MOUNT_ATTR_STRICTATIME;
> +	return attrs;
> +}
> +
> +/*
> + * Get basic filesystem stats from statfs.
> + */
> +static int fsinfo_generic_statfs(struct path *path, struct fsinfo_statfs *p)
> +{
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0)
> +		return ret;
> +
> +	p->f_blocks.hi	= 0;
> +	p->f_blocks.lo	= buf.f_blocks;
> +	p->f_bfree.hi	= 0;
> +	p->f_bfree.lo	= buf.f_bfree;
> +	p->f_bavail.hi	= 0;
> +	p->f_bavail.lo	= buf.f_bavail;
> +	p->f_files.hi	= 0;
> +	p->f_files.lo	= buf.f_files;
> +	p->f_ffree.hi	= 0;
> +	p->f_ffree.lo	= buf.f_ffree;
> +	p->f_favail.hi	= 0;
> +	p->f_favail.lo	= buf.f_ffree;
> +	p->f_bsize	= buf.f_bsize;
> +	p->f_frsize	= buf.f_frsize;
> +
> +	p->mnt_attrs	= calc_mount_attrs(path->mnt->mnt_flags);
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_ids(struct path *path, struct fsinfo_ids *p)
> +{
> +	struct super_block *sb;
> +	struct kstatfs buf;
> +	int ret;
> +
> +	ret = vfs_statfs(path, &buf);
> +	if (ret < 0 && ret != -ENOSYS)
> +		return ret;
> +
> +	sb = path->dentry->d_sb;
> +	p->f_fstype	= sb->s_magic;
> +	p->f_dev_major	= MAJOR(sb->s_dev);
> +	p->f_dev_minor	= MINOR(sb->s_dev);
> +
> +	memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
> +	strlcpy(p->f_fs_name, path->dentry->d_sb->s_type->name,
> +		sizeof(p->f_fs_name));
> +	return sizeof(*p);
> +}
> +
> +static int fsinfo_generic_limits(struct path *path, struct fsinfo_limits *lim)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	lim->max_file_size.hi = 0;
> +	lim->max_file_size.lo = sb->s_maxbytes;
> +	lim->max_hard_links = sb->s_max_links;
> +	lim->max_uid = UINT_MAX;
> +	lim->max_gid = UINT_MAX;
> +	lim->max_projid = UINT_MAX;
> +	lim->max_filename_len = NAME_MAX;
> +	lim->max_symlink_len = PAGE_SIZE;
> +	lim->max_xattr_name_len = XATTR_NAME_MAX;
> +	lim->max_xattr_body_len = XATTR_SIZE_MAX;
> +	lim->max_dev_major = 0xffffff;
> +	lim->max_dev_minor = 0xff;
> +	return sizeof(*lim);
> +}
> +
> +static int fsinfo_generic_supports(struct path *path, struct fsinfo_supports *c)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	c->stx_mask = STATX_BASIC_STATS;
> +	if (sb->s_d_op && sb->s_d_op->d_automount)
> +		c->stx_attributes |= STATX_ATTR_AUTOMOUNT;
> +	return sizeof(*c);
> +}
> +
> +static int fsinfo_generic_capabilities(struct path *path,
> +				       struct fsinfo_capabilities *c)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	if (sb->s_mtd)
> +		fsinfo_set_cap(c, FSINFO_CAP_IS_FLASH_FS);
> +	else if (sb->s_bdev)
> +		fsinfo_set_cap(c, FSINFO_CAP_IS_BLOCK_FS);
> +
> +	if (sb->s_quota_types & QTYPE_MASK_USR)
> +		fsinfo_set_cap(c, FSINFO_CAP_USER_QUOTAS);
> +	if (sb->s_quota_types & QTYPE_MASK_GRP)
> +		fsinfo_set_cap(c, FSINFO_CAP_GROUP_QUOTAS);
> +	if (sb->s_quota_types & QTYPE_MASK_PRJ)
> +		fsinfo_set_cap(c, FSINFO_CAP_PROJECT_QUOTAS);
> +	if (sb->s_d_op && sb->s_d_op->d_automount)
> +		fsinfo_set_cap(c, FSINFO_CAP_AUTOMOUNTS);
> +	if (sb->s_id[0])
> +		fsinfo_set_cap(c, FSINFO_CAP_VOLUME_ID);
> +
> +	fsinfo_set_cap(c, FSINFO_CAP_HAS_ATIME);
> +	fsinfo_set_cap(c, FSINFO_CAP_HAS_CTIME);
> +	fsinfo_set_cap(c, FSINFO_CAP_HAS_MTIME);
> +	return sizeof(*c);
> +}
> +
> +static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
> +	.atime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.mtime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.ctime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +	.btime = {
> +		.minimum	= S64_MIN,
> +		.maximum	= S64_MAX,
> +		.gran_mantissa	= 1,
> +		.gran_exponent	= 0,
> +	},
> +};
> +
> +static int fsinfo_generic_timestamp_info(struct path *path,
> +					 struct fsinfo_timestamp_info *ts)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +	s8 exponent;
> +
> +	*ts = fsinfo_default_timestamp_info;
> +
> +

nit: redundant newline

> +	if (sb->s_time_gran < 1000000000) {
> +		if (sb->s_time_gran < 1000)
> +			exponent = -9;
> +		else if (sb->s_time_gran < 1000000)
> +			exponent = -6;
> +		else
> +			exponent = -3;
> +
> +		ts->atime.gran_exponent = exponent;
> +		ts->mtime.gran_exponent = exponent;
> +		ts->ctime.gran_exponent = exponent;
> +		ts->btime.gran_exponent = exponent;
> +	}
> +
> +	return sizeof(*ts);
> +}
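
For reference, the granularity mapping in the quoted function boils down to
the following standalone sketch.  The function name is made up; the
thresholds are the ones in the quoted code, with the table default of 0 for
anything at or above one second.

```c
#include <stdint.h>

/* Map a timestamp granularity in nanoseconds to a power-of-ten exponent. */
static int sketch_gran_exponent(uint32_t time_gran_ns)
{
	if (time_gran_ns >= 1000000000u)
		return 0;	/* default: whole seconds */
	if (time_gran_ns < 1000u)
		return -9;	/* nanoseconds */
	if (time_gran_ns < 1000000u)
		return -6;	/* microseconds */
	return -3;		/* milliseconds */
}
```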
> +
> +static int fsinfo_generic_volume_uuid(struct path *path,
> +				      struct fsinfo_volume_uuid *vu)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +
> +	memcpy(vu, &sb->s_uuid, sizeof(*vu));
> +	return sizeof(*vu);
> +}
> +
> +static int fsinfo_generic_volume_id(struct path *path, char *buf)
> +{
> +	struct super_block *sb = path->dentry->d_sb;
> +	size_t len = strlen(sb->s_id);
> +
> +	memcpy(buf, sb->s_id, len + 1);
> +	return len;
> +}
> +
> +static int fsinfo_generic_name_encoding(struct path *path, char *buf)
> +{
> +	static const char encoding[] = "utf8";
> +
> +	memcpy(buf, encoding, sizeof(encoding) - 1);
> +	return sizeof(encoding) - 1;

Do we not have any dumb helpers for scenarios like this?

#define strlen_literal(x) (sizeof(""x"") - 1)
#define strlen_array(x) (sizeof(x) - 1)

Repeating sizeof(bla) - 1 seems like a good way to forget that -1 later
on :)
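
A minimal userspace sketch of how such helpers might be used is below.  The
two macros are the ones suggested above (with spaces added around x); the
sketch_ names are hypothetical and exist only for the example.

```c
#include <stddef.h>
#include <string.h>

/* Length of a string literal or char array, excluding the trailing NUL. */
#define strlen_literal(x) (sizeof("" x "") - 1)
#define strlen_array(x) (sizeof(x) - 1)

static const char sketch_encoding[] = "utf8";

/* Copy the encoding name without its NUL and return the length. */
static size_t sketch_name_encoding(char *buf)
{
	memcpy(buf, sketch_encoding, strlen_array(sketch_encoding));
	return strlen_array(sketch_encoding);
}
```

Note that strlen_array() is only meaningful for real arrays; handed a
pointer, it would silently yield sizeof(pointer) - 1.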

> +}
> +
> +/*
> + * Implement some queries generically from stuff in the superblock.
> + */
> +int generic_fsinfo(struct path *path, struct fsinfo_kparams *params)
> +{
> +#define _gen(X, Y) FSINFO_ATTR_##X: return fsinfo_generic_##Y(path, params->buffer)
> +
> +	switch (params->request) {
> +	case _gen(STATFS,		statfs);
> +	case _gen(IDS,			ids);
> +	case _gen(LIMITS,		limits);
> +	case _gen(SUPPORTS,		supports);
> +	case _gen(CAPABILITIES,		capabilities);
> +	case _gen(TIMESTAMP_INFO,	timestamp_info);
> +	case _gen(VOLUME_UUID,		volume_uuid);
> +	case _gen(VOLUME_ID,		volume_id);
> +	case _gen(NAME_ENCODING,	name_encoding);
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}

[1]:
*grumble* *grumble*
Formal complaint about these code-generating macros again. :)
But fine. :)

> +EXPORT_SYMBOL(generic_fsinfo);
> +
> +/*
> + * Retrieve the filesystem info.  We make some stuff up if the operation is not
> + * supported.
> + */
> +static int vfs_fsinfo(struct path *path, struct fsinfo_kparams *params)
> +{
> +	struct dentry *dentry = path->dentry;
> +	int (*fsinfo)(struct path *, struct fsinfo_kparams *);
> +	int ret;
> +
> +	if (params->request == FSINFO_ATTR_FSINFO) {
> +		struct fsinfo_fsinfo *info = params->buffer;
> +
> +		info->max_attr	= FSINFO_ATTR__NR;
> +		info->max_cap	= FSINFO_CAP__NR;
> +		return sizeof(*info);
> +	}
> +
> +	fsinfo = dentry->d_sb->s_op->fsinfo;
> +	if (!fsinfo) {
> +		if (!dentry->d_sb->s_op->statfs)
> +			return -EOPNOTSUPP;
> +		fsinfo = generic_fsinfo;
> +	}
> +
> +	ret = security_sb_statfs(dentry);
> +	if (ret)
> +		return ret;
> +
> +	if (!params->overlarge)
> +		return fsinfo(path, params);
> +
> +	while (!signal_pending(current)) {
> +		params->usage = 0;
> +		ret = fsinfo(path, params);
> +		if (IS_ERR_VALUE((long)ret))
> +			return ret; /* Error */
> +		if ((unsigned int)ret <= params->buf_size)

if ((size_t)ret ...? Just for the sake of clarity if for nothing else.

> +			return ret; /* It fitted */

Ok, a little confused here, tbh. params->buf_size is size_t and this
function returns an int. I forget whether you mentioned this before:
buf_size can't exceed INT_MAX?
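
To make the concern concrete, here is a small sketch of the two checks being
discussed, written with explicit size_t casts.  Both helper names are
hypothetical and this is not the patch's code.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True if a non-negative int result fits in a buffer of buf_size bytes. */
static bool sketch_result_fits(int ret, size_t buf_size)
{
	return ret >= 0 && (size_t)ret <= buf_size;
}

/* True if a size can still be reported through an int return value. */
static bool sketch_size_reportable(size_t size)
{
	return size <= (size_t)INT_MAX;
}
```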

> +		kvfree(params->buffer);
> +		params->buffer = NULL;
> +		params->buf_size = roundup(ret, PAGE_SIZE);
> +		if (params->buf_size > INT_MAX)
> +			return -ETOOSMALL;
> +		params->buffer = kvmalloc(params->buf_size, GFP_KERNEL);
> +		if (!params->buffer)
> +			return -ENOMEM;
> +	}
> +
> +	return -ERESTARTSYS;
> +}
> +
> +static int vfs_fsinfo_path(int dfd, const char __user *pathname,
> +			   struct fsinfo_kparams *params)
> +{
> +	struct path path;
> +	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
> +	int ret = -EINVAL;
> +
> +	if ((params->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
> +				 AT_EMPTY_PATH)) != 0)
> +		return -EINVAL;
> +
> +	if (params->at_flags & AT_SYMLINK_NOFOLLOW)
> +		lookup_flags &= ~LOOKUP_FOLLOW;
> +	if (params->at_flags & AT_NO_AUTOMOUNT)
> +		lookup_flags &= ~LOOKUP_AUTOMOUNT;
> +	if (params->at_flags & AT_EMPTY_PATH)
> +		lookup_flags |= LOOKUP_EMPTY;
> +
> +retry:
> +	ret = user_path_at(dfd, pathname, lookup_flags, &path);
> +	if (ret)
> +		goto out;
> +
> +	ret = vfs_fsinfo(&path, params);
> +	path_put(&path);
> +	if (retry_estale(ret, lookup_flags)) {
> +		lookup_flags |= LOOKUP_REVAL;
> +		goto retry;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_kparams *params)
> +{
> +	struct fd f = fdget_raw(fd);
> +	int ret = -EBADF;
> +
> +	if (f.file) {
> +		ret = vfs_fsinfo(&f.file->f_path, params);
> +		fdput(f);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Return buffer information by requestable attribute.
> + *
> + * STRUCT	- a fixed-size structure with only one instance.
> + * STRUCT_N	- a sequence of STRUCTs, indexed by Nth
> + * STRUCT_NM	- a sequence of sequences of STRUCTs, indexed by Nth, Mth
> + * STRING	- a string with only one instance.
> + * STRING_N	- a sequence of STRING, indexed by Nth
> + * STRING_NM	- a sequence of sequences of STRING, indexed by Nth, Mth
> + * OPAQUE	- a blob that can be larger than 4K.
> + * STRUCT_ARRAY - an array of structs that can be larger than 4K
> + *
> + * If an entry is marked STRUCT, STRUCT_N or STRUCT_NM then if no buffer is
> + * supplied to sys_fsinfo(), sys_fsinfo() will handle returning the buffer size
> + * without calling vfs_fsinfo() and the filesystem.
> + *
> + * No struct may have more than 4K bytes.
> + */
> +struct fsinfo_attr_info {
> +	u8 type;
> +	u8 flags;
> +	u16 size;
> +};
> +
> +#define __FSINFO_STRUCT		0
> +#define __FSINFO_STRING		1
> +#define __FSINFO_OPAQUE		2
> +#define __FSINFO_STRUCT_ARRAY	3
> +#define __FSINFO_0		0
> +#define __FSINFO_N		0x0001
> +#define __FSINFO_NM		0x0002
> +
> +#define _Z(T, F, S) { .type = __FSINFO_##T, .flags = __FSINFO_##F, .size = S }
> +#define FSINFO_STRING(X)	 [FSINFO_ATTR_##X] = _Z(STRING, 0, 0)
> +#define FSINFO_STRUCT(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, 0, sizeof(struct fsinfo_##Y))
> +#define FSINFO_STRING_N(X)	 [FSINFO_ATTR_##X] = _Z(STRING, N, 0)
> +#define FSINFO_STRUCT_N(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, N, sizeof(struct fsinfo_##Y))
> +#define FSINFO_STRING_NM(X)	 [FSINFO_ATTR_##X] = _Z(STRING, NM, 0)
> +#define FSINFO_STRUCT_NM(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, NM, sizeof(struct fsinfo_##Y))
> +#define FSINFO_OPAQUE(X)	 [FSINFO_ATTR_##X] = _Z(OPAQUE, 0, 0)
> +#define FSINFO_STRUCT_ARRAY(X,Y) [FSINFO_ATTR_##X] = _Z(STRUCT_ARRAY, 0, sizeof(struct fsinfo_##Y))
> +
> +static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
> +	FSINFO_STRUCT		(STATFS,		statfs),
> +	FSINFO_STRUCT		(FSINFO,		fsinfo),
> +	FSINFO_STRUCT		(IDS,			ids),
> +	FSINFO_STRUCT		(LIMITS,		limits),
> +	FSINFO_STRUCT		(CAPABILITIES,		capabilities),
> +	FSINFO_STRUCT		(SUPPORTS,		supports),
> +	FSINFO_STRUCT		(TIMESTAMP_INFO,	timestamp_info),
> +	FSINFO_STRING		(VOLUME_ID),
> +	FSINFO_STRUCT		(VOLUME_UUID,		volume_uuid),
> +	FSINFO_STRING		(VOLUME_NAME),
> +	FSINFO_STRING		(NAME_ENCODING),
> +	FSINFO_STRING		(NAME_CODEPAGE),
> +};

See [1]. :)
Is it really worth it to have this code-generating stuff in there?
I urge you to think about git grep users. For them this is an absolute
nightmare. :)
It's also annoying because one needs to expand the macros to review the
fsinfo() syscall below, which switches on a lot of the stuff you define
here.

> +
> +/**
> + * sys_fsinfo - System call to get filesystem information
> + * @dfd: Base directory to pathwalk from or fd referring to filesystem.
> + * @pathname: Filesystem to query or NULL.
> + * @_params: Parameters to define request (or NULL for enhanced statfs).
> + * @user_buffer: Result buffer.
> + * @user_buf_size: Size of result buffer.
> + *
> + * Get information on a filesystem.  The filesystem attribute to be queried is
> + * indicated by @_params->request, and some of the attributes can have multiple
> + * values, indexed by @_params->Nth and @_params->Mth.  If @_params is NULL,
> + * then the 0th fsinfo_attr_statfs attribute is queried.  If an attribute does
> + * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
> + * ENODATA is returned.
> + *
> + * On success, the size of the attribute's value is returned.  If
> + * @user_buf_size is 0 or @user_buffer is NULL, only the size is returned.  If
> + * the size of the value is larger than @user_buf_size, it will be truncated by
> + * the copy.  If the size of the value is smaller than @user_buf_size then the
> + * excess buffer space will be cleared.  The full size of the value will be
> + * returned, irrespective of how much data is actually placed in the buffer.
> + */
> +SYSCALL_DEFINE5(fsinfo,
> +		int, dfd, const char __user *, pathname,
> +		struct fsinfo_params __user *, params,
> +		void __user *, user_buffer, size_t, user_buf_size)
> +{
> +	struct fsinfo_attr_info info;
> +	struct fsinfo_params user_params;
> +	struct fsinfo_kparams kparams;
> +	unsigned int result_size;

Wouldn't it be better if this could be a size_t?

> +	int ret;
> +
> +	memset(&kparams, 0, sizeof(kparams));
> +
> +	if (params) {
> +		if (copy_from_user(&user_params, params, sizeof(user_params)))
> +			return -EFAULT;
> +		if (user_params.__reserved[0] ||
> +		    user_params.__reserved[1] ||
> +		    user_params.__reserved[2])
> +			return -EINVAL;
> +		if (user_params.request >= FSINFO_ATTR__NR)
> +			return -EOPNOTSUPP;
> +		kparams.at_flags = user_params.at_flags;
> +		kparams.request = user_params.request;
> +		kparams.Nth = user_params.Nth;
> +		kparams.Mth = user_params.Mth;
> +	} else {
> +		kparams.request = FSINFO_ATTR_STATFS;
> +	}
> +
> +	if (!user_buffer || !user_buf_size) {

Maybe we could be a little more strict and require that both be set to
their respective zero values, i.e. only support reporting the size if
!user_buffer && user_buf_size == 0. If only one of them is set to its
zero value, we report EINVAL.
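
Something along these lines, as a rough userspace sketch (the helper name and
its 1/0 return convention are made up for illustration):

```c
#include <errno.h>
#include <stddef.h>

/*
 * Returns 1 for a size-only query (NULL buffer and zero size), 0 for a
 * normal call with a real buffer, and -EINVAL if only one of the two is
 * set to its zero value.
 */
static int sketch_classify_buffer(const void *user_buffer,
				  size_t user_buf_size)
{
	if (!user_buffer && user_buf_size == 0)
		return 1;
	if (!user_buffer || user_buf_size == 0)
		return -EINVAL;
	return 0;
}
```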

> +		user_buf_size = 0;
> +		user_buffer = NULL;
> +	}
> +
> +	/* Allocate an appropriately-sized buffer.  We will truncate the
> +	 * contents when we write the contents back to userspace.
> +	 */
> +	info = fsinfo_buffer_info[kparams.request];
> +	if (kparams.Nth != 0 && !(info.flags & (__FSINFO_N | __FSINFO_NM)))
> +		return -ENODATA;
> +	if (kparams.Mth != 0 && !(info.flags & __FSINFO_NM))
> +		return -ENODATA;
> +
> +	switch (info.type) {
> +	case __FSINFO_STRUCT:
> +		kparams.buf_size = info.size;
> +		if (user_buf_size == 0)
> +			return info.size; /* We know how big the buffer should be */
> +		break;
> +
> +	case __FSINFO_STRING:
> +		kparams.buf_size = FSINFO_NORMAL_ATTR_MAX_SIZE;
> +		break;
> +
> +	case __FSINFO_OPAQUE:
> +	case __FSINFO_STRUCT_ARRAY:
> +		/* Opaque blob or array of struct elements.  We also create a
> +		 * buffer that can be used for scratch space.
> +		 */
> +		ret = -ENOMEM;
> +		kparams.scratch_buffer = kmalloc(FSINFO_SCRATCH_BUFFER_SIZE,
> +						GFP_KERNEL);
> +		if (!kparams.scratch_buffer)
> +			goto error;
> +		kparams.overlarge = true;
> +		kparams.buf_size = FSINFO_NORMAL_ATTR_MAX_SIZE;
> +		break;
> +
> +	default:
> +		return -ENOBUFS;
> +	}
> +
> +	/* We always allocate a buffer for a string, even if buf_size == 0 and
> +	 * we're not going to return any data.  This means that the filesystem
> +	 * code needn't care about whether the buffer actually exists or not.
> +	 */
> +	ret = -ENOMEM;
> +	kparams.buffer = kvzalloc(kparams.buf_size, GFP_KERNEL);
> +	if (!kparams.buffer)
> +		goto error_scratch;
> +
> +	if (pathname)
> +		ret = vfs_fsinfo_path(dfd, pathname, &kparams);
> +	else
> +		ret = vfs_fsinfo_fd(dfd, &kparams);
> +	if (ret < 0)
> +		goto error_buffer;
> +
> +	result_size = ret;
> +	if (result_size > user_buf_size)
> +		result_size = user_buf_size;
> +
> +	if (result_size > 0 &&
> +	    copy_to_user(user_buffer, kparams.buffer, result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error_buffer;
> +	}
> +
> +	/* Clear any part of the buffer that we won't fill if we're putting a
> +	 * struct in there.  Strings, opaque objects and arrays are expected to
> +	 * be variable length.
> +	 */
> +	if (info.type == __FSINFO_STRUCT &&
> +	    user_buf_size > result_size &&
> +	    clear_user(user_buffer + result_size, user_buf_size - result_size) != 0) {
> +		ret = -EFAULT;
> +		goto error_buffer;
> +	}
> +
> +error_buffer:
> +	kvfree(kparams.buffer);
> +error_scratch:
> +	kfree(kparams.scratch_buffer);
> +error:
> +	return ret;
> +}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d..50f58eac3e1f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -66,6 +66,8 @@ struct fscrypt_info;
>  struct fscrypt_operations;
>  struct fs_context;
>  struct fs_parameter_description;
> +struct fsinfo_kparams;
> +enum fsinfo_attribute;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1922,6 +1924,9 @@ struct super_operations {
>  	int (*thaw_super) (struct super_block *);
>  	int (*unfreeze_fs) (struct super_block *);
>  	int (*statfs) (struct dentry *, struct kstatfs *);
> +#ifdef CONFIG_FSINFO
> +	int (*fsinfo) (struct path *, struct fsinfo_kparams *);
> +#endif
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
> new file mode 100644
> index 000000000000..4c250136d693
> --- /dev/null
> +++ b/include/linux/fsinfo.h
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Filesystem information query
> + *
> + * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#ifndef _LINUX_FSINFO_H
> +#define _LINUX_FSINFO_H
> +
> +#ifdef CONFIG_FSINFO
> +
> +#include <uapi/linux/fsinfo.h>
> +
> +#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
> +#define FSINFO_SCRATCH_BUFFER_SIZE 4096
> +
> +struct fsinfo_kparams {
> +	__u32			at_flags;	/* AT_SYMLINK_NOFOLLOW and similar */
> +	enum fsinfo_attribute	request;	/* What is being asked for */
> +	__u32			Nth;		/* Instance of it (some may have multiple) */
> +	__u32			Mth;		/* Subinstance */
> +	bool			overlarge;	/* T if the buffer may be resized */
> +	unsigned int		usage;		/* Amount of buffer used (if overlarge=T) */
> +	unsigned int		buf_size;	/* Size of ->buffer[] */
> +	void			*buffer;	/* Where to place the reply */
> +	char			*scratch_buffer; /* 4K scratch buffer (if overlarge=T) */
> +};
> +
> +extern int generic_fsinfo(struct path *, struct fsinfo_kparams *);
> +
> +static inline void fsinfo_set_cap(struct fsinfo_capabilities *c,
> +				  enum fsinfo_capability cap)
> +{
> +	c->capabilities[cap / 8] |= 1 << (cap % 8);
> +}
> +
> +static inline void fsinfo_clear_cap(struct fsinfo_capabilities *c,
> +				    enum fsinfo_capability cap)
> +{
> +	c->capabilities[cap / 8] &= ~(1 << (cap % 8));
> +}
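
The two helpers above treat the capabilities field as a byte-array bitmap.  A
standalone sketch of the same technique, with a test helper added, is below.
The struct, the sketch_ names and the number of bits are assumptions and are
not taken from the uapi header.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CAP_NR 20	/* hypothetical number of capability bits */

struct sketch_caps {
	uint8_t bits[(SKETCH_CAP_NR + 7) / 8];
};

/* Bit N lives in byte N / 8 at bit position N % 8. */
static void sketch_set_cap(struct sketch_caps *c, unsigned int cap)
{
	c->bits[cap / 8] |= (uint8_t)(1u << (cap % 8));
}

static void sketch_clear_cap(struct sketch_caps *c, unsigned int cap)
{
	c->bits[cap / 8] &= (uint8_t)~(1u << (cap % 8));
}

static bool sketch_test_cap(const struct sketch_caps *c, unsigned int cap)
{
	return (c->bits[cap / 8] & (1u << (cap % 8))) != 0;
}
```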
> +
> +/**
> + * fsinfo_set_unix_caps - Set standard UNIX capabilities.

Hm, I'm not sure that "capabilities" is a good name here. This is
potentially misleading because of other uses of "capabilities" we
already have. Like, I don't want these capabilities to pop up when I do
git grep capabilities. Just a short way until someone also speaks of
"fscaps" or "fsinfocaps" and then confusion is basically guaranteed. :)

Maybe "features" would be better?
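
To make the suggestion a bit more concrete, here is a rough userspace
sketch of the same bitmap helpers under a "features" name. Every
identifier in it (fsinfo_features, FSINFO_FEAT_*, fsinfo_test_feature())
is hypothetical and not part of the posted patch; it only shows that
the rename is mechanical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout: none of this exists in the posted patch. */
enum fsinfo_feature {
	FSINFO_FEAT_UIDS	= 0,
	FSINFO_FEAT_GIDS	= 1,
	FSINFO_FEAT_SYMLINKS	= 9,	/* deliberately lands in the second byte */
	FSINFO_FEAT__NR
};

/* One bit per feature, rounded up to whole bytes. */
struct fsinfo_features {
	uint8_t features[(FSINFO_FEAT__NR + 7) / 8];
};

static inline void fsinfo_set_feature(struct fsinfo_features *f,
				      enum fsinfo_feature feat)
{
	f->features[feat / 8] |= (uint8_t)(1u << (feat % 8));
}

static inline void fsinfo_clear_feature(struct fsinfo_features *f,
					enum fsinfo_feature feat)
{
	f->features[feat / 8] &= (uint8_t)~(1u << (feat % 8));
}

static inline bool fsinfo_test_feature(const struct fsinfo_features *f,
				       enum fsinfo_feature feat)
{
	return f->features[feat / 8] & (1u << (feat % 8));
}
```

A "git grep capabilities" would then no longer hit any of this.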

> + * @c: The capabilities mask to alter
> + */
> +static inline void fsinfo_set_unix_caps(struct fsinfo_capabilities *caps)
> +{
> +	fsinfo_set_cap(caps, FSINFO_CAP_UIDS);
> +	fsinfo_set_cap(caps, FSINFO_CAP_GIDS);
> +	fsinfo_set_cap(caps, FSINFO_CAP_DIRECTORIES);
> +	fsinfo_set_cap(caps, FSINFO_CAP_SYMLINKS);
> +	fsinfo_set_cap(caps, FSINFO_CAP_HARD_LINKS);
> +	fsinfo_set_cap(caps, FSINFO_CAP_DEVICE_FILES);
> +	fsinfo_set_cap(caps, FSINFO_CAP_UNIX_SPECIALS);
> +	fsinfo_set_cap(caps, FSINFO_CAP_SPARSE);
> +	fsinfo_set_cap(caps, FSINFO_CAP_HAS_ATIME);
> +	fsinfo_set_cap(caps, FSINFO_CAP_HAS_CTIME);
> +	fsinfo_set_cap(caps, FSINFO_CAP_HAS_MTIME);
> +}
> +
> +#endif /* CONFIG_FSINFO */
> +
> +#endif /* _LINUX_FSINFO_H */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..958ac427ff37 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -50,6 +50,7 @@ struct stat64;
>  struct statfs;
>  struct statfs64;
>  struct statx;
> +struct fsinfo_params;
>  struct __sysctl_args;
>  struct sysinfo;
>  struct timespec;
> @@ -997,6 +998,9 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
>  asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
>  				       siginfo_t __user *info,
>  				       unsigned int flags);
> +asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
> +			   struct fsinfo_params __user *params,
> +			   void __user *buffer, size_t buf_size);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a87904daf103..50ddf5f25122 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
>  __SYSCALL(__NR_fsmount, sys_fsmount)
>  #define __NR_fspick 433
>  __SYSCALL(__NR_fspick, sys_fspick)
> +#define __NR_fsinfo 434
> +__SYSCALL(__NR_fsinfo, sys_fsinfo)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 434
> +#define __NR_syscalls 435
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> new file mode 100644
> index 000000000000..cc7e13a9b95f
> --- /dev/null
> +++ b/include/uapi/linux/fsinfo.h
> @@ -0,0 +1,219 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* fsinfo() definitions.
> + *
> + * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#ifndef _UAPI_LINUX_FSINFO_H
> +#define _UAPI_LINUX_FSINFO_H
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +
> +/*
> + * The filesystem attributes that can be requested.  Note that some attributes
> + * may have multiple instances which can be switched in the parameter block.
> + */
> +enum fsinfo_attribute {
> +	FSINFO_ATTR_STATFS		= 0,	/* statfs()-style state */
> +	FSINFO_ATTR_FSINFO		= 1,	/* Information about fsinfo() */
> +	FSINFO_ATTR_IDS			= 2,	/* Filesystem IDs */
> +	FSINFO_ATTR_LIMITS		= 3,	/* Filesystem limits */
> +	FSINFO_ATTR_SUPPORTS		= 4,	/* What's supported in statx, iocflags, ... */
> +	FSINFO_ATTR_CAPABILITIES	= 5,	/* Filesystem capabilities (bits) */
> +	FSINFO_ATTR_TIMESTAMP_INFO	= 6,	/* Inode timestamp info */
> +	FSINFO_ATTR_VOLUME_ID		= 7,	/* Volume ID (string) */
> +	FSINFO_ATTR_VOLUME_UUID		= 8,	/* Volume UUID (LE uuid) */
> +	FSINFO_ATTR_VOLUME_NAME		= 9,	/* Volume name (string) */
> +	FSINFO_ATTR_NAME_ENCODING	= 10,	/* Filename encoding (string) */
> +	FSINFO_ATTR_NAME_CODEPAGE	= 11,	/* Filename codepage (string) */
> +	FSINFO_ATTR__NR
> +};
> +
> +/*
> + * Optional fsinfo() parameter structure.
> + *
> + * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
> + * desired.
> + */
> +struct fsinfo_params {
> +	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
> +	__u32	request;	/* What is being asked for (enum fsinfo_attribute) */
> +	__u32	Nth;		/* Instance of it (some may have multiple) */
> +	__u32	Mth;		/* Subinstance of Nth instance */
> +	__u64	__reserved[3];	/* Reserved params; all must be 0 */
> +};
> +
> +struct fsinfo_u128 {
> +#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +	__u64	hi;
> +	__u64	lo;
> +#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
> +	__u64	lo;
> +	__u64	hi;
> +#endif
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_statfs).
> + * - This gives extended filesystem information.
> + */
> +struct fsinfo_statfs {
> +	struct fsinfo_u128 f_blocks;	/* Total number of blocks in fs */
> +	struct fsinfo_u128 f_bfree;	/* Total number of free blocks */
> +	struct fsinfo_u128 f_bavail;	/* Number of free blocks available to ordinary user */
> +	struct fsinfo_u128 f_files;	/* Total number of file nodes in fs */
> +	struct fsinfo_u128 f_ffree;	/* Number of free file nodes */
> +	struct fsinfo_u128 f_favail;	/* Number of file nodes available to ordinary user */
> +	__u64	f_bsize;		/* Optimal block size */
> +	__u64	f_frsize;		/* Fragment size */
> +	__u64	mnt_attrs;		/* Mount attributes (MOUNT_ATTR_*) */
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_ids).
> + *
> + * List of basic identifiers as is normally found in statfs().
> + */
> +struct fsinfo_ids {
> +	char	f_fs_name[15 + 1];	/* Filesystem name */
> +	__u64	f_fsid;			/* Short 64-bit Filesystem ID (as statfs) */
> +	__u64	f_sb_id;		/* Internal superblock ID for sbnotify()/mntnotify() */
> +	__u32	f_fstype;		/* Filesystem type from linux/magic.h [uncond] */
> +	__u32	f_dev_major;		/* As st_dev_* from struct statx [uncond] */
> +	__u32	f_dev_minor;
> +	__u32	__reserved[1];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_limits).
> + *
> + * List of supported filesystem limits.
> + */
> +struct fsinfo_limits {
> +	struct fsinfo_u128 max_file_size;	/* Maximum file size */
> +	struct fsinfo_u128 max_ino;		/* Maximum inode number */
> +	__u64	max_uid;			/* Maximum UID supported */
> +	__u64	max_gid;			/* Maximum GID supported */
> +	__u64	max_projid;			/* Maximum project ID supported */
> +	__u64	max_hard_links;			/* Maximum number of hard links on a file */
> +	__u64	max_xattr_body_len;		/* Maximum xattr content length */
> +	__u32	max_xattr_name_len;		/* Maximum xattr name length */
> +	__u32	max_filename_len;		/* Maximum filename length */
> +	__u32	max_symlink_len;		/* Maximum symlink content length */
> +	__u32	max_dev_major;			/* Maximum device major representable */
> +	__u32	max_dev_minor;			/* Maximum device minor representable */
> +	__u32	__reserved[1];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_supports).
> + *
> + * What's supported in various masks, such as statx() attribute and mask bits
> + * and IOC flags.
> + */
> +struct fsinfo_supports {
> +	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
> +	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
> +	__u32	ioc_flags;		/* What FS_IOC_* flags are supported */
> +	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
> +	__u32	__reserved[1];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_capabilities).
> + *
> + * Bitmask indicating filesystem capabilities where renderable as single bits.
> + */
> +enum fsinfo_capability {

Again, something other than "capability" might be better, e.g.
"features".

> +	FSINFO_CAP_IS_KERNEL_FS		= 0,	/* fs is kernel-special filesystem */
> +	FSINFO_CAP_IS_BLOCK_FS		= 1,	/* fs is block-based filesystem */
> +	FSINFO_CAP_IS_FLASH_FS		= 2,	/* fs is flash filesystem */
> +	FSINFO_CAP_IS_NETWORK_FS	= 3,	/* fs is network filesystem */
> +	FSINFO_CAP_IS_AUTOMOUNTER_FS	= 4,	/* fs is automounter special filesystem */
> +	FSINFO_CAP_IS_MEMORY_FS		= 5,	/* fs is memory-based filesystem */
> +	FSINFO_CAP_AUTOMOUNTS		= 6,	/* fs supports automounts */
> +	FSINFO_CAP_ADV_LOCKS		= 7,	/* fs supports advisory file locking */
> +	FSINFO_CAP_MAND_LOCKS		= 8,	/* fs supports mandatory file locking */
> +	FSINFO_CAP_LEASES		= 9,	/* fs supports file leases */
> +	FSINFO_CAP_UIDS			= 10,	/* fs supports numeric uids */
> +	FSINFO_CAP_GIDS			= 11,	/* fs supports numeric gids */
> +	FSINFO_CAP_PROJIDS		= 12,	/* fs supports numeric project ids */
> +	FSINFO_CAP_STRING_USER_IDS	= 13,	/* fs supports string user identifiers */
> +	FSINFO_CAP_GUID_USER_IDS	= 14,	/* fs supports GUID user identifiers */
> +	FSINFO_CAP_WINDOWS_ATTRS	= 15,	/* fs has windows attributes */
> +	FSINFO_CAP_USER_QUOTAS		= 16,	/* fs has per-user quotas */
> +	FSINFO_CAP_GROUP_QUOTAS		= 17,	/* fs has per-group quotas */
> +	FSINFO_CAP_PROJECT_QUOTAS	= 18,	/* fs has per-project quotas */
> +	FSINFO_CAP_XATTRS		= 19,	/* fs has xattrs */
> +	FSINFO_CAP_JOURNAL		= 20,	/* fs has a journal */
> +	FSINFO_CAP_DATA_IS_JOURNALLED	= 21,	/* fs is using data journalling */
> +	FSINFO_CAP_O_SYNC		= 22,	/* fs supports O_SYNC */
> +	FSINFO_CAP_O_DIRECT		= 23,	/* fs supports O_DIRECT */
> +	FSINFO_CAP_VOLUME_ID		= 24,	/* fs has a volume ID */
> +	FSINFO_CAP_VOLUME_UUID		= 25,	/* fs has a volume UUID */
> +	FSINFO_CAP_VOLUME_NAME		= 26,	/* fs has a volume name */
> +	FSINFO_CAP_VOLUME_FSID		= 27,	/* fs has a volume FSID */
> +	FSINFO_CAP_IVER_ALL_CHANGE	= 28,	/* i_version represents data + meta changes */
> +	FSINFO_CAP_IVER_DATA_CHANGE	= 29,	/* i_version represents data changes only */
> +	FSINFO_CAP_IVER_MONO_INCR	= 30,	/* i_version incremented monotonically */
> +	FSINFO_CAP_DIRECTORIES		= 31,	/* fs supports (sub)directories */
> +	FSINFO_CAP_SYMLINKS		= 32,	/* fs supports symlinks */
> +	FSINFO_CAP_HARD_LINKS		= 33,	/* fs supports hard links */
> +	FSINFO_CAP_HARD_LINKS_1DIR	= 34,	/* fs supports hard links in same dir only */
> +	FSINFO_CAP_DEVICE_FILES		= 35,	/* fs supports bdev, cdev */
> +	FSINFO_CAP_UNIX_SPECIALS	= 36,	/* fs supports pipe, fifo, socket */
> +	FSINFO_CAP_RESOURCE_FORKS	= 37,	/* fs supports resource forks/streams */
> +	FSINFO_CAP_NAME_CASE_INDEP	= 38,	/* Filename case independence is mandatory */
> +	FSINFO_CAP_NAME_NON_UTF8	= 39,	/* fs has non-utf8 names */
> +	FSINFO_CAP_NAME_HAS_CODEPAGE	= 40,	/* fs has a filename codepage */
> +	FSINFO_CAP_SPARSE		= 41,	/* fs supports sparse files */
> +	FSINFO_CAP_NOT_PERSISTENT	= 42,	/* fs is not persistent */
> +	FSINFO_CAP_NO_UNIX_MODE		= 43,	/* fs does not support unix mode bits */
> +	FSINFO_CAP_HAS_ATIME		= 44,	/* fs supports access time */
> +	FSINFO_CAP_HAS_BTIME		= 45,	/* fs supports birth/creation time */
> +	FSINFO_CAP_HAS_CTIME		= 46,	/* fs supports change time */
> +	FSINFO_CAP_HAS_MTIME		= 47,	/* fs supports modification time */
> +	FSINFO_CAP__NR
> +};
> +
> +struct fsinfo_capabilities {
> +	__u8	capabilities[(FSINFO_CAP__NR + 7) / 8];
> +};
> +
> +struct fsinfo_timestamp_one {
> +	__s64	minimum;	/* Minimum timestamp value in seconds */
> +	__u64	maximum;	/* Maximum timestamp value in seconds */
> +	__u16	gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
> +	__s8	gran_exponent;
> +	__u8	reserved[5];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_timestamp_info).
> + */
> +struct fsinfo_timestamp_info {
> +	struct fsinfo_timestamp_one	atime;	/* Access time */
> +	struct fsinfo_timestamp_one	mtime;	/* Modification time */
> +	struct fsinfo_timestamp_one	ctime;	/* Change time */
> +	struct fsinfo_timestamp_one	btime;	/* Birth/creation time */
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_volume_uuid).
> + */
> +struct fsinfo_volume_uuid {
> +	__u8	uuid[16];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_fsinfo).
> + *
> + * This gives information about fsinfo() itself.
> + */
> +struct fsinfo_fsinfo {
> +	__u32	max_attr;	/* Number of supported attributes (fsinfo_attr__nr) */
> +	__u32	max_cap;	/* Number of supported capabilities (fsinfo_cap__nr) */
> +};
> +
> +#endif /* _UAPI_LINUX_FSINFO_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..93927072396c 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
>  COND_SYSCALL(io_uring_setup);
>  COND_SYSCALL(io_uring_enter);
>  COND_SYSCALL(io_uring_register);
> +COND_SYSCALL(fsinfo);
>  
>  /* fs/xattr.c */
>  
> diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
> index a3e4ffd4c773..d3cc8e9a4fd8 100644
> --- a/samples/vfs/Makefile
> +++ b/samples/vfs/Makefile
> @@ -1,10 +1,14 @@
>  # List of programs to build
>  hostprogs-y := \
> +	test-fsinfo \
>  	test-fsmount \
>  	test-statx
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
>  
> +HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
> +HOSTLDLIBS_test-fsinfo += -lm
> +
>  HOSTCFLAGS_test-fsmount.o += -I$(objtree)/usr/include
>  HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
> diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
> new file mode 100644
> index 000000000000..8cce1986df7e
> --- /dev/null
> +++ b/samples/vfs/test-fsinfo.c
> @@ -0,0 +1,551 @@
> +/* Test the fsinfo() system call
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define _GNU_SOURCE
> +#define _ATFILE_SOURCE

nit: Is this meant to implicitly define fsinfoat(), or what is it supposed
to do? If that's the case, wouldn't it be nicer to just declare fsinfoat()
explicitly?

> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <math.h>
> +#include <fcntl.h>
> +#include <sys/syscall.h>
> +#include <linux/fsinfo.h>
> +#include <linux/socket.h>
> +#include <sys/stat.h>
> +#include <arpa/inet.h>
> +
> +#ifndef __NR_fsinfo
> +#define __NR_fsinfo -1
> +#endif
> +
> +static bool debug = 0;
> +
> +static __attribute__((unused))
> +ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
> +	       void *buffer, size_t buf_size)
> +{
> +	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
> +}
> +
> +struct fsinfo_attr_info {
> +	unsigned char	type;
> +	unsigned char	flags;
> +	unsigned short	size;
> +};
> +
> +#define __FSINFO_STRUCT		0
> +#define __FSINFO_STRING		1
> +#define __FSINFO_OVER		2
> +#define __FSINFO_STRUCT_ARRAY	3
> +#define __FSINFO_0		0
> +#define __FSINFO_N		0x0001
> +#define __FSINFO_NM		0x0002
> +
> +#define _Z(T, F, S) { .type = __FSINFO_##T, .flags = __FSINFO_##F, .size = S }
> +#define FSINFO_STRING(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRING, 0, 0)
> +#define FSINFO_STRUCT(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, 0, sizeof(struct fsinfo_##Y))
> +#define FSINFO_STRING_N(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRING, N, 0)
> +#define FSINFO_STRUCT_N(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, N, sizeof(struct fsinfo_##Y))
> +#define FSINFO_STRING_NM(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRING, NM, 0)
> +#define FSINFO_STRUCT_NM(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, NM, sizeof(struct fsinfo_##Y))
> +#define FSINFO_OVERLARGE(X,Y)	 [FSINFO_ATTR_##X] = _Z(OVER, 0, 0)
> +#define FSINFO_STRUCT_ARRAY(X,Y) [FSINFO_ATTR_##X] = _Z(STRUCT_ARRAY, 0, sizeof(struct fsinfo_##Y))

See [1] above but here it's less of an issue since this is a test file.
I missed that in the first review. :)

> +
> +static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
> +	FSINFO_STRUCT		(STATFS,		statfs),
> +	FSINFO_STRUCT		(FSINFO,		fsinfo),
> +	FSINFO_STRUCT		(IDS,			ids),
> +	FSINFO_STRUCT		(LIMITS,		limits),
> +	FSINFO_STRUCT		(CAPABILITIES,		capabilities),
> +	FSINFO_STRUCT		(SUPPORTS,		supports),
> +	FSINFO_STRUCT		(TIMESTAMP_INFO,	timestamp_info),
> +	FSINFO_STRING		(VOLUME_ID,		volume_id),
> +	FSINFO_STRUCT		(VOLUME_UUID,		volume_uuid),
> +	FSINFO_STRING		(VOLUME_NAME,		volume_name),
> +	FSINFO_STRING		(NAME_ENCODING,		name_encoding),
> +	FSINFO_STRING		(NAME_CODEPAGE,		name_codepage),
> +};
> +
> +#define FSINFO_NAME(X,Y) [FSINFO_ATTR_##X] = #Y
> +static const char *fsinfo_attr_names[FSINFO_ATTR__NR] = {
> +	FSINFO_NAME		(STATFS,		statfs),
> +	FSINFO_NAME		(FSINFO,		fsinfo),
> +	FSINFO_NAME		(IDS,			ids),
> +	FSINFO_NAME		(LIMITS,		limits),
> +	FSINFO_NAME		(CAPABILITIES,		capabilities),
> +	FSINFO_NAME		(SUPPORTS,		supports),
> +	FSINFO_NAME		(TIMESTAMP_INFO,	timestamp_info),
> +	FSINFO_NAME		(VOLUME_ID,		volume_id),
> +	FSINFO_NAME		(VOLUME_UUID,		volume_uuid),
> +	FSINFO_NAME		(VOLUME_NAME,		volume_name),
> +	FSINFO_NAME		(NAME_ENCODING,		name_encoding),
> +	FSINFO_NAME		(NAME_CODEPAGE,		name_codepage),
> +};
> +
> +union reply {
> +	char buffer[4096];
> +	struct fsinfo_statfs statfs;
> +	struct fsinfo_fsinfo fsinfo;
> +	struct fsinfo_ids ids;
> +	struct fsinfo_limits limits;
> +	struct fsinfo_supports supports;
> +	struct fsinfo_capabilities caps;
> +	struct fsinfo_timestamp_info timestamps;
> +	struct fsinfo_volume_uuid uuid;
> +};
> +
> +static void dump_hex(unsigned int *data, int from, int to)
> +{
> +	unsigned offset, print_offset = 1, col = 0;
> +
> +	from /= 4;
> +	to = (to + 3) / 4;
> +
> +	for (offset = from; offset < to; offset++) {
> +		if (print_offset) {
> +			printf("%04x: ", offset * 8);
> +			print_offset = 0;
> +		}
> +		printf("%08x", data[offset]);
> +		col++;
> +		if ((col & 3) == 0) {
> +			printf("\n");
> +			print_offset = 1;
> +		} else {
> +			printf(" ");
> +		}
> +	}
> +
> +	if (!print_offset)
> +		printf("\n");
> +}
> +
> +static void dump_attr_STATFS(union reply *r, int size)
> +{
> +	struct fsinfo_statfs *f = &r->statfs;
> +
> +	printf("\n");
> +	printf("\tblocks: n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_blocks.lo,
> +	       (unsigned long long)f->f_bfree.lo,
> +	       (unsigned long long)f->f_bavail.lo);
> +
> +	printf("\tfiles : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_files.lo,
> +	       (unsigned long long)f->f_ffree.lo,
> +	       (unsigned long long)f->f_favail.lo);
> +	printf("\tbsize : %llu\n", f->f_bsize);
> +	printf("\tfrsize: %llu\n", f->f_frsize);
> +	printf("\tmntfl : %llx\n", (unsigned long long)f->mnt_attrs);
> +}
> +
> +static void dump_attr_FSINFO(union reply *r, int size)
> +{
> +	struct fsinfo_fsinfo *f = &r->fsinfo;
> +
> +	printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
> +}
> +
> +static void dump_attr_IDS(union reply *r, int size)
> +{
> +	struct fsinfo_ids *f = &r->ids;
> +
> +	printf("\n");
> +	printf("\tdev   : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
> +	printf("\tfs    : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
> +	printf("\tfsid  : %llx\n", (unsigned long long)f->f_fsid);
> +}
> +
> +static void dump_attr_LIMITS(union reply *r, int size)
> +{
> +	struct fsinfo_limits *f = &r->limits;
> +
> +	printf("\n");
> +	printf("\tmax file size: %llx%016llx\n",
> +	       (unsigned long long)f->max_file_size.hi,
> +	       (unsigned long long)f->max_file_size.lo);
> +	printf("\tmax ino:       %llx%016llx\n",
> +	       (unsigned long long)f->max_ino.hi,
> +	       (unsigned long long)f->max_ino.lo);
> +	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
> +	       (unsigned long long)f->max_uid,
> +	       (unsigned long long)f->max_gid,
> +	       (unsigned long long)f->max_projid);
> +	printf("\tmax dev      : maj=%x min=%x\n",
> +	       f->max_dev_major, f->max_dev_minor);
> +	printf("\tmax links    : %llx\n",
> +	       (unsigned long long)f->max_hard_links);
> +	printf("\tmax xattr    : n=%x b=%llx\n",
> +	       f->max_xattr_name_len,
> +	       (unsigned long long)f->max_xattr_body_len);
> +	printf("\tmax len      : file=%x sym=%x\n",
> +	       f->max_filename_len, f->max_symlink_len);
> +}
> +
> +static void dump_attr_SUPPORTS(union reply *r, int size)
> +{
> +	struct fsinfo_supports *f = &r->supports;
> +
> +	printf("\n");
> +	printf("\tstx_attr=%llx\n", (unsigned long long)f->stx_attributes);
> +	printf("\tstx_mask=%x\n", f->stx_mask);
> +	printf("\tioc_flags=%x\n", f->ioc_flags);
> +	printf("\twin_fattrs=%x\n", f->win_file_attrs);
> +}
> +
> +#define FSINFO_CAP_NAME(C) [FSINFO_CAP_##C] = #C
> +static const char *fsinfo_cap_names[FSINFO_CAP__NR] = {
> +	FSINFO_CAP_NAME(IS_KERNEL_FS),
> +	FSINFO_CAP_NAME(IS_BLOCK_FS),
> +	FSINFO_CAP_NAME(IS_FLASH_FS),
> +	FSINFO_CAP_NAME(IS_NETWORK_FS),
> +	FSINFO_CAP_NAME(IS_AUTOMOUNTER_FS),
> +	FSINFO_CAP_NAME(IS_MEMORY_FS),
> +	FSINFO_CAP_NAME(AUTOMOUNTS),
> +	FSINFO_CAP_NAME(ADV_LOCKS),
> +	FSINFO_CAP_NAME(MAND_LOCKS),
> +	FSINFO_CAP_NAME(LEASES),
> +	FSINFO_CAP_NAME(UIDS),
> +	FSINFO_CAP_NAME(GIDS),
> +	FSINFO_CAP_NAME(PROJIDS),
> +	FSINFO_CAP_NAME(STRING_USER_IDS),
> +	FSINFO_CAP_NAME(GUID_USER_IDS),
> +	FSINFO_CAP_NAME(WINDOWS_ATTRS),
> +	FSINFO_CAP_NAME(USER_QUOTAS),
> +	FSINFO_CAP_NAME(GROUP_QUOTAS),
> +	FSINFO_CAP_NAME(PROJECT_QUOTAS),
> +	FSINFO_CAP_NAME(XATTRS),
> +	FSINFO_CAP_NAME(JOURNAL),
> +	FSINFO_CAP_NAME(DATA_IS_JOURNALLED),
> +	FSINFO_CAP_NAME(O_SYNC),
> +	FSINFO_CAP_NAME(O_DIRECT),
> +	FSINFO_CAP_NAME(VOLUME_ID),
> +	FSINFO_CAP_NAME(VOLUME_UUID),
> +	FSINFO_CAP_NAME(VOLUME_NAME),
> +	FSINFO_CAP_NAME(VOLUME_FSID),
> +	FSINFO_CAP_NAME(IVER_ALL_CHANGE),
> +	FSINFO_CAP_NAME(IVER_DATA_CHANGE),
> +	FSINFO_CAP_NAME(IVER_MONO_INCR),
> +	FSINFO_CAP_NAME(DIRECTORIES),
> +	FSINFO_CAP_NAME(SYMLINKS),
> +	FSINFO_CAP_NAME(HARD_LINKS),
> +	FSINFO_CAP_NAME(HARD_LINKS_1DIR),
> +	FSINFO_CAP_NAME(DEVICE_FILES),
> +	FSINFO_CAP_NAME(UNIX_SPECIALS),
> +	FSINFO_CAP_NAME(RESOURCE_FORKS),
> +	FSINFO_CAP_NAME(NAME_CASE_INDEP),
> +	FSINFO_CAP_NAME(NAME_NON_UTF8),
> +	FSINFO_CAP_NAME(NAME_HAS_CODEPAGE),
> +	FSINFO_CAP_NAME(SPARSE),
> +	FSINFO_CAP_NAME(NOT_PERSISTENT),
> +	FSINFO_CAP_NAME(NO_UNIX_MODE),
> +	FSINFO_CAP_NAME(HAS_ATIME),
> +	FSINFO_CAP_NAME(HAS_BTIME),
> +	FSINFO_CAP_NAME(HAS_CTIME),
> +	FSINFO_CAP_NAME(HAS_MTIME),
> +};
> +
> +static void dump_attr_CAPABILITIES(union reply *r, int size)
> +{
> +	struct fsinfo_capabilities *f = &r->caps;
> +	int i;
> +
> +	for (i = 0; i < sizeof(f->capabilities); i++)
> +		printf("%02x", f->capabilities[i]);
> +	printf("\n");
> +	for (i = 0; i < FSINFO_CAP__NR; i++)
> +		if (f->capabilities[i / 8] & (1 << (i % 8)))
> +			printf("\t- %s\n", fsinfo_cap_names[i]);
> +}
> +
> +static void print_time(struct fsinfo_timestamp_one *t, char stamp)
> +{
> +	printf("\t%ctime : gran=%gs range=%llx-%llx\n",
> +	       stamp,
> +	       t->gran_mantissa * pow(10., t->gran_exponent),
> +	       (long long)t->minimum,
> +	       (long long)t->maximum);
> +}
> +
> +static void dump_attr_TIMESTAMP_INFO(union reply *r, int size)
> +{
> +	struct fsinfo_timestamp_info *f = &r->timestamps;
> +
> +	printf("\n");
> +	print_time(&f->atime, 'a');
> +	print_time(&f->mtime, 'm');
> +	print_time(&f->ctime, 'c');
> +	print_time(&f->btime, 'b');
> +}
> +
> +static void dump_attr_VOLUME_UUID(union reply *r, int size)
> +{
> +	struct fsinfo_volume_uuid *f = &r->uuid;
> +
> +	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
> +	       "-%02x%02x%02x%02x%02x%02x\n",
> +	       f->uuid[ 0], f->uuid[ 1],
> +	       f->uuid[ 2], f->uuid[ 3],
> +	       f->uuid[ 4], f->uuid[ 5],
> +	       f->uuid[ 6], f->uuid[ 7],
> +	       f->uuid[ 8], f->uuid[ 9],
> +	       f->uuid[10], f->uuid[11],
> +	       f->uuid[12], f->uuid[13],
> +	       f->uuid[14], f->uuid[15]);
> +}
> +
> +/*
> + *
> + */
> +typedef void (*dumper_t)(union reply *r, int size);
> +
> +#define FSINFO_DUMPER(N) [FSINFO_ATTR_##N] = dump_attr_##N
> +static const dumper_t fsinfo_attr_dumper[FSINFO_ATTR__NR] = {
> +	FSINFO_DUMPER(STATFS),
> +	FSINFO_DUMPER(FSINFO),
> +	FSINFO_DUMPER(IDS),
> +	FSINFO_DUMPER(LIMITS),
> +	FSINFO_DUMPER(SUPPORTS),
> +	FSINFO_DUMPER(CAPABILITIES),
> +	FSINFO_DUMPER(TIMESTAMP_INFO),
> +	FSINFO_DUMPER(VOLUME_UUID),
> +};
> +
> +static void dump_fsinfo(enum fsinfo_attribute attr,
> +			struct fsinfo_attr_info about,
> +			union reply *r, int size)
> +{
> +	dumper_t dumper = fsinfo_attr_dumper[attr];
> +	unsigned int len;
> +
> +	if (!dumper) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	len = about.size;
> +	if (about.type == __FSINFO_STRUCT && size < len) {
> +		printf("<short data %u/%u>\n", size, len);
> +		return;
> +	}
> +
> +	dumper(r, size);
> +}
> +
> +/*
> + * Try one subinstance of an attribute.
> + */
> +static int try_one(const char *file, struct fsinfo_params *params, bool raw)
> +{
> +	struct fsinfo_attr_info about;
> +	union reply *r;
> +	size_t buf_size = 4096;
> +	char *p;
> +	int ret;
> +
> +	for (;;) {
> +		r = malloc(buf_size);
> +		if (!r) {
> +			perror("malloc");
> +			exit(1);
> +		}
> +		memset(r->buffer, 0xbd, buf_size);
> +
> +		errno = 0;
> +		ret = fsinfo(AT_FDCWD, file, params, r->buffer, buf_size);
> +		if (params->request >= FSINFO_ATTR__NR) {
> +			if (ret == -1 && errno == EOPNOTSUPP)
> +				exit(0);
> +			fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
> +				params->request);
> +			exit(1);
> +		}
> +		if (ret == -1)
> +			break;
> +
> +		if (ret <= buf_size)
> +			break;
> +		buf_size = (ret + 4096 - 1) & ~(4096 - 1);
> +	}
> +
> +	if (debug)
> +		printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
> +		       file, fsinfo_attr_names[params->request],
> +		       params->Nth, params->Mth, ret);
> +
> +	about = fsinfo_buffer_info[params->request];
> +	if (ret == -1) {
> +		if (errno == ENODATA) {
> +			if (!(about.flags & (__FSINFO_N | __FSINFO_NM)) &&
> +			    params->Nth == 0 && params->Mth == 0) {
> +				fprintf(stderr,
> +					"Unexpected ENODATA (%u[%u][%u])\n",
> +					params->request, params->Nth, params->Mth);
> +				exit(1);
> +			}
> +			return (params->Mth == 0) ? 2 : 1;
> +		}
> +		if (errno == EOPNOTSUPP) {
> +			if (params->Nth > 0 || params->Mth > 0) {
> +				fprintf(stderr,
> +					"Should return -ENODATA (%u[%u][%u])\n",
> +					params->request, params->Nth, params->Mth);
> +				exit(1);
> +			}
> +			//printf("\e[33m%s\e[m: <not supported>\n",
> +			//       fsinfo_attr_names[attr]);
> +			return 2;
> +		}
> +		perror(file);
> +		exit(1);
> +	}
> +
> +	if (raw) {
> +		if (ret > 4096)
> +			ret = 4096;
> +		dump_hex((unsigned int *)r->buffer, 0, ret);
> +		return 0;
> +	}
> +
> +	switch (about.flags & (__FSINFO_N | __FSINFO_NM)) {
> +	case 0:
> +		printf("\e[33m%s\e[m: ",
> +		       fsinfo_attr_names[params->request]);
> +		break;
> +	case __FSINFO_N:
> +		printf("\e[33m%s[%u]\e[m: ",
> +		       fsinfo_attr_names[params->request],
> +		       params->Nth);
> +		break;
> +	case __FSINFO_NM:
> +		printf("\e[33m%s[%u][%u]\e[m: ",
> +		       fsinfo_attr_names[params->request],
> +		       params->Nth, params->Mth);
> +		break;
> +	}
> +
> +	switch (about.type) {
> +	case __FSINFO_STRUCT:
> +		dump_fsinfo(params->request, about, r, ret);
> +		return 0;
> +
> +	case __FSINFO_STRING:
> +		if (ret >= 4096) {
> +			ret = 4096;
> +			r->buffer[4092] = '.';
> +			r->buffer[4093] = '.';
> +			r->buffer[4094] = '.';
> +			r->buffer[4095] = 0;
> +		} else {
> +			r->buffer[ret] = 0;
> +		}
> +		for (p = r->buffer; *p; p++) {
> +			if (!isprint(*p)) {
> +				printf("<non-printable>\n");
> +				continue;
> +			}
> +		}
> +		printf("%s\n", r->buffer);
> +		return 0;
> +
> +	case __FSINFO_OVER:
> +		return 0;
> +
> +	case __FSINFO_STRUCT_ARRAY:
> +		dump_fsinfo(params->request, about, r, ret);
> +		return 0;
> +
> +	default:
> +		fprintf(stderr, "Fishy about %u %u,%u,%u\n",
> +			params->request, about.type, about.flags, about.size);
> +		exit(1);
> +	}
> +}
> +
> +/*
> + *
> + */
> +int main(int argc, char **argv)
> +{
> +	struct fsinfo_params params = {
> +		.at_flags = AT_SYMLINK_NOFOLLOW,
> +	};
> +	unsigned int attr;
> +	int raw = 0, opt, Nth, Mth;
> +
> +	while ((opt = getopt(argc, argv, "adlr"))) {
> +		switch (opt) {
> +		case 'a':
> +			params.at_flags |= AT_NO_AUTOMOUNT;
> +			continue;
> +		case 'd':
> +			debug = true;
> +			continue;
> +		case 'l':
> +			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
> +			continue;
> +		case 'r':
> +			raw = 1;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	argc -= optind;
> +	argv += optind;
> +
> +	if (argc != 1) {
> +		printf("Format: test-fsinfo [-alr] <file>\n");
> +		exit(2);
> +	}
> +
> +	for (attr = 0; attr <= FSINFO_ATTR__NR; attr++) {
> +		Nth = 0;
> +		do {
> +			Mth = 0;
> +			do {
> +				params.request = attr;
> +				params.Nth = Nth;
> +				params.Mth = Mth;
> +
> +				switch (try_one(argv[0], &params, raw)) {
> +				case 0:
> +					continue;
> +				case 1:
> +					goto done_M;
> +				case 2:
> +					goto done_N;
> +				}
> +			} while (++Mth < 100);
> +
> +		done_M:
> +			if (Mth >= 100) {
> +				fprintf(stderr, "Fishy: Mth == %u\n", Mth);
> +				break;
> +			}
> +
> +		} while (++Nth < 100);
> +
> +	done_N:
> +		if (Nth >= 100) {
> +			fprintf(stderr, "Fishy: Nth == %u\n", Nth);
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> 

* Re: [PATCH v4 2/3] initramfs: read metadata from special file METADATA!!!
From: Mimi Zohar @ 2019-07-01 12:54 UTC (permalink / raw)
  To: Roberto Sassu, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98
In-Reply-To: <20190523121803.21638-3-roberto.sassu@huawei.com>

Hi Roberto,

> diff --git a/init/initramfs.c b/init/initramfs.c
> index 5de396a6aac0..862c03123de8 100644
> --- a/init/initramfs.c
> +++ b/init/initramfs.c

> +static int __init do_process_metadata(char *buf, int len, bool last)
> +{

Part of the problem in upstreaming CPIO xattr support has been the
difficulty in reading and understanding the initramfs code due to a
lack of comments.  At least for any new code, let's add some comments
to simplify the review.  In this case, understanding "last", before
reading the code, would help.
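
As a rough illustration of the level of commenting meant here, a
userspace sketch of a chunk accumulator with "last" documented up
front. All names (chunk_acc, acc_add_chunk, the done callback) are
made up for the example, and it is not a proposed replacement for the
initramfs code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator state; the patch keeps this in file-scope variables. */
struct chunk_acc {
	char	*buf;	/* start of the buffer, NULL until the first chunk */
	char	*pos;	/* where the next chunk will be copied */
	size_t	total;	/* expected size of the whole body */
};

/*
 * acc_add_chunk - append one chunk of a body that arrives in pieces.
 * @acc:   accumulator state, zero-initialised before the first call
 * @total: full body size, used to allocate the buffer on the first call
 * @chunk: data for this call
 * @len:   length of @chunk
 * @last:  true when @chunk completes the body; @done (if not NULL) is then
 *         run on the whole buffer and the state is released
 * @done:  optional consumer of the completed body
 *
 * Returns 0 on success, -ENOMEM or -EINVAL on failure.  The buffer is freed
 * on failure and after the last chunk, so @acc can be reused afterwards.
 */
int acc_add_chunk(struct chunk_acc *acc, size_t total,
		  const char *chunk, size_t len, bool last,
		  void (*done)(const char *buf, size_t len))
{
	int ret = 0;

	if (!acc->buf) {
		acc->buf = acc->pos = malloc(total);
		if (!acc->buf) {
			ret = -ENOMEM;
			goto out;
		}
		acc->total = total;
	}

	/* Refuse a chunk that would run past the end of the buffer. */
	if (len > acc->total - (size_t)(acc->pos - acc->buf)) {
		ret = -EINVAL;
		goto out;
	}

	memcpy(acc->pos, chunk, len);
	acc->pos += len;

	if (last && done)
		done(acc->buf, (size_t)(acc->pos - acc->buf));
out:
	if (ret < 0 || last) {
		free(acc->buf);
		acc->buf = acc->pos = NULL;
		acc->total = 0;
	}
	return ret;
}
```

With a header comment like that, a reader knows what "last" triggers
before reaching the body.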

Mimi

> +	int ret = 0;
> +
> +	if (!metadata_buf) {
> +		metadata_buf_ptr = metadata_buf = kmalloc(body_len, GFP_KERNEL);
> +		if (!metadata_buf_ptr) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		metadata_len = body_len;
> +	}
> +
> +	if (metadata_buf_ptr + len > metadata_buf + metadata_len) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	memcpy(metadata_buf_ptr, buf, len);
> +	metadata_buf_ptr += len;
> +
> +	if (last)
> +		do_parse_metadata(previous_name_buf);
> +out:
> +	if (ret < 0 || last) {
> +		kfree(metadata_buf);
> +		metadata_buf = NULL;
> +		metadata = 0;
> +	}
> +
> +	return ret;
> +}
> +
>  static int __init do_copy(void)
>  {
>  	if (byte_count >= body_len) {
>  		if (xwrite(wfd, victim, body_len) != body_len)
>  			error("write error");
> +		if (metadata)
> +			do_process_metadata(victim, body_len, true);
>  		ksys_close(wfd);
>  		do_utime(vcollected, mtime);
>  		kfree(vcollected);
> @@ -458,6 +500,8 @@ static int __init do_copy(void)
>  	} else {
>  		if (xwrite(wfd, victim, byte_count) != byte_count)
>  			error("write error");
> +		if (metadata)
> +			do_process_metadata(victim, byte_count, false);
>  		body_len -= byte_count;
>  		eat(byte_count);
>  		return 1;
> 

* Re: [PATCH 01/11] vfs: syscall: Add fsinfo() to query filesystem information [ver #15]
From: David Howells @ 2019-07-01 13:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: dhowells, viro, raven, mszeredi, linux-api, linux-fsdevel,
	linux-kernel, Rasmus Villemoes
In-Reply-To: <20190701104048.c2t5aful2sabngmr@brauner.io>

Christian Brauner <christian@brauner.io> wrote:

> > +config FSINFO
> 
> Hm, any reason why we would hide that syscall under a config option?

Rasmus Villemoes asked for it to be made conditional.

https://lore.kernel.org/lkml/f3646774-ee9e-d5b7-8a11-670012034d59@rasmusvillemoes.dk/

> Do we not have any dumb helpers for scenarios like this?:
> 
> #define strlen_literal(x) (sizeof(""x"") - 1)
> #define strlen_array(x) (sizeof(x) - 1)

git grep doesn't find them under this name.
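
For reference, a minimal sketch of what the two suggested helpers would compute. The macro names are the ones proposed in the review, not existing kernel macros, and the demo_* functions are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Proposed helpers from the review; not existing kernel macros. */
#define strlen_literal(x) (sizeof("" x "") - 1)	/* only accepts string literals */
#define strlen_array(x)   (sizeof(x) - 1)	/* x must be a char array, not a pointer */

/* Hypothetical demo: length of a literal, computed at compile time. */
static size_t demo_literal_len(void)
{
	return strlen_literal("fsinfo");
}

/* Hypothetical demo: length of a NUL-terminated char array. */
static size_t demo_array_len(void)
{
	static const char name[] = "capabilities";

	assert(strlen(name) == strlen_array(name));
	return strlen_array(name);
}
```

The empty-string concatenation in strlen_literal() makes the macro fail to compile for anything other than a string literal, which is the point of having two variants.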

> > +	while (!signal_pending(current)) {
> > +		params->usage = 0;
> > +		ret = fsinfo(path, params);
> > +		if (IS_ERR_VALUE((long)ret))
> > +			return ret; /* Error */
> > +		if ((unsigned int)ret <= params->buf_size)
> 
> if ((size_t)ret ...? Just for the sake of clarity if for nothing else.
> 
> > +			return ret; /* It fitted */
> 
> Ok, a little confused here, tbh. params->buf_size is size_t

It's "unsigned int".

> and this function returns an int. Forgot whether you mentioned this before,
> buf_size can't exceed INT_MAX?

It's mentioned in the documentation (ie. fsinfo.rst).  I'll mention it in the
comments adjacent to the attribute definition table also.
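
A rough userspace sketch of the grow-and-retry pattern under discussion, assuming a provider that returns the full size needed even when the buffer is too small. fake_fill() and query_with_retry() are hypothetical names, not part of the patch; the point is only that sizes held in an int can share the return value with negative error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical provider: copies what fits, returns the full size needed. */
static int fake_fill(char *buf, unsigned int buf_size)
{
	static const char data[] = "example-volume-name";
	unsigned int need = sizeof(data);

	memcpy(buf, data, need < buf_size ? need : buf_size);
	return (int)need;
}

/*
 * Retry with a larger buffer until the reported size fits.  On success
 * the caller owns *out and must free() it.
 */
static int query_with_retry(char **out)
{
	unsigned int buf_size = 8;	/* deliberately too small at first */

	for (;;) {
		char *buf = malloc(buf_size);
		int ret;

		if (!buf)
			return -ENOMEM;
		ret = fake_fill(buf, buf_size);
		if (ret >= 0 && (unsigned int)ret <= buf_size) {
			*out = buf;
			return ret;	/* it fitted */
		}
		free(buf);
		if (ret < 0)
			return ret;	/* error */
		buf_size = (unsigned int)ret;	/* retry with the reported size */
	}
}
```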

> Is it really worth it to have this code generating stuff in there?

From a readability PoV, yes, tabulation is awesome, IMO;-).  Up to 5 lines per
attribute is too much vertical space and expanding it makes the whole thing
much less readable.  Add to that that not all attributes will be the same
number of lines.

It would be easier if I could get away with making the constant names
lower case, but the thou-shalt-capitalise-constantists dislike that, so, given
that I don't know of a way to make the C preprocessor change the case of a
symbol, I have to include both parts.

I have four pieces of information: type, depth, constant name, struct name (if
applicable), and I can fit them on one line this way.

You really find this:

static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
	[FSINFO_ATTR_STATFS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_statfs)
	},
	[FSINFO_ATTR_FSINFO] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_fsinfo)
	},
	[FSINFO_ATTR_IDS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_ids)
	},
	[FSINFO_ATTR_LIMITS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_limits)
	},
	[FSINFO_ATTR_CAPABILITIES] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_capabilities)
	},
	[FSINFO_ATTR_SUPPORTS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_supports)
	},
	[FSINFO_ATTR_TIMESTAMP_INFO] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_timestamp_info)
	},
	[FSINFO_ATTR_VOLUME_ID] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_VOLUME_UUID] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_volume_uuid)
	},
	[FSINFO_ATTR_VOLUME_NAME] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_NAME_ENCODING] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_NAME_CODEPAGE] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_PARAM_DESCRIPTION] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_param_description)
	},
	[FSINFO_ATTR_PARAM_SPECIFICATION] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_N,
		.size	= sizeof(struct fsinfo_param_specification)
	},
	[FSINFO_ATTR_PARAM_ENUM] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_N,
		.size	= sizeof(struct fsinfo_param_enum)
	},
	[FSINFO_ATTR_PARAMETERS] = {
		.type	= __FSINFO_OPAQUE,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_LSM_PARAMETERS] = {
		.type	= __FSINFO_OPAQUE,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_SERVER_NAME] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_N,
	},
	[FSINFO_ATTR_SERVER_ADDRESS] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_NM,
		.size	= sizeof(struct fsinfo_server_address)
	},
	[FSINFO_ATTR_AFS_CELL_NAME] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_MOUNT_INFO] = {
		.type	= __FSINFO_STRUCT,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_mount_info)
	},
	[FSINFO_ATTR_MOUNT_DEVNAME] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_SINGLE,
	},
	[FSINFO_ATTR_MOUNT_CHILDREN] = {
		.type	= __FSINFO_STRUCT_ARRAY,
		.flags	= __FSINFO_SINGLE,
		.size	= sizeof(struct fsinfo_mount_child)
	},
	[FSINFO_ATTR_MOUNT_SUBMOUNT] = {
		.type	= __FSINFO_STRING,
		.flags	= __FSINFO_N,
	},
};

is easier to read than this?:

static const struct fsinfo_attr_info fsinfo_buffer_info[FSINFO_ATTR__NR] = {
	FSINFO_STRUCT		(STATFS,		statfs),
	FSINFO_STRUCT		(FSINFO,		fsinfo),
	FSINFO_STRUCT		(IDS,			ids),
	FSINFO_STRUCT		(LIMITS,		limits),
	FSINFO_STRUCT		(CAPABILITIES,		capabilities),
	FSINFO_STRUCT		(SUPPORTS,		supports),
	FSINFO_STRUCT		(TIMESTAMP_INFO,	timestamp_info),
	FSINFO_STRING		(VOLUME_ID),
	FSINFO_STRUCT		(VOLUME_UUID,		volume_uuid),
	FSINFO_STRING		(VOLUME_NAME),
	FSINFO_STRING		(NAME_ENCODING),
	FSINFO_STRING		(NAME_CODEPAGE),
	FSINFO_STRUCT		(PARAM_DESCRIPTION,	param_description),
	FSINFO_STRUCT_N		(PARAM_SPECIFICATION,	param_specification),
	FSINFO_STRUCT_N		(PARAM_ENUM,		param_enum),
	FSINFO_OPAQUE		(PARAMETERS),
	FSINFO_OPAQUE		(LSM_PARAMETERS),
	FSINFO_STRING_N		(SERVER_NAME),
	FSINFO_STRUCT_NM	(SERVER_ADDRESS,	server_address),
	FSINFO_STRING		(AFS_CELL_NAME),
	FSINFO_STRUCT		(MOUNT_INFO,		mount_info),
	FSINFO_STRING		(MOUNT_DEVNAME),
	FSINFO_STRUCT_ARRAY	(MOUNT_CHILDREN,	mount_child),
	FSINFO_STRING_N		(MOUNT_SUBMOUNT),
};

The latter also has the advantage that I can take this and drop it into the
test program and change the helper macros to make it do other things.  With
the fully expanded code, that isn't possible.
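
As an illustration of that reuse, here is a cut-down sketch in which the same rows are expanded twice with different helper definitions. Everything in it (the ATTR_* enum, the info_* structs, the INFO_* helpers and the ATTR_TABLE list macro) is hypothetical and much smaller than the real table; wrapping the rows in a list macro is just one way to expand them twice in a single file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down attribute set for illustration only. */
enum { ATTR_STATFS, ATTR_VOLUME_ID, ATTR_IDS, ATTR__NR };
struct info_statfs { long f_blocks, f_bfree; };
struct info_ids { unsigned int dev_major, dev_minor; };

/* One row per attribute; the meaning of each helper is chosen later. */
#define ATTR_TABLE \
	INFO_STRUCT(STATFS, statfs), \
	INFO_STRING(VOLUME_ID), \
	INFO_STRUCT(IDS, ids),

/* First expansion: per-attribute object sizes (0 means variable length). */
#define INFO_STRUCT(X, Y) [ATTR_##X] = sizeof(struct info_##Y)
#define INFO_STRING(X)    [ATTR_##X] = 0
static const size_t attr_size[ATTR__NR] = { ATTR_TABLE };
#undef INFO_STRUCT
#undef INFO_STRING

/* Second expansion of the same rows: printable names for a test program. */
#define INFO_STRUCT(X, Y) [ATTR_##X] = #X
#define INFO_STRING(X)    [ATTR_##X] = #X
static const char *const attr_name[ATTR__NR] = { ATTR_TABLE };
#undef INFO_STRUCT
#undef INFO_STRING
```

With the fully expanded initialisers, producing attr_name[] would mean maintaining a second hand-written table in step with the first.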

One thing I will grant you, though, I can simplify:

#define __FSINFO_STRUCT		0
#define __FSINFO_STRING		1
#define __FSINFO_OPAQUE		2
#define __FSINFO_STRUCT_ARRAY	3
#define __FSINFO_0		0
#define __FSINFO_N		0x0001
#define __FSINFO_NM		0x0002

#define _Z(T, F, S) { .type = __FSINFO_##T, .flags = __FSINFO_##F, .size = S }
#define FSINFO_STRING(X)	 [FSINFO_ATTR_##X] = _Z(STRING, 0, 0)
#define FSINFO_STRUCT(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, 0, sizeof(struct fsinfo_##Y))
#define FSINFO_STRING_N(X)	 [FSINFO_ATTR_##X] = _Z(STRING, N, 0)
#define FSINFO_STRUCT_N(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, N, sizeof(struct fsinfo_##Y))
#define FSINFO_STRING_NM(X)	 [FSINFO_ATTR_##X] = _Z(STRING, NM, 0)
#define FSINFO_STRUCT_NM(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, NM, sizeof(struct fsinfo_##Y))
#define FSINFO_OPAQUE(X)	 [FSINFO_ATTR_##X] = _Z(OPAQUE, 0, 0)
#define FSINFO_STRUCT_ARRAY(X,Y) [FSINFO_ATTR_##X] = _Z(STRUCT_ARRAY, 0, sizeof(struct fsinfo_##Y))

a bit:

#define __FSINFO_STRUCT		0
#define __FSINFO_STRING		1
#define __FSINFO_OPAQUE		2
#define __FSINFO_STRUCT_ARRAY	3
#define __FSINFO_N		0x01
#define __FSINFO_NM		0x02

#define _Z(T, S)    { .type = __FSINFO_##T, .flags = 0,		  .size = S }
#define _Z_N(T, S)  { .type = __FSINFO_##T, .flags = __FSINFO_N,  .size = S }
#define _Z_NM(T, S) { .type = __FSINFO_##T, .flags = __FSINFO_NM, .size = S }
#define FSINFO_STRING(X)	 [FSINFO_ATTR_##X] = _Z(STRING, 0)
#define FSINFO_STRUCT(X,Y)	 [FSINFO_ATTR_##X] = _Z(STRUCT, sizeof(struct fsinfo_##Y))
#define FSINFO_STRING_N(X)	 [FSINFO_ATTR_##X] = _Z_N(STRING, 0)
#define FSINFO_STRUCT_N(X,Y)	 [FSINFO_ATTR_##X] = _Z_N(STRUCT, sizeof(struct fsinfo_##Y))
#define FSINFO_STRING_NM(X)	 [FSINFO_ATTR_##X] = _Z_NM(STRING, 0)
#define FSINFO_STRUCT_NM(X,Y)	 [FSINFO_ATTR_##X] = _Z_NM(STRUCT, sizeof(struct fsinfo_##Y))
#define FSINFO_OPAQUE(X)	 [FSINFO_ATTR_##X] = _Z(OPAQUE, 0)
#define FSINFO_STRUCT_ARRAY(X,Y) [FSINFO_ATTR_##X] = _Z(STRUCT_ARRAY, sizeof(struct fsinfo_##Y))

> I urge you to think about git grep users. For them this is an absolute
> nightmare. :)

That's a valid point, but it's a problem all over the kernel.  We use
macroisation everywhere.  See all the declaration and define macros that nest
layers deep.

If that's your main worry, the attribute type name could be fully expanded in
the table, eg.:

	FSINFO_STRUCT		(FSINFO_ATTR_CAPABILITIES,	capabilities),
	FSINFO_STRING_N		(FSINFO_ATTR_MOUNT_SUBMOUNT),

> > +	unsigned int result_size;
> 
> Wouldn't it be better if this could be a size_t?

Why?  size_t takes more space on a 64-bit system, but I'm not allowing the
filesystem to return that much data, mainly because I don't really want to be
allocating a >2G buffer.

In fact, for large objects there's something to be said for writing directly
to userspace rather than going through a buffer, but for the fact that I want
to hold, say, the RCU readlock across the entire transaction in some
instances.

> > +	if (!user_buffer || !user_buf_size) {
> 
> Maybe we could be a little more strict and require both be set to their
> respective zero values, i.e. only support reporting the size if
> !user_buffer && user_buf_size = 0 for that to work. If only one of them
> is set to their zero value we report EINVAL.

That's an option, certainly.
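
A sketch of what the stricter check could look like, assuming the semantics suggested above. classify_user_buffer() and its return convention are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: classify the (buffer, size) pair.
 * Returns 1 for a size-only query, 0 for a normal read, -EINVAL otherwise.
 */
static int classify_user_buffer(const void *user_buffer, size_t user_buf_size)
{
	if (!user_buffer && user_buf_size == 0)
		return 1;		/* report the required size only */
	if (!user_buffer || user_buf_size == 0)
		return -EINVAL;		/* exactly one of the two is zero */
	return 0;			/* copy data into the buffer */
}
```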

> Hm, I'm not sure that "capabilities" is a good name here. This is
> potentially misleading because of other uses of "capabilities" we
> already have. Like, I don't want these capabilities to pop up when I do
> git grep capabilities. Just a short way until someone also speaks of
> "fscaps" or "fsinfocaps" and then confusion is basically guaranteed. :)
> 
> Maybe "features" would be better?

Yeah - that's probably better.  The only issue is that it doesn't have a nice
short hypocoristicon like "cap", though I could use "feat" I guess.

> > +#define _ATFILE_SOURCE
> 
> nit: Defining fsinfoat() implicitly or what's that supposed to do? If that's
> the case wouldn't it be nicer to just explicitly declare fsinfoat()

Um...  fsinfo() takes AT_* flags.  It's fsinfoat(), ffsinfo() and lfsinfo()
all rolled into one, plus a couple of extra bits.  It doesn't really need an
at-suffix on the name as there's no at-less original.

David


* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-07-01 13:22 UTC (permalink / raw)
  To: Roberto Sassu, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98
In-Reply-To: <20190523121803.21638-1-roberto.sassu@huawei.com>

On Thu, 2019-05-23 at 14:18 +0200, Roberto Sassu wrote:
> This patch set aims at solving the following use case: appraise files from
> the initial ram disk. To do that, IMA checks the signature/hash from the
> security.ima xattr. Unfortunately, this use case cannot be implemented
> currently, as the CPIO format does not support xattrs.
> 
> This proposal consists in including file metadata as additional files named
> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
> kernel recognizes these special files from the file name, and calls the
> appropriate parser to add metadata to the previously extracted file. It has
> been proposed to use bit 17:16 of the file mode as a way to recognize files
> with metadata, but both the kernel and the cpio tool declare the file mode
> as unsigned short.
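
A small sketch of why the mode-bit approach does not work with a 16-bit field, assuming unsigned short is 16 bits wide (as on Linux). The MODE_METADATA_BITS value and the helper name are illustrative only.

```c
#include <assert.h>

/* Bits 17:16, the ones proposed for marking metadata entries (value assumed). */
#define MODE_METADATA_BITS 0x30000UL

/* Round-trip a mode through an unsigned short, as the parsers would. */
static unsigned long store_mode_in_ushort(unsigned long mode)
{
	unsigned short m = (unsigned short)mode;	/* bits above 15 are dropped */

	return m;
}
```

Any marker placed in bits 17:16 is lost as soon as the mode is stored, which is why the series uses a specially named file instead.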

Thanks, Roberto!

Victor, Taras, Rob, Arvind, Peter, if you're good with this latest
design, could we get some Reviewed-by, Acked-by, or Tested-by?

thanks!

Mimi


* Re: [PATCH v5 3/3] fpga: dfl: fme: add power management support
From: Guenter Roeck @ 2019-07-01 13:25 UTC (permalink / raw)
  To: Wu Hao, mdf, linux-fpga, linux-kernel
  Cc: linux-api, linux-hwmon, jdelvare, atull, gregkh, Luwei Kang,
	Xu Yilun
In-Reply-To: <1561963027-4213-4-git-send-email-hao.wu@intel.com>

On 6/30/19 11:37 PM, Wu Hao wrote:
> This patch adds support for power management private feature under
> FPGA Management Engine (FME). This private feature driver registers
> a hwmon for power (power1_input), thresholds information, e.g.
> (power1_max / crit / max_alarm / crit_alarm) and also read-only sysfs
> interfaces for other power management information. For configuration,
> user could write threshold values via above power1_max / crit sysfs
> interface under hwmon too.
> 
> Signed-off-by: Luwei Kang <luwei.kang@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@intel.com>
> Signed-off-by: Wu Hao <hao.wu@intel.com>

Acked-by: Guenter Roeck <linux@roeck-us.net>

> ---
> v2: create a dfl_fme_power hwmon to expose power sysfs interfaces.
>      move all sysfs interfaces under hwmon
>          consumed          --> hwmon power1_input
>          threshold1        --> hwmon power1_cap
>          threshold2        --> hwmon power1_crit
>          threshold1_status --> hwmon power1_cap_status
>          threshold2_status --> hwmon power1_crit_status
>          xeon_limit        --> hwmon power1_xeon_limit
>          fpga_limit        --> hwmon power1_fpga_limit
>          ltr               --> hwmon power1_ltr
> v3: rename some hwmon sysfs interfaces to follow hwmon ABI.
> 	power1_cap         --> power1_max
> 	power1_cap_status  --> power1_max_alarm
> 	power1_crit_status --> power1_crit_alarm
>      update sysfs doc for above sysfs interface changes.
>      replace scnprintf with sprintf in sysfs interface.
> v4: use HWMON_CHANNEL_INFO.
>      update date in sysfs doc.
> v5: clamp threshold inputs in power_hwmon_write function.
>      update sysfs doc as threshold inputs are clamped now.
>      add more descriptions to ltr sysfs interface.
> ---
>   Documentation/ABI/testing/sysfs-platform-dfl-fme |  68 +++++++
>   drivers/fpga/dfl-fme-main.c                      | 216 +++++++++++++++++++++++
>   2 files changed, 284 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-platform-dfl-fme b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> index 2cd17dc..5c2e49d 100644
> --- a/Documentation/ABI/testing/sysfs-platform-dfl-fme
> +++ b/Documentation/ABI/testing/sysfs-platform-dfl-fme
> @@ -127,6 +127,7 @@ Contact:	Wu Hao <hao.wu@intel.com>
>   Description:	Read-Only. Read this file to get the name of hwmon device, it
>   		supports values:
>   		    'dfl_fme_thermal' - thermal hwmon device name
> +		    'dfl_fme_power'   - power hwmon device name
>   
>   What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/temp1_input
>   Date:		June 2019
> @@ -183,3 +184,70 @@ Description:	Read-Only. Read this file to get the policy of hardware threshold1
>   		(see 'temp1_max'). It only supports two values (policies):
>   		    0 - AP2 state (90% throttling)
>   		    1 - AP1 state (50% throttling)
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_input
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns current FPGA power consumption in uW.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Write. Read this file to get current hardware power
> +		threshold1 in uW. If power consumption rises at or above
> +		this threshold, hardware starts 50% throttling.
> +		Write this file to set current hardware power threshold1 in uW.
> +		As hardware only accepts values in Watts, so input value will
> +		be round down per Watts (< 1 watts part will be discarded) and
> +		clamped within the range from 0 to 127 Watts. Write fails with
> +		-EINVAL if input parsing fails.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Write. Read this file to get current hardware power
> +		threshold2 in uW. If power consumption rises at or above
> +		this threshold, hardware starts 90% throttling.
> +		Write this file to set current hardware power threshold2 in uW.
> +		As hardware only accepts values in Watts, so input value will
> +		be round down per Watts (< 1 watts part will be discarded) and
> +		clamped within the range from 0 to 127 Watts. Write fails with
> +		-EINVAL if input parsing fails.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_max_alarm
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. It returns 1 if power consumption is currently at or
> +		above hardware threshold1 (see 'power1_max'), otherwise 0.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_crit_alarm
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. It returns 1 if power consumption is currently at or
> +		above hardware threshold2 (see 'power1_crit'), otherwise 0.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_xeon_limit
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns power limit for XEON in uW.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_fpga_limit
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-Only. It returns power limit for FPGA in uW.
> +
> +What:		/sys/bus/platform/devices/dfl-fme.0/hwmon/hwmonX/power1_ltr
> +Date:		June 2019
> +KernelVersion:	5.3
> +Contact:	Wu Hao <hao.wu@intel.com>
> +Description:	Read-only. Read this file to get current Latency Tolerance
> +		Reporting (ltr) value. It returns 1 if all Accelerated
> +		Function Units (AFUs) can tolerate latency >= 40us for memory
> +		access or 0 if any AFU is latency sensitive (< 40us).
> diff --git a/drivers/fpga/dfl-fme-main.c b/drivers/fpga/dfl-fme-main.c
> index 59ff9f1..1ff386d 100644
> --- a/drivers/fpga/dfl-fme-main.c
> +++ b/drivers/fpga/dfl-fme-main.c
> @@ -400,6 +400,218 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
>   	.uinit = fme_thermal_mgmt_uinit,
>   };
>   
> +#define FME_PWR_STATUS		0x8
> +#define FME_LATENCY_TOLERANCE	BIT_ULL(18)
> +#define PWR_CONSUMED		GENMASK_ULL(17, 0)
> +
> +#define FME_PWR_THRESHOLD	0x10
> +#define PWR_THRESHOLD1		GENMASK_ULL(6, 0)	/* in Watts */
> +#define PWR_THRESHOLD2		GENMASK_ULL(14, 8)	/* in Watts */
> +#define PWR_THRESHOLD_MAX	0x7f			/* in Watts */
> +#define PWR_THRESHOLD1_STATUS	BIT_ULL(16)
> +#define PWR_THRESHOLD2_STATUS	BIT_ULL(17)
> +
> +#define FME_PWR_XEON_LIMIT	0x18
> +#define XEON_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
> +#define XEON_PWR_EN		BIT_ULL(15)
> +#define FME_PWR_FPGA_LIMIT	0x20
> +#define FPGA_PWR_LIMIT		GENMASK_ULL(14, 0)	/* in 0.1 Watts */
> +#define FPGA_PWR_EN		BIT_ULL(15)
> +
> +static int power_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
> +			    u32 attr, int channel, long *val)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	switch (attr) {
> +	case hwmon_power_input:
> +		v = readq(feature->ioaddr + FME_PWR_STATUS);
> +		*val = (long)(FIELD_GET(PWR_CONSUMED, v) * 1000000);
> +		break;
> +	case hwmon_power_max:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		*val = (long)(FIELD_GET(PWR_THRESHOLD1, v) * 1000000);
> +		break;
> +	case hwmon_power_crit:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		*val = (long)(FIELD_GET(PWR_THRESHOLD2, v) * 1000000);
> +		break;
> +	case hwmon_power_max_alarm:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		*val = (long)FIELD_GET(PWR_THRESHOLD1_STATUS, v);
> +		break;
> +	case hwmon_power_crit_alarm:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		*val = (long)FIELD_GET(PWR_THRESHOLD2_STATUS, v);
> +		break;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +
> +	return 0;
> +}
> +
> +static int power_hwmon_write(struct device *dev, enum hwmon_sensor_types type,
> +			     u32 attr, int channel, long val)
> +{
> +	struct dfl_feature_platform_data *pdata = dev_get_platdata(dev->parent);
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	int ret = 0;
> +	u64 v;
> +
> +	val = clamp_val(val / 1000000, 0, PWR_THRESHOLD_MAX);
> +
> +	mutex_lock(&pdata->lock);
> +
> +	switch (attr) {
> +	case hwmon_power_max:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		v &= ~PWR_THRESHOLD1;
> +		v |= FIELD_PREP(PWR_THRESHOLD1, val);
> +		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> +		break;
> +	case hwmon_power_crit:
> +		v = readq(feature->ioaddr + FME_PWR_THRESHOLD);
> +		v &= ~PWR_THRESHOLD2;
> +		v |= FIELD_PREP(PWR_THRESHOLD2, val);
> +		writeq(v, feature->ioaddr + FME_PWR_THRESHOLD);
> +		break;
> +	default:
> +		ret = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	mutex_unlock(&pdata->lock);
> +
> +	return ret;
> +}
> +
> +static umode_t power_hwmon_attrs_visible(const void *drvdata,
> +					 enum hwmon_sensor_types type,
> +					 u32 attr, int channel)
> +{
> +	switch (attr) {
> +	case hwmon_power_input:
> +	case hwmon_power_max_alarm:
> +	case hwmon_power_crit_alarm:
> +		return 0444;
> +	case hwmon_power_max:
> +	case hwmon_power_crit:
> +		return 0644;
> +	}
> +
> +	return 0;
> +}
> +
> +static const struct hwmon_ops power_hwmon_ops = {
> +	.is_visible = power_hwmon_attrs_visible,
> +	.read = power_hwmon_read,
> +	.write = power_hwmon_write,
> +};
> +
> +static const struct hwmon_channel_info *power_hwmon_info[] = {
> +	HWMON_CHANNEL_INFO(power, HWMON_P_INPUT |
> +				  HWMON_P_MAX   | HWMON_P_MAX_ALARM |
> +				  HWMON_P_CRIT  | HWMON_P_CRIT_ALARM),
> +	NULL
> +};
> +
> +static const struct hwmon_chip_info power_hwmon_chip_info = {
> +	.ops = &power_hwmon_ops,
> +	.info = power_hwmon_info,
> +};
> +
> +static ssize_t power1_xeon_limit_show(struct device *dev,
> +				      struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u16 xeon_limit = 0;
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_PWR_XEON_LIMIT);
> +
> +	if (FIELD_GET(XEON_PWR_EN, v))
> +		xeon_limit = FIELD_GET(XEON_PWR_LIMIT, v);
> +
> +	return sprintf(buf, "%u\n", xeon_limit * 100000);
> +}
> +
> +static ssize_t power1_fpga_limit_show(struct device *dev,
> +				      struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u16 fpga_limit = 0;
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_PWR_FPGA_LIMIT);
> +
> +	if (FIELD_GET(FPGA_PWR_EN, v))
> +		fpga_limit = FIELD_GET(FPGA_PWR_LIMIT, v);
> +
> +	return sprintf(buf, "%u\n", fpga_limit * 100000);
> +}
> +
> +static ssize_t power1_ltr_show(struct device *dev,
> +			       struct device_attribute *attr, char *buf)
> +{
> +	struct dfl_feature *feature = dev_get_drvdata(dev);
> +	u64 v;
> +
> +	v = readq(feature->ioaddr + FME_PWR_STATUS);
> +
> +	return sprintf(buf, "%u\n",
> +		       (unsigned int)FIELD_GET(FME_LATENCY_TOLERANCE, v));
> +}
> +
> +static DEVICE_ATTR_RO(power1_xeon_limit);
> +static DEVICE_ATTR_RO(power1_fpga_limit);
> +static DEVICE_ATTR_RO(power1_ltr);
> +
> +static struct attribute *power_extra_attrs[] = {
> +	&dev_attr_power1_xeon_limit.attr,
> +	&dev_attr_power1_fpga_limit.attr,
> +	&dev_attr_power1_ltr.attr,
> +	NULL
> +};
> +
> +ATTRIBUTE_GROUPS(power_extra);
> +
> +static int fme_power_mgmt_init(struct platform_device *pdev,
> +			       struct dfl_feature *feature)
> +{
> +	struct device *hwmon;
> +
> +	dev_dbg(&pdev->dev, "FME Power Management Init.\n");
> +
> +	hwmon = devm_hwmon_device_register_with_info(&pdev->dev,
> +						     "dfl_fme_power", feature,
> +						     &power_hwmon_chip_info,
> +						     power_extra_groups);
> +	if (IS_ERR(hwmon)) {
> +		dev_err(&pdev->dev, "Fail to register power hwmon\n");
> +		return PTR_ERR(hwmon);
> +	}
> +
> +	return 0;
> +}
> +
> +static void fme_power_mgmt_uinit(struct platform_device *pdev,
> +				 struct dfl_feature *feature)
> +{
> +	dev_dbg(&pdev->dev, "FME Power Management UInit.\n");
> +}
> +
> +static const struct dfl_feature_id fme_power_mgmt_id_table[] = {
> +	{.id = FME_FEATURE_ID_POWER_MGMT,},
> +	{0,}
> +};
> +
> +static const struct dfl_feature_ops fme_power_mgmt_ops = {
> +	.init = fme_power_mgmt_init,
> +	.uinit = fme_power_mgmt_uinit,
> +};
> +
>   static struct dfl_feature_driver fme_feature_drvs[] = {
>   	{
>   		.id_table = fme_hdr_id_table,
> @@ -418,6 +630,10 @@ static void fme_thermal_mgmt_uinit(struct platform_device *pdev,
>   		.ops = &fme_thermal_mgmt_ops,
>   	},
>   	{
> +		.id_table = fme_power_mgmt_id_table,
> +		.ops = &fme_power_mgmt_ops,
> +	},
> +	{
>   		.ops = NULL,
>   	},
>   };
> 
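
For readers checking the v5 clamping behaviour, a userspace sketch of the microwatt-to-Watt conversion done in power_hwmon_write(). clamp_long() is a hypothetical stand-in for the kernel's clamp_val(), and uw_to_threshold_watts() is an illustrative name, not a function from the patch.

```c
#include <assert.h>

#define PWR_THRESHOLD_MAX 0x7f	/* in Watts, as in the patch */

/* Userspace stand-in for the kernel's clamp_val(); name is hypothetical. */
static long clamp_long(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Convert a sysfs input in microwatts to the whole-Watt register value. */
static long uw_to_threshold_watts(long uw)
{
	return clamp_long(uw / 1000000, 0, PWR_THRESHOLD_MAX);
}
```

Under these assumptions a write of 45999999 uW stores 45 W, a negative input stores 0 W, and anything above 127000000 uW stores 127 W, matching the rounding and clamping described in the sysfs documentation.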


* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-07-01 13:42 UTC (permalink / raw)
  To: Mimi Zohar, Rob Landley, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <1561909199.3985.33.camel@linux.ibm.com>

On 6/30/2019 6:39 PM, Mimi Zohar wrote:
> On Wed, 2019-06-26 at 10:15 +0200, Roberto Sassu wrote:
>> On 6/3/2019 8:32 PM, Rob Landley wrote:
>>> On 6/3/19 4:31 AM, Roberto Sassu wrote:
>>>>> This patch set aims at solving the following use case: appraise files from
>>>>> the initial ram disk. To do that, IMA checks the signature/hash from the
>>>>> security.ima xattr. Unfortunately, this use case cannot be implemented
>>>>> currently, as the CPIO format does not support xattrs.
>>>>>
>>>>> This proposal consists in including file metadata as additional files named
>>>>> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
>>>>> kernel recognizes these special files from the file name, and calls the
>>>>> appropriate parser to add metadata to the previously extracted file. It has
>>>>> been proposed to use bit 17:16 of the file mode as a way to recognize files
>>>>> with metadata, but both the kernel and the cpio tool declare the file mode
>>>>> as unsigned short.
>>>>
>>>> Any opinion on this patch set?
>>>>
>>>> Thanks
>>>>
>>>> Roberto
>>>
>>> Sorry, I've had the window open since you posted it but haven't gotten around to
>>> it. I'll try to build it later today.
>>>
>>> It does look interesting, and I have no objections to the basic approach. I
>>> should be able to add support to toybox cpio over a weekend once I've got the
>>> kernel doing it to test against.
>>
>> Ok.
>>
>> Let me give some instructions so that people can test this patch set.
>>
>> To add xattrs to the ram disk embedded in the kernel it is sufficient
>> to set CONFIG_INITRAMFS_FILE_METADATA="xattr" and
>> CONFIG_INITRAMFS_SOURCE="<file with xattr>" in the kernel configuration.
>>
>> To add xattrs to the external ram disk, it is necessary to patch cpio:
>>
>> https://github.com/euleros/cpio/commit/531cabc88e9ecdc3231fad6e4856869baa9a91ef
>> (xattr-v1 branch)
>>
>> and dracut:
>>
>> https://github.com/euleros/dracut/commit/a2dee56ea80495c2c1871bc73186f7b00dc8bf3b
>> (digest-lists branch)
>>
>> The same modification can be done for mkinitramfs (add '-e xattr' to the
>> cpio command line).
>>
>> To simplify the test, it would be sufficient to replace only the cpio
>> binary and the dracut script with the modified versions. For dracut, the
>> patch should be applied to the local dracut (after it has been renamed
>> to dracut.sh).
>>
>> Then, run:
>>
>> dracut -e xattr -I <file with xattr> (add -f to overwrite the ram disk)
>>
>> Xattrs can be seen by stopping the boot process for example by adding
>> rd.break to the kernel command line.
> 
> A simple way of testing, without needing any changes other than the
> kernel patches, is to save the dracut temporary directory by supplying
> "--keep" on the dracut command line, calling
> usr/gen_initramfs_list.sh, followed by usr/gen_init_cpio with the "-e
> xattr" option.

Alternatively, follow the instructions to create the embedded ram disk
with xattrs, and use the existing external ram disk created with dracut
to check if xattrs are created.

Roberto

-- 
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI


* Re: [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Mimi Zohar @ 2019-07-01 14:31 UTC (permalink / raw)
  To: Roberto Sassu, Rob Landley, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <45164486-782f-a442-e442-6f56f9299c66@huawei.com>

On Mon, 2019-07-01 at 16:42 +0300, Roberto Sassu wrote:
> On 6/30/2019 6:39 PM, Mimi Zohar wrote:
> > On Wed, 2019-06-26 at 10:15 +0200, Roberto Sassu wrote:
> >> On 6/3/2019 8:32 PM, Rob Landley wrote:
> >>> On 6/3/19 4:31 AM, Roberto Sassu wrote:
> >>>>> This patch set aims at solving the following use case: appraise files from
> >>>>> the initial ram disk. To do that, IMA checks the signature/hash from the
> >>>>> security.ima xattr. Unfortunately, this use case cannot be implemented
> >>>>> currently, as the CPIO format does not support xattrs.
> >>>>>
> >>>>> This proposal consists in including file metadata as additional files named
> >>>>> METADATA!!!, for each file added to the ram disk. The CPIO parser in the
> >>>>> kernel recognizes these special files from the file name, and calls the
> >>>>> appropriate parser to add metadata to the previously extracted file. It has
> >>>>> been proposed to use bit 17:16 of the file mode as a way to recognize files
> >>>>> with metadata, but both the kernel and the cpio tool declare the file mode
> >>>>> as unsigned short.
> >>>>
> >>>> Any opinion on this patch set?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Roberto
> >>>
> >>> Sorry, I've had the window open since you posted it but haven't gotten around to
> >>> it. I'll try to build it later today.
> >>>
> >>> It does look interesting, and I have no objections to the basic approach. I
> >>> should be able to add support to toybox cpio over a weekend once I've got the
> >>> kernel doing it to test against.
> >>
> >> Ok.
> >>
> >> Let me give some instructions so that people can test this patch set.
> >>
> >> To add xattrs to the ram disk embedded in the kernel it is sufficient
> >> to set CONFIG_INITRAMFS_FILE_METADATA="xattr" and
> >> CONFIG_INITRAMFS_SOURCE="<file with xattr>" in the kernel configuration.
> >>
> >> To add xattrs to the external ram disk, it is necessary to patch cpio:
> >>
> >> https://github.com/euleros/cpio/commit/531cabc88e9ecdc3231fad6e4856869baa9a91ef
> >> (xattr-v1 branch)
> >>
> >> and dracut:
> >>
> >> https://github.com/euleros/dracut/commit/a2dee56ea80495c2c1871bc73186f7b00dc8bf3b
> >> (digest-lists branch)
> >>
> >> The same modification can be done for mkinitramfs (add '-e xattr' to the
> >> cpio command line).
> >>
> >> To simplify the test, it would be sufficient to replace only the cpio
> >> binary and the dracut script with the modified versions. For dracut, the
> >> patch should be applied to the local dracut (after it has been renamed
> >> to dracut.sh).
> >>
> >> Then, run:
> >>
> >> dracut -e xattr -I <file with xattr> (add -f to overwrite the ram disk)
> >>
> >> Xattrs can be seen by stopping the boot process for example by adding
> >> rd.break to the kernel command line.
> > 
> > A simple way of testing, without needing any changes other than the
> > kernel patches, is to save the dracut temporary directory by supplying
> > "--keep" on the dracut command line, calling
> > usr/gen_initramfs_list.sh, followed by usr/gen_init_cpio with the "-e
> > xattr" option.
> 
> Alternatively, follow the instructions to create the embedded ram disk
> with xattrs, and use the existing external ram disk created with dracut
> to check if xattrs are created.

True, but this alternative is for those who normally use dracut to
create an initramfs, but don't want to update cpio or dracut.

Mimi


* Re: [PATCH 2/6] Adjust watch_queue documentation to mention mount and superblock watches. [ver #5]
From: Randy Dunlap @ 2019-07-01 14:52 UTC (permalink / raw)
  To: David Howells
  Cc: viro, Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, keyrings, linux-usb,
	linux-security-module, linux-fsdevel, linux-api, linux-block,
	linux-kernel
In-Reply-To: <8212.1561971170@warthog.procyon.org.uk>

On 7/1/19 1:52 AM, David Howells wrote:
> Randy Dunlap <rdunlap@infradead.org> wrote:
> 
>> I'm having a little trouble parsing that sentence.
>> Could you clarify it or maybe rewrite/modify it?
>> Thanks.
> 
> How about:
> 
>   * ``info_filter`` and ``info_mask`` act as a filter on the info field of the
>     notification record.  The notification is only written into the buffer if::
> 
> 	(watch.info & info_mask) == info_filter
> 
>     This could be used, for example, to ignore events that are not exactly on
>     the watched point in a mount tree by specifying that
>     NOTIFY_MOUNT_IN_SUBTREE must not be set, e.g.::
> 
> 	{
> 		.type = WATCH_TYPE_MOUNT_NOTIFY,
> 		.info_filter = 0,
> 		.info_mask = NOTIFY_MOUNT_IN_SUBTREE,
> 		.subtype_filter = ...,
> 	}
> 
>     as an event would only be permitted by this filter if::
> 
>     	(watch.info & NOTIFY_MOUNT_IN_SUBTREE) == 0
> 
> David
> 

Yes, better.  Thanks.
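
As a rough illustration of the rule quoted above (not taken from the patch
set), the filter test can be sketched in plain C.  The function name and the
EXAMPLE_IN_SUBTREE bit are hypothetical stand-ins; the real
NOTIFY_MOUNT_IN_SUBTREE value comes from the watch_queue UAPI header and is
not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for NOTIFY_MOUNT_IN_SUBTREE; not the real value. */
#define EXAMPLE_IN_SUBTREE 0x1u

/*
 * Return true if a notification whose info field is 'info' would be
 * written into the buffer under the rule
 *     (watch.info & info_mask) == info_filter
 */
bool info_passes_filter(uint32_t info, uint32_t info_mask,
			uint32_t info_filter)
{
	return (info & info_mask) == info_filter;
}
```

With info_mask = EXAMPLE_IN_SUBTREE and info_filter = 0, only records that
have the bit clear pass; bits outside the mask are ignored.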

-- 
~Randy

* Re: [PATCH v3 2/2] arch: wire-up clone3() syscall
From: Arnd Bergmann @ 2019-07-01 15:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Guenter Roeck, Al Viro, Linux Kernel Mailing List, Linus Torvalds,
	Jann Horn, Kees Cook, Florian Weimer, Oleg Nesterov,
	David Howells, Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers, Ley Foon Tan,
	moderated list:NIOS2 ARCHITECTURE
In-Reply-To: <20190621153012.fxwhx25mzmzueqh7@brauner.io>

On Fri, Jun 21, 2019 at 5:30 PM Christian Brauner <christian@brauner.io> wrote:
> On Fri, Jun 21, 2019 at 04:20:15PM +0200, Arnd Bergmann wrote:
> > On Fri, Jun 21, 2019 at 1:18 PM Christian Brauner <christian@brauner.io> wrote:
> Hm, if you believe that this is fine and want to "vouch" for it by
> whipping up a patch that replaces the wiring up done in [1] I'm happy to
> take it. :) Otherwise I'd feel more comfortable not adding all arches at
> once.
>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=clone

Sorry for my late reply. I had actually looked at the implementations in a
little more detail, and I think you are right that adding these is better
left to the arch maintainers in the case of clone3.

      Arnd

* Re: [PATCH v3 2/2] arch: wire-up clone3() syscall
From: Christian Brauner @ 2019-07-01 15:24 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Guenter Roeck, Al Viro, Linux Kernel Mailing List, Linus Torvalds,
	Jann Horn, Kees Cook, Florian Weimer, Oleg Nesterov,
	David Howells, Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers, Ley Foon Tan,
	moderated list:NIOS2 ARCHITECTURE
In-Reply-To: <CAK8P3a0f_=q88JB=t7fbmweAbZ2E2_uCMt+2JoBYx3od_M6fHQ@mail.gmail.com>

On Mon, Jul 01, 2019 at 05:14:51PM +0200, Arnd Bergmann wrote:
> On Fri, Jun 21, 2019 at 5:30 PM Christian Brauner <christian@brauner.io> wrote:
> > On Fri, Jun 21, 2019 at 04:20:15PM +0200, Arnd Bergmann wrote:
> > > On Fri, Jun 21, 2019 at 1:18 PM Christian Brauner <christian@brauner.io> wrote:
> > Hm, if you believe that this is fine and want to "vouch" for it by
> > whipping up a patch that replaces the wiring up done in [1] I'm happy to
> > take it. :) Otherwise I'd feel more comfortable not adding all arches at
> > once.
> >
> > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=clone
> 
> Sorry for my late reply. I had actually looked at the implementations in a
> little more detail, and I think you are right that adding these is better
> left to the arch maintainers in the case of clone3.

Perfect, thanks!

Christian

* [PATCH v6 00/17] fs-verity: read-only file-based authenticity protection
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh

Hello,

This is a redesigned version of the fs-verity patchset, implementing
Ted's suggestion to build the Merkle tree in the kernel
(https://lore.kernel.org/linux-fsdevel/20190207031101.GA7387@mit.edu/).
This greatly simplifies the UAPI, since the verity metadata no longer
needs to be transferred to the kernel.  Now to enable fs-verity on a
file, one simply calls FS_IOC_ENABLE_VERITY, passing it this structure:

	struct fsverity_enable_arg {
		__u32 version;
		__u32 hash_algorithm;
		__u32 block_size;
		__u32 salt_size;
		__u64 salt_ptr;
		__u32 sig_size;
		__u32 __reserved1;
		__u64 sig_ptr;
		__u64 __reserved2[11];
	};

The filesystem then builds the file's Merkle tree and stores it in a
filesystem-specific location associated with the file.  Afterwards,
FS_IOC_MEASURE_VERITY can be used to retrieve the file measurement
("root hash").  The way the file measurement is computed is also
effectively part of the API (it has to be), but it's logically
independent of where/how the filesystem stores the Merkle tree.
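
To make the call sequence concrete, here is a minimal userspace sketch.  It
assumes the UAPI header from this series is installed as <linux/fsverity.h>,
and that SHA-256 with a page-sized Merkle tree block is wanted.  The helper
names init_enable_arg() and enable_verity() are made up for this example, and
no salt or signature is passed.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fsverity.h>

/* Fill in the argument: version 1, SHA-256, page-sized blocks. */
void init_enable_arg(struct fsverity_enable_arg *arg)
{
	/* Zeroing also clears the reserved fields, as the API requires. */
	memset(arg, 0, sizeof(*arg));
	arg->version = 1;
	arg->hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
	arg->block_size = (__u32)sysconf(_SC_PAGESIZE);
}

/* Returns 0 on success, -1 on failure. */
int enable_verity(const char *path)
{
	struct fsverity_enable_arg arg;
	int fd, ret;

	init_enable_arg(&arg);

	/* The ioctl has to be issued on a read-only file descriptor. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
	close(fd);
	return ret;
}
```

On a filesystem without the verity feature this is expected to fail rather
than succeed silently; the documentation patch lists the possible errno
values.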

The API is fully documented in Documentation/filesystems/fsverity.rst,
along with other aspects of fs-verity.  I also added an FAQ section that
answers frequently asked questions about fs-verity, e.g. why isn't it
all at the VFS level, why isn't it part of IMA, why does the Merkle tree
need to be stored on-disk, etc.

Overview
--------

This patchset implements fs-verity for ext4 and f2fs.  fs-verity is
similar to dm-verity, but implemented on a per-file basis: a Merkle tree
is used to measure (hash) a read-only file's data as it is paged in.
ext4 and f2fs hide this Merkle tree beyond the end of the file, but
other filesystems can implement it differently if desired.
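
For a feel of how much space the Merkle tree takes, here is a small sketch.
It assumes the construction described in the documentation patch: hash every
block, pack the hashes into blocks, and repeat until a single block remains.
merkle_tree_blocks() is a hypothetical helper written for this example, not a
kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the Merkle tree blocks needed for a file of 'file_size' bytes,
 * given the tree block size and the digest size of the hash algorithm.
 * A file that fits in one block (or is empty) needs no tree blocks.
 */
uint64_t merkle_tree_blocks(uint64_t file_size, uint32_t block_size,
			    uint32_t digest_size)
{
	uint64_t hashes_per_block, blocks, total = 0;

	/* At least two hashes must fit in a block, or levels never shrink. */
	assert(digest_size > 0 && block_size / digest_size >= 2);
	hashes_per_block = block_size / digest_size;

	/* Number of data blocks, rounding up without overflowing. */
	blocks = file_size / block_size + (file_size % block_size != 0);

	/* Each pass computes the size of the next level up. */
	while (blocks > 1) {
		blocks = (blocks + hashes_per_block - 1) / hashes_per_block;
		total += blocks;
	}
	return total;
}
```

With 4096-byte blocks and 32-byte SHA-256 digests, 128 hashes fit per block,
so a 1 GiB file needs 2048 + 16 + 1 = 2065 tree blocks, roughly 1/127 of the
file size.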

In general, fs-verity is intended for use on writable filesystems;
dm-verity is still recommended on read-only ones.

Similar to fscrypt, most of the code is in fs/verity/, and not too many
filesystem-specific changes are needed.  The Merkle tree is built by the
filesystem when the FS_IOC_ENABLE_VERITY ioctl is executed.

fs-verity provides a file measurement (hash) in constant time and
verifies data on-demand.  Thus, it is useful for efficiently verifying
the authenticity of large files of which only a small portion may be
accessed, such as Android application package (APK) files.  It may also
be useful in "audit" use cases where file hashes are logged.
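
Retrieving the measurement from userspace could look roughly like this.  The
sketch assumes the UAPI header from this series is installed as
<linux/fsverity.h>; measure_verity() and MAX_DIGEST_SIZE are names invented
for the example, with 64 bytes chosen because that is enough for SHA-512.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fsverity.h>

#define MAX_DIGEST_SIZE 64	/* example buffer size; fits SHA-512 */

/*
 * Return a malloc'ed fsverity_digest for 'path', or NULL on failure.
 * The caller frees the result.
 */
struct fsverity_digest *measure_verity(const char *path)
{
	struct fsverity_digest *d;
	int fd;

	d = calloc(1, sizeof(*d) + MAX_DIGEST_SIZE);
	if (!d)
		return NULL;

	/* On input, digest_size is the space available in digest[]. */
	d->digest_size = MAX_DIGEST_SIZE;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(d);
		return NULL;
	}

	if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) != 0) {
		close(fd);
		free(d);
		return NULL;
	}

	close(fd);
	return d;
}
```

On success, digest_algorithm and digest_size are filled in by the kernel and
digest[] holds the file measurement, which can then be logged or checked
against a signature.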

fs-verity can also provide better protection against malicious disks
than an ahead-of-time hash, since fs-verity re-verifies data each time
it's paged in.  Note, however, that any authenticity guarantee is still
dependent on verification of the file measurement and other relevant
metadata in a way that makes sense for the overall system; fs-verity is
only a tool to help with this.

This patchset doesn't include IMA support for fs-verity file
measurements.  This is planned and we'd like to collaborate with the IMA
maintainers.  Although fs-verity can be used on its own without IMA,
fs-verity is primarily a lower level feature (think of it as a way of
hashing a file), so some users may still need IMA's policy mechanism.
However, an optional in-kernel signature verification mechanism within
fs-verity itself is also included.

This patchset is based on v5.2-rc3.  It can also be found in git at tag
fsverity_2019-07-01 of:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git

fs-verity has a userspace utility:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git

xfstests for fs-verity can be found at branch "fsverity" of:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfstests-dev.git

fs-verity is supported by f2fs-tools v1.11.0+ and e2fsprogs v1.45.2+.

Examples of setting up fs-verity protected files can be found in the
README.md file of fsverity-utils.

Other useful references include:

  - Documentation/filesystems/fsverity.rst, added by the first patch.

  - LWN coverage of v3 patchset: https://lwn.net/Articles/790185/

  - LWN coverage of v2 patchset: https://lwn.net/Articles/775872/

  - LWN coverage of v1 patchset: https://lwn.net/Articles/763729/

  - Presentation at Linux Security Summit North America 2018:
      - Slides: https://schd.ws/hosted_files/lssna18/af/fs-verity%20slide%20deck.pdf
      - Video: https://www.youtube.com/watch?v=Aw5h6aBhu6M
      (This corresponded to the v1 patchset; changes have been made since then.)

  - LWN coverage of LSFMM 2018 discussion: https://lwn.net/Articles/752614/

Changed since v5:

  - Switched to using detached signatures.  This simplifies the
    signature verification code considerably.

  - On f2fs, forbid enabling verity on files that have atomic or
    volatile writes pending.

  - Initialize quotas before evicting inline data.

  - Prevent writing verity metadata beyond s_maxbytes.

  - Switched from truncate_inode_pages() to invalidate_inode_pages2()
    (fixes FS_IOC_ENABLE_VERITY on ext4 with data=journal)

  - Always truncate the verity metadata if there's an error writing it,
    even if the error doesn't occur until ->end_enable_verity().

  - Updated the ext4 on-disk format documentation.

  - A few minor cleanups.

Changed since v4:

  - Made ext4 and f2fs store the verity metadata beginning at a 64K
    aligned boundary, to be ready for architectures with 64K pages.

  - Made ext4 store the verity descriptor size in the file data stream,
    so that no xattr is needed.

  - Added support for empty files.

  - A few minor cleanups.

Changed since v3:

  - The FS_IOC_GETFLAGS ioctl now returns the verity flag.

  - Fixed setting i_verity_info too early.

  - Restored pagecache invalidation in FS_IOC_ENABLE_VERITY.

  - Fixed truncation of fsverity_enable_arg::hash_algorithm.

  - Reject empty files for both open and enable, not just enable.

  - Added a couple more FAQ entries to the documentation.

  - A few minor cleanups.

  - Rebased onto v5.2-rc3.

Changed since v2:

  - Large redesign: the Merkle tree is now built by
    FS_IOC_ENABLE_VERITY, rather than being provided by userspace.  The
    fsverity_operations provide an interface for filesystems to read and
    write the Merkle tree from/to a filesystem-specific location.

  - Lots of refactoring, cleanups, and documentation improvements.

  - Many simplifications, such as simplifying the fsverity_descriptor
    format, dropping CRC-32 support, and limiting the salt size.

  - ext4 and f2fs now store an xattr that gives the location of the
    fsverity_descriptor, so loading it is more straightforward.

  - f2fs no longer counts the verity metadata in the on-disk i_size,
    making it consistent with ext4.

  - Replaced the filesystem-specific fs-verity kconfig options with
    CONFIG_FS_VERITY.

  - Replaced the filesystem-specific verity bit checks with IS_VERITY().

Changed since v1:

  - Added documentation file.

  - Require write permission for FS_IOC_ENABLE_VERITY, rather than
    CAP_SYS_ADMIN.

  - Eliminated dependency on CONFIG_BLOCK and clarified that filesystems
    can verify a page at a time rather than a bio at a time.

  - Fixed conditions for verifying holes.

  - ext4 now only allows fs-verity on extent-based files.

  - Eliminated most of the assumptions that the verity metadata is
    stored beyond EOF, in case filesystems want to do things
    differently.

  - Other cleanups.

Eric Biggers (17):
  fs-verity: add a documentation file
  fs-verity: add MAINTAINERS file entry
  fs-verity: add UAPI header
  fs: uapi: define verity bit for FS_IOC_GETFLAGS
  fs-verity: add Kconfig and the helper functions for hashing
  fs-verity: add inode and superblock fields
  fs-verity: add the hook for file ->open()
  fs-verity: add the hook for file ->setattr()
  fs-verity: add data verification hooks for ->readpages()
  fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
  fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
  fs-verity: add SHA-512 support
  fs-verity: support builtin file signatures
  ext4: add basic fs-verity support
  ext4: add fs-verity read support
  ext4: update on-disk format documentation for fs-verity
  f2fs: add fs-verity support

 Documentation/filesystems/ext4/inodes.rst   |   6 +-
 Documentation/filesystems/ext4/overview.rst |   1 +
 Documentation/filesystems/ext4/super.rst    |   2 +
 Documentation/filesystems/ext4/verity.rst   |  41 ++
 Documentation/filesystems/fsverity.rst      | 725 ++++++++++++++++++++
 Documentation/filesystems/index.rst         |   1 +
 Documentation/ioctl/ioctl-number.txt        |   1 +
 MAINTAINERS                                 |  12 +
 fs/Kconfig                                  |   2 +
 fs/Makefile                                 |   1 +
 fs/ext4/Makefile                            |   1 +
 fs/ext4/ext4.h                              |  23 +-
 fs/ext4/file.c                              |   4 +
 fs/ext4/inode.c                             |  48 +-
 fs/ext4/ioctl.c                             |  12 +
 fs/ext4/readpage.c                          | 207 +++++-
 fs/ext4/super.c                             |  18 +-
 fs/ext4/sysfs.c                             |   6 +
 fs/ext4/verity.c                            | 364 ++++++++++
 fs/f2fs/Makefile                            |   1 +
 fs/f2fs/data.c                              |  72 +-
 fs/f2fs/f2fs.h                              |  23 +-
 fs/f2fs/file.c                              |  40 ++
 fs/f2fs/inode.c                             |   5 +-
 fs/f2fs/super.c                             |   3 +
 fs/f2fs/sysfs.c                             |  11 +
 fs/f2fs/verity.c                            | 245 +++++++
 fs/f2fs/xattr.h                             |   2 +
 fs/verity/Kconfig                           |  55 ++
 fs/verity/Makefile                          |  10 +
 fs/verity/enable.c                          | 355 ++++++++++
 fs/verity/fsverity_private.h                | 185 +++++
 fs/verity/hash_algs.c                       | 279 ++++++++
 fs/verity/init.c                            |  61 ++
 fs/verity/measure.c                         |  57 ++
 fs/verity/open.c                            | 356 ++++++++++
 fs/verity/signature.c                       | 159 +++++
 fs/verity/verify.c                          | 281 ++++++++
 include/linux/fs.h                          |  11 +
 include/linux/fsverity.h                    | 209 ++++++
 include/uapi/linux/fs.h                     |   1 +
 include/uapi/linux/fsverity.h               |  40 ++
 42 files changed, 3875 insertions(+), 61 deletions(-)
 create mode 100644 Documentation/filesystems/ext4/verity.rst
 create mode 100644 Documentation/filesystems/fsverity.rst
 create mode 100644 fs/ext4/verity.c
 create mode 100644 fs/f2fs/verity.c
 create mode 100644 fs/verity/Kconfig
 create mode 100644 fs/verity/Makefile
 create mode 100644 fs/verity/enable.c
 create mode 100644 fs/verity/fsverity_private.h
 create mode 100644 fs/verity/hash_algs.c
 create mode 100644 fs/verity/init.c
 create mode 100644 fs/verity/measure.c
 create mode 100644 fs/verity/open.c
 create mode 100644 fs/verity/signature.c
 create mode 100644 fs/verity/verify.c
 create mode 100644 include/linux/fsverity.h
 create mode 100644 include/uapi/linux/fsverity.h

-- 
2.22.0

* [PATCH v6 01/17] fs-verity: add a documentation file
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add a documentation file for fs-verity, covering:

- Introduction
- Use cases
- User API
    - FS_IOC_ENABLE_VERITY
    - FS_IOC_MEASURE_VERITY
    - FS_IOC_GETFLAGS
- Accessing verity files
- File measurement computation
    - Merkle tree
    - fs-verity descriptor
- Built-in signature verification
- Filesystem support
    - ext4
    - f2fs
- Implementation details
    - Verifying data
        - Pagecache
        - Block device based filesystems
- Userspace utility
- Tests
- FAQ

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 Documentation/filesystems/fsverity.rst | 725 +++++++++++++++++++++++++
 Documentation/filesystems/index.rst    |   1 +
 2 files changed, 726 insertions(+)
 create mode 100644 Documentation/filesystems/fsverity.rst

diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
new file mode 100644
index 000000000000..3a7a44ba7bb7
--- /dev/null
+++ b/Documentation/filesystems/fsverity.rst
@@ -0,0 +1,725 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _fsverity:
+
+=======================================================
+fs-verity: read-only file-based authenticity protection
+=======================================================
+
+Introduction
+============
+
+fs-verity (``fs/verity/``) is a support layer that filesystems can
+hook into to support transparent integrity and authenticity protection
+of read-only files.  Currently, it is supported by the ext4 and f2fs
+filesystems.  Like fscrypt, not too much filesystem-specific code is
+needed to support fs-verity.
+
+fs-verity is similar to `dm-verity
+<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+but works on files rather than block devices.  On regular files on
+filesystems supporting fs-verity, userspace can execute an ioctl that
+causes the filesystem to build a Merkle tree for the file and persist
+it to a filesystem-specific location associated with the file.
+
+After this, the file is made readonly, and all reads from the file are
+automatically verified against the file's Merkle tree.  Reads of any
+corrupted data, including mmap reads, will fail.
+
+Userspace can use another ioctl to retrieve the root hash (actually
+the "file measurement", which is a hash that includes the root hash)
+that fs-verity is enforcing for the file.  This ioctl executes in
+constant time, regardless of the file size.
+
+fs-verity is essentially a way to hash a file in constant time,
+subject to the caveat that reads which would violate the hash will
+fail at runtime.
+
+Use cases
+=========
+
+By itself, the base fs-verity feature only provides integrity
+protection, i.e. detection of accidental (non-malicious) corruption.
+
+However, because fs-verity makes retrieving the file hash extremely
+efficient, it's primarily meant to be used as a tool to support
+authentication (detection of malicious modifications) or auditing
+(logging file hashes before use).
+
+Trusted userspace code (e.g. operating system code running on a
+read-only partition that is itself authenticated by dm-verity) can
+authenticate the contents of an fs-verity file by using the
+`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
+digital signature of it.
+
+A standard file hash could be used instead of fs-verity.  However,
+this is inefficient if the file is large and only a small portion may
+be accessed.  This is often the case for Android application package
+(APK) files, for example.  These typically contain many translations,
+classes, and other resources that are infrequently or even never
+accessed on a particular device.  It would be slow and wasteful to
+read and hash the entire file before starting the application.
+
+Unlike an ahead-of-time hash, fs-verity also re-verifies data each
+time it's paged in.  This ensures that malicious disk firmware can't
+undetectably change the contents of the file at runtime.
+
+fs-verity does not replace or obsolete dm-verity.  dm-verity should
+still be used on read-only filesystems.  fs-verity is for files that
+must live on a read-write filesystem because they are independently
+updated and potentially user-installed, so dm-verity cannot be used.
+
+The base fs-verity feature is a hashing mechanism only; actually
+authenticating the files is up to userspace.  However, to meet some
+users' needs, fs-verity optionally supports a simple signature
+verification mechanism where users can configure the kernel to require
+that all fs-verity files be signed by a key loaded into a keyring; see
+`Built-in signature verification`_.  Support for fs-verity file hashes
+in IMA (Integrity Measurement Architecture) policies is also planned.
+
+User API
+========
+
+FS_IOC_ENABLE_VERITY
+--------------------
+
+The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file.  It takes
+in a pointer to a :c:type:`struct fsverity_enable_arg`, defined as
+follows::
+
+    struct fsverity_enable_arg {
+            __u32 version;
+            __u32 hash_algorithm;
+            __u32 block_size;
+            __u32 salt_size;
+            __u64 salt_ptr;
+            __u32 sig_size;
+            __u32 __reserved1;
+            __u64 sig_ptr;
+            __u64 __reserved2[11];
+    };
+
+This structure contains the parameters of the Merkle tree to build for
+the file, and optionally contains a signature.  It must be initialized
+as follows:
+
+- ``version`` must be 1.
+- ``hash_algorithm`` must be the identifier for the hash algorithm to
+  use for the Merkle tree, such as FS_VERITY_HASH_ALG_SHA256.  See
+  ``include/uapi/linux/fsverity.h`` for the list of possible values.
+- ``block_size`` must be the Merkle tree block size.  Currently, this
+  must be equal to the system page size, which is usually 4096 bytes.
+  Other sizes may be supported in the future.  This value is not
+  necessarily the same as the filesystem block size.
+- ``salt_size`` is the size of the salt in bytes, or 0 if no salt is
+  provided.  The salt is a value that is prepended to every hashed
+  block; it can be used to personalize the hashing for a particular
+  file or device.  Currently the maximum salt size is 32 bytes.
+- ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
+  provided.
+- ``sig_size`` is the size of the signature in bytes, or 0 if no
+  signature is provided.  Currently the signature is (somewhat
+  arbitrarily) limited to 16128 bytes.  See `Built-in signature
+  verification`_ for more information.
+- ``sig_ptr`` is the pointer to the signature, or NULL if no
+  signature is provided.
+- All reserved fields must be zeroed.
+
+FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
+the file and persist it to a filesystem-specific location associated
+with the file, then mark the file as a verity file.  This ioctl may
+take a long time to execute on large files, and it is interruptible by
+fatal signals.
+
+FS_IOC_ENABLE_VERITY checks for write access to the inode.  However,
+it must be executed on an O_RDONLY file descriptor and no processes
+can have the file open for writing.  Attempts to open the file for
+writing while this ioctl is executing will fail with ETXTBSY.  (This
+is necessary to guarantee that no writable file descriptors will exist
+after verity is enabled, and to guarantee that the file's contents are
+stable while the Merkle tree is being built over it.)
+
+On success, FS_IOC_ENABLE_VERITY returns 0, and the file becomes a
+verity file.  On failure (including the case of interruption by a
+fatal signal), no changes are made to the file.
+
+FS_IOC_ENABLE_VERITY can fail with the following errors:
+
+- ``EACCES``: the process does not have write access to the file
+- ``EBADMSG``: the signature is malformed
+- ``EEXIST``: the file already has verity enabled
+- ``EFAULT``: the caller provided inaccessible memory
+- ``EINTR``: the operation was interrupted by a fatal signal
+- ``EINVAL``: unsupported version, hash algorithm, or block size; or
+  reserved bits are set; or the file descriptor refers to neither a
+  regular file nor a directory.
+- ``EISDIR``: the file descriptor refers to a directory
+- ``EKEYREJECTED``: the signature doesn't match the file
+- ``EMSGSIZE``: the salt or signature is too long
+- ``ENOENT``: fs-verity recognizes the hash algorithm, but it's not
+  available in the kernel's crypto API as currently configured (e.g.
+  for SHA-512, missing CONFIG_CRYPTO_SHA512).
+- ``ENOKEY``: the fs-verity keyring doesn't contain the certificate
+  needed to verify the signature
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+  support; or the filesystem superblock has not had the 'verity'
+  feature enabled on it; or the filesystem does not support fs-verity
+  on this file.  (See `Filesystem support`_.)
+- ``EPERM``: the file is append-only; or, a signature is required and
+  one was not provided.
+- ``EROFS``: the filesystem is read-only
+- ``ETXTBSY``: someone has the file open for writing.  This can be the
+  caller's file descriptor, another open file descriptor, or the file
+  reference held by a writable memory map.
+
+FS_IOC_MEASURE_VERITY
+---------------------
+
+The FS_IOC_MEASURE_VERITY ioctl retrieves the measurement of a verity
+file.  The file measurement is a digest that cryptographically
+identifies the file contents that are being enforced on reads.
+
+This ioctl takes in a pointer to a variable-length structure::
+
+    struct fsverity_digest {
+            __u16 digest_algorithm;
+            __u16 digest_size; /* input/output */
+            __u8 digest[];
+    };
+
+``digest_size`` is an input/output field.  On input, it must be
+initialized to the number of bytes allocated for the variable-length
+``digest`` field.
+
+On success, 0 is returned and the kernel fills in the structure as
+follows:
+
+- ``digest_algorithm`` will be the hash algorithm used for the file
+  measurement.  It will match ``fsverity_enable_arg::hash_algorithm``.
+- ``digest_size`` will be the size of the digest in bytes, e.g. 32
+  for SHA-256.  (This can be redundant with ``digest_algorithm``.)
+- ``digest`` will be the actual bytes of the digest.
+
+FS_IOC_MEASURE_VERITY is guaranteed to execute in constant time,
+regardless of the size of the file.
+
+FS_IOC_MEASURE_VERITY can fail with the following errors:
+
+- ``EFAULT``: the caller provided inaccessible memory
+- ``ENODATA``: the file is not a verity file
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+  support, or the filesystem superblock has not had the 'verity'
+  feature enabled on it.  (See `Filesystem support`_.)
+- ``EOVERFLOW``: the digest is longer than the specified
+  ``digest_size`` bytes.  Try providing a larger buffer.
+
+FS_IOC_GETFLAGS
+---------------
+
+The existing ioctl FS_IOC_GETFLAGS (which isn't specific to fs-verity)
+can also be used to check whether a file has fs-verity enabled or not.
+To do so, check for FS_VERITY_FL (0x00100000) in the returned flags.
+
+The verity flag is not settable via FS_IOC_SETFLAGS.  You must use
+FS_IOC_ENABLE_VERITY instead, since parameters must be provided.
+
+Accessing verity files
+======================
+
+Applications can transparently access a verity file just like a
+non-verity one, with the following exceptions:
+
+- Verity files are readonly.  They cannot be opened for writing or
+  truncate()d, even if the file mode bits allow it.  Attempts to do
+  one of these things will fail with EPERM.  However, changes to
+  metadata such as owner, mode, timestamps, and xattrs are still
+  allowed, since these are not measured by fs-verity.  Verity files
+  can also still be renamed, deleted, and linked to.
+
+- Direct I/O is not supported on verity files.  Attempts to use direct
+  I/O on such files will fall back to buffered I/O.
+
+- DAX (Direct Access) is not supported on verity files, because this
+  would circumvent the data verification.
+
+- Reads of data that doesn't match the verity Merkle tree will fail
+  with EIO (for read()) or SIGBUS (for mmap() reads).
+
+- If the sysctl "fs.verity.require_signatures" is set to 1 and the
+  file's verity measurement is not signed by a key in the fs-verity
+  keyring, then opening the file will fail.  See `Built-in signature
+  verification`_.
+
+Direct access to the Merkle tree is not supported.  Therefore, if a
+verity file is copied, or is backed up and restored, then it will lose
+its "verity"-ness.  fs-verity is primarily meant for files like
+executables that are managed by a package manager.
+
+File measurement computation
+============================
+
+This section describes how fs-verity hashes the file contents using a
+Merkle tree to produce the "file measurement" which cryptographically
+identifies the file contents.  This algorithm is the same for all
+filesystems that support fs-verity.
+
+Userspace only needs to be aware of this algorithm if it needs to
+compute the file measurement itself, e.g. in order to sign the file.
+
+.. _fsverity_merkle_tree:
+
+Merkle tree
+-----------
+
+The file's contents are divided into blocks, where the block size is
+configurable but is usually 4096 bytes.  The end of the last block is
+zero-padded if needed.  Each block is then hashed, producing the first
+level of hashes.  Then, the hashes in this first level are grouped
+into 'blocksize'-byte blocks (zero-padding the ends as needed) and
+these blocks are hashed, producing the second level of hashes.  This
+proceeds up the tree until only a single block remains.  The hash of
+this block is the "Merkle tree root hash".
+
+If the file fits in one block and is nonempty, then the "Merkle tree
+root hash" is simply the hash of the single data block.  If the file
+is empty, then the "Merkle tree root hash" is all zeroes.
+
+The "blocks" here are not necessarily the same as "filesystem blocks".
+
+If a salt was specified, then it's zero-padded to the closest multiple
+of the input size of the hash algorithm's compression function, e.g.
+64 bytes for SHA-256 or 128 bytes for SHA-512.  The padded salt is
+prepended to every data or Merkle tree block that is hashed.
+
+The purpose of the block padding is to cause every hash to be taken
+over the same amount of data, which simplifies the implementation and
+keeps open more possibilities for hardware acceleration.  The purpose
+of the salt padding is to make the salting "free" when the salted hash
+state is precomputed, then imported for each hash.
+
+Example: in the recommended configuration of SHA-256 and 4K blocks,
+128 hash values fit in each block.  Thus, each level of the Merkle
+tree is approximately 128 times smaller than the previous, and for
+large files the Merkle tree's size converges to approximately 1/127 of
+the original file size.  However, for small files, the padding is
+significant, making the space overhead proportionally more.
+
+.. _fsverity_descriptor:
+
+fs-verity descriptor
+--------------------
+
+By itself, the Merkle tree root hash is ambiguous.  For example, it
+can't distinguish a large file from a small second file whose data
+is exactly the top-level hash block of the first file.  Ambiguities
+also arise from the convention of padding to the next block boundary.
+
+To solve this problem, the verity file measurement is actually
+computed as a hash of the following structure, which contains the
+Merkle tree root hash as well as other fields such as the file size::
+
+    struct fsverity_descriptor {
+            __u8 version;           /* must be 1 */
+            __u8 hash_algorithm;    /* Merkle tree hash algorithm */
+            __u8 log_blocksize;     /* log2 of size of data and tree blocks */
+            __u8 salt_size;         /* size of salt in bytes; 0 if none */
+            __le32 sig_size;        /* must be 0 */
+            __le64 data_size;       /* size of file the Merkle tree is built over */
+            __u8 root_hash[64];     /* Merkle tree root hash */
+            __u8 salt[32];          /* salt prepended to each hashed block */
+            __u8 __reserved[144];   /* must be 0's */
+    };
+
+Note that the ``sig_size`` field must be set to 0 for the purpose of
+computing the file measurement, even if a signature was provided (or
+will be provided) to `FS_IOC_ENABLE_VERITY`_.
+
+Built-in signature verification
+===============================
+
+With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
+a portion of an authentication policy (see `Use cases`_) in the
+kernel.  Specifically, it adds support for:
+
+1. At fs-verity module initialization time, a keyring ".fs-verity" is
+   created.  The root user can add trusted X.509 certificates to this
+   keyring using the add_key() system call, then (when done)
+   optionally use keyctl_restrict_keyring() to prevent additional
+   certificates from being added.
+
+2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
+   detached signature in DER format of the file measurement.  On
+   success, this signature is persisted alongside the Merkle tree.
+   Then, any time the file is opened, the kernel will verify the
+   file's actual measurement against this signature, using the
+   certificates in the ".fs-verity" keyring.
+
+3. A new sysctl "fs.verity.require_signatures" is made available.
+   When set to 1, the kernel requires that all verity files have a
+   correctly signed file measurement as described in (2).
+
+File measurements must be signed in the following format, which is
+similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
+
+    struct fsverity_signed_digest {
+            char magic[8];                  /* must be "FSVerity" */
+            __le16 digest_algorithm;
+            __le16 digest_size;
+            __u8 digest[];
+    };
+
+fs-verity's built-in signature verification support is meant as a
+relatively simple mechanism that can be used to provide some level of
+authenticity protection for verity files, as an alternative to doing
+the signature verification in userspace or using IMA-appraisal.
+However, with this mechanism, userspace programs still need to check
+that the verity bit is set, and there is no protection against verity
+files being swapped around.
+
+Filesystem support
+==================
+
+fs-verity is currently supported by the ext4 and f2fs filesystems.
+The CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
+on either filesystem.
+
+``include/linux/fsverity.h`` declares the interface between the
+``fs/verity/`` support layer and filesystems.  Briefly, filesystems
+must provide an ``fsverity_operations`` structure that provides
+methods to read and write the verity metadata to a filesystem-specific
+location, including the Merkle tree blocks and
+``fsverity_descriptor``.  Filesystems must also call functions in
+``fs/verity/`` at certain times, such as when a file is opened or when
+pages have been read into the pagecache.  (See `Verifying data`_.)
+
+ext4
+----
+
+ext4 supports fs-verity since Linux TODO and e2fsprogs v1.45.2.
+
+To create verity files on an ext4 filesystem, the filesystem must have
+been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
+it.  "verity" is an RO_COMPAT filesystem feature, so once set, old
+kernels will only be able to mount the filesystem readonly, and old
+versions of e2fsck will be unable to check the filesystem.  Moreover,
+currently ext4 only supports mounting a filesystem with the "verity"
+feature when its block size is equal to PAGE_SIZE (often 4096 bytes).
+
+ext4 sets the EXT4_VERITY_FL on-disk inode flag on verity files.  It
+can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be cleared.
+
+ext4 also supports encryption, which can be used simultaneously with
+fs-verity.  In this case, the plaintext data is verified rather than
+the ciphertext.  This is necessary in order to make the file
+measurement meaningful, since every file is encrypted differently.
+
+ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
+past the end of the file, starting at the first 64K boundary beyond
+i_size.  This approach works because (a) verity files are readonly,
+and (b) pages fully beyond i_size aren't visible to userspace but can
+be read/written internally by ext4 with only some relatively small
+changes to ext4.  This approach avoids having to depend on the
+EA_INODE feature and on rearchitecting ext4's xattr support to
+support paging multi-gigabyte xattrs into memory, and to support
+encrypting xattrs.  Note that the verity metadata *must* be encrypted
+when the file is, since it contains hashes of the plaintext data.
+
+Currently, ext4 verity only supports the case where the Merkle tree
+block size, filesystem block size, and page size are all the same.  It
+also only supports extent-based files.
+
+f2fs
+----
+
+f2fs supports fs-verity since Linux TODO and f2fs-tools v1.11.0.
+
+To create verity files on an f2fs filesystem, the filesystem must have
+been formatted with ``-O verity``.
+
+f2fs sets the FADVISE_VERITY_BIT on-disk inode flag on verity files.
+It can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be
+cleared.
+
+Like ext4, f2fs stores the verity metadata (Merkle tree and
+fsverity_descriptor) past the end of the file, starting at the first
+64K boundary beyond i_size.  See explanation for ext4 above.
+Moreover, f2fs supports at most 4096 bytes of xattr entries per inode,
+which wouldn't be enough for even a single Merkle tree block.
+
+Currently, f2fs verity only supports a Merkle tree block size of 4096.
+Also, f2fs doesn't support enabling verity on files that currently
+have atomic or volatile writes pending.
+
+Implementation details
+======================
+
+Verifying data
+--------------
+
+fs-verity ensures that all reads of a verity file's data are verified,
+regardless of which syscall is used to do the read (e.g. mmap(),
+read(), pread()) and regardless of whether it's the first read or a
+later read (unless the later read can return cached data that was
+already verified).  Below, we describe how filesystems implement this.
+
+Pagecache
+~~~~~~~~~
+
+For filesystems using Linux's pagecache, the ``->readpage()`` and
+``->readpages()`` methods must be modified to verify pages before they
+are marked Uptodate.  Merely hooking ``->read_iter()`` would be
+insufficient, since ``->read_iter()`` is not used for memory maps.
+
+Therefore, fs/verity/ provides a function fsverity_verify_page() which
+verifies a page that has been read into the pagecache of a verity
+inode, but is still locked and not Uptodate, so it's not yet readable
+by userspace.  As needed to do the verification,
+fsverity_verify_page() will call back into the filesystem to read
+Merkle tree pages via fsverity_operations::read_merkle_tree_page().
+
+fsverity_verify_page() returns false if verification failed; in this
+case, the filesystem must not set the page Uptodate.  Following this,
+as per the usual Linux pagecache behavior, attempts by userspace to
+read() from the part of the file containing the page will fail with
+EIO, and accesses to the page within a memory map will raise SIGBUS.
+
+fsverity_verify_page() currently only supports the case where the
+Merkle tree block size is equal to PAGE_SIZE (often 4096 bytes).
+
+In principle, fsverity_verify_page() verifies the entire path in the
+Merkle tree from the data page to the root hash.  However, for
+efficiency the filesystem may cache the hash pages.  Therefore,
+fsverity_verify_page() only ascends the tree reading hash pages until
+an already-verified hash page is seen, as indicated by the PageChecked
+bit being set.  It then verifies the path to that page.
+
+This optimization, which is also used by dm-verity, results in
+excellent sequential read performance.  This is because usually (e.g.
+127 in 128 times for 4K blocks and SHA-256) the hash page from the
+bottom level of the tree will already be cached and checked from
+reading a previous data page.  However, random reads perform worse.
+
+Block device based filesystems
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Block device based filesystems (e.g. ext4 and f2fs) in Linux also use
+the pagecache, so the above subsection applies too.  However, they
+also usually read many pages from a file at once, grouped into a
+structure called a "bio".  To make it easier for these types of
+filesystems to support fs-verity, fs/verity/ also provides a function
+fsverity_verify_bio() which verifies all pages in a bio.
+
+ext4 and f2fs also support encryption.  If a verity file is also
+encrypted, the pages must be decrypted before being verified.  To
+support this, these filesystems allocate a "post-read context" for
+each bio and store it in ``->bi_private``::
+
+    struct bio_post_read_ctx {
+           struct bio *bio;
+           struct work_struct work;
+           unsigned int cur_step;
+           unsigned int enabled_steps;
+    };
+
+``enabled_steps`` is a bitmask that specifies whether decryption,
+verity, or both is enabled.  After the bio completes, for each needed
+postprocessing step the filesystem enqueues the bio_post_read_ctx on a
+workqueue, and then the workqueue work does the decryption or
+verification.  Finally, pages where no decryption or verity error
+occurred are marked Uptodate, and the pages are unlocked.
+
+Files on ext4 and f2fs may contain holes.  Normally, ``->readpages()``
+simply zeroes holes and sets the corresponding pages Uptodate; no bios
+are issued.  To prevent this case from bypassing fs-verity, these
+filesystems use fsverity_verify_page() to verify hole pages.
+
+ext4 and f2fs disable direct I/O on verity files, since otherwise
+direct I/O would bypass fs-verity.  (They also do the same for
+encrypted files.)
+
+Userspace utility
+=================
+
+This document focuses on the kernel, but a userspace utility for
+fs-verity can be found at:
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git
+
+See the README.md file in the fsverity-utils source tree for details,
+including examples of setting up fs-verity protected files.
+
+Tests
+=====
+
+To test fs-verity, use xfstests.  For example, using `kvm-xfstests
+<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
+
+    kvm-xfstests -c ext4,f2fs -g verity
+
+FAQ
+===
+
+This section answers frequently asked questions about fs-verity that
+weren't already directly answered in other parts of this document.
+
+:Q: Why isn't fs-verity part of IMA?
+:A: fs-verity and IMA (Integrity Measurement Architecture) have
+    different focuses.  fs-verity is a filesystem-level mechanism for
+    hashing individual files using a Merkle tree.  In contrast, IMA
+    defines a system-wide policy that specifies which files are
+    hashed and what to do with those hashes, such as log them,
+    authenticate them, or add them to a measurement list.
+
+    IMA is planned to support the fs-verity hashing mechanism as an
+    alternative to doing full file hashes, for people who want the
+    performance and security benefits of the Merkle tree based hash.
+    But it doesn't make sense to force all uses of fs-verity to be
+    through IMA.  As a standalone filesystem feature, fs-verity
+    already meets many users' needs, and it's testable like other
+    filesystem features e.g. with xfstests.
+
+:Q: Isn't fs-verity useless because the attacker can just modify the
+    hashes in the Merkle tree, which is stored on-disk?
+:A: To verify the authenticity of an fs-verity file you must verify
+    the authenticity of the "file measurement", which is basically the
+    root hash of the Merkle tree.  See `Use cases`_.
+
+:Q: Isn't fs-verity useless because the attacker can just replace a
+    verity file with a non-verity one?
+:A: See `Use cases`_.  In the initial use case, it's really trusted
+    userspace code that authenticates the files; fs-verity is just a
+    tool to do this job efficiently and securely.  The trusted
+    userspace code will consider non-verity files to be inauthentic.
+
+:Q: Why does the Merkle tree need to be stored on-disk?  Couldn't you
+    store just the root hash?
+:A: If the Merkle tree wasn't stored on-disk, then you'd have to
+    compute the entire tree when the file is first accessed, even if
+    just one byte is being read.  This is a fundamental consequence of
+    how Merkle tree hashing works.  To verify a leaf node, you need to
+    verify the whole path to the root hash, including the root node
+    (the thing which the root hash is a hash of).  But if the root
+    node isn't stored on-disk, you have to compute it by hashing its
+    children, and so on until you've actually hashed the entire file.
+
+    That defeats most of the point of doing a Merkle tree-based hash,
+    since if you have to hash the whole file ahead of time anyway,
+    then you could simply do sha256(file) instead.  That would be much
+    simpler, and a bit faster too.
+
+    It's true that an in-memory Merkle tree could still provide the
+    advantage of verification on every read rather than just on the
+    first read.  However, it would be inefficient because every time a
+    hash page gets evicted (you can't pin the entire Merkle tree into
+    memory, since it may be very large), in order to restore it you
+    again need to hash everything below it in the tree.  This again
+    defeats most of the point of doing a Merkle tree-based hash, since
+    a single block read could trigger re-hashing gigabytes of data.
+
+:Q: But couldn't you store just the leaf nodes and compute the rest?
+:A: See previous answer; this really just moves up one level, since
+    one could alternatively interpret the data blocks as being the
+    leaf nodes of the Merkle tree.  It's true that the tree can be
+    computed much faster if the leaf level is stored rather than just
+    the data, but that's only because each level is less than 1% the
+    size of the level below (assuming the recommended settings of
+    SHA-256 and 4K blocks).  For the exact same reason, by storing
+    "just the leaf nodes" you'd already be storing over 99% of the
+    tree, so you might as well simply store the whole tree.
+
+:Q: Can the Merkle tree be built ahead of time, e.g. distributed as
+    part of a package that is installed to many computers?
+:A: This isn't currently supported.  It was part of the original
+    design, but was removed to simplify the kernel UAPI and because it
+    wasn't a critical use case.  Files are usually installed once and
+    used many times, and cryptographic hashing is somewhat fast on
+    most modern processors.
+
+:Q: Why doesn't fs-verity support writes?
+:A: Write support would be very difficult and would require a
+    completely different design, so it's well outside the scope of
+    fs-verity.  Write support would require:
+
+    - A way to maintain consistency between the data and hashes,
+      including all levels of hashes, since corruption after a crash
+      (especially of potentially the entire file!) is unacceptable.
+      The main options for solving this are data journalling,
+      copy-on-write, and log-structured volume.  But it's very hard to
+      retrofit existing filesystems with new consistency mechanisms.
+      Data journalling is available on ext4, but is very slow.
+
+    - Rebuilding the Merkle tree after every write, which would be
+      extremely inefficient.  Alternatively, a different authenticated
+      dictionary structure such as an "authenticated skiplist" could
+      be used.  However, this would be far more complex.
+
+    Compare it to dm-verity vs. dm-integrity.  dm-verity is very
+    simple: the kernel just verifies read-only data against a
+    read-only Merkle tree.  In contrast, dm-integrity supports writes
+    but is slow, is much more complex, and doesn't actually support
+    full-device authentication since it authenticates each sector
+    independently, i.e. there is no "root hash".  It doesn't really
+    make sense for the same device-mapper target to support these two
+    very different cases; the same applies to fs-verity.
+
+:Q: Since verity files are immutable, why isn't the immutable bit set?
+:A: The existing "immutable" bit (FS_IMMUTABLE_FL) already has a
+    specific set of semantics which not only make the file contents
+    read-only, but also prevent the file from being deleted, renamed,
+    linked to, or having its owner or mode changed.  These extra
+    properties are unwanted for fs-verity, so reusing the immutable
+    bit isn't appropriate.
+
+:Q: Why does the API use ioctls instead of setxattr() and getxattr()?
+:A: Abusing the xattr interface for basically arbitrary syscalls is
+    heavily frowned upon by most of the Linux filesystem developers.
+    An xattr should really just be an xattr on-disk, not an API to
+    e.g. magically trigger construction of a Merkle tree.
+
+:Q: Does fs-verity support remote filesystems?
+:A: Only ext4 and f2fs support is implemented currently, but in
+    principle any filesystem that can store per-file verity metadata
+    can support fs-verity, regardless of whether it's local or remote.
+    Some filesystems may have fewer options of where to store the
+    verity metadata; one possibility is to store it past the end of
+    the file and "hide" it from userspace by manipulating i_size.  The
+    data verification functions provided by ``fs/verity/`` also assume
+    that the filesystem uses the Linux pagecache, but both local and
+    remote filesystems normally do so.
+
+:Q: Why is anything filesystem-specific at all?  Shouldn't fs-verity
+    be implemented entirely at the VFS level?
+:A: There are many reasons why this is not possible or would be very
+    difficult, including the following:
+
+    - To prevent bypassing verification, pages must not be marked
+      Uptodate until they've been verified.  Currently, each
+      filesystem is responsible for marking pages Uptodate via
+      ``->readpages()``.  Therefore, currently it's not possible for
+      the VFS to do the verification on its own.  Changing this would
+      require significant changes to the VFS and all filesystems.
+
+    - It would require defining a filesystem-independent way to store
+      the verity metadata.  Extended attributes don't work for this
+      because (a) the Merkle tree may be gigabytes, but many
+      filesystems assume that all xattrs fit into a single 4K
+      filesystem block, and (b) ext4 and f2fs encryption doesn't
+      encrypt xattrs, yet the Merkle tree *must* be encrypted when the
+      file contents are, because it stores hashes of the plaintext
+      file contents.
+
+      So the verity metadata would have to be stored in an actual
+      file.  Using a separate file would be very ugly, since the
+      metadata is fundamentally part of the file to be protected, and
+      it could cause problems where users could delete the real file
+      but not the metadata file or vice versa.  On the other hand,
+      having it be in the same file would break applications unless
+      filesystems' notion of i_size were divorced from the VFS's,
+      which would be complex and require changes to all filesystems.
+
+    - It's desirable that FS_IOC_ENABLE_VERITY uses the filesystem's
+      transaction mechanism so that either the file ends up with
+      verity enabled, or no changes were made.  Allowing intermediate
+      states to occur after a crash may cause problems.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..416c7f0e123a 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -31,6 +31,7 @@ filesystem implementations.
 
    journalling
    fscrypt
+   fsverity
 
 Filesystem-specific documentation
 =================================
-- 
2.22.0
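
Note on the Merkle tree sizing example in fsverity.rst above (SHA-256,
4K blocks, 128 hashes per block): the arithmetic can be checked with a
small userspace sketch.  merkle_tree_size() is a hypothetical helper,
not a kernel function.  It assumes that each level stores one hash per
block of the level below, that each level is padded to a whole block,
and that a file of at most one block needs no tree blocks at all.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): total size in bytes of
 * all Merkle tree levels for a file of 'data_size' bytes.
 */
uint64_t merkle_tree_size(uint64_t data_size, uint32_t block_size,
			  uint32_t digest_size)
{
	uint64_t hashes_per_block, blocks, tree_blocks = 0;

	/* A tree block must hold at least two hashes for the tree to shrink. */
	assert(digest_size > 0 && block_size >= 2 * digest_size);
	hashes_per_block = block_size / digest_size;

	/* Number of data blocks, rounding up (the last block is zero-padded). */
	blocks = data_size / block_size + (data_size % block_size != 0);

	/* Add one level at a time until a level fits in a single block. */
	while (blocks > 1) {
		blocks = (blocks + hashes_per_block - 1) / hashes_per_block;
		tree_blocks += blocks;
	}
	return tree_blocks * block_size;
}
```

Under these assumptions a 1 GiB file needs 2048 + 16 + 1 = 2065 tree
blocks (8458240 bytes), about 1/127 of the data size, while a 4097-byte
file needs one full 4096-byte tree block, i.e. roughly 100% overhead.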


* [PATCH v6 02/17] fs-verity: add MAINTAINERS file entry
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

fs-verity will be jointly maintained by Eric Biggers and Theodore Ts'o.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 MAINTAINERS | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a6954776a37e..655065116f92 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6505,6 +6505,18 @@ S:	Maintained
 F:	fs/notify/
 F:	include/linux/fsnotify*.h
 
+FSVERITY: READ-ONLY FILE-BASED AUTHENTICITY PROTECTION
+M:	Eric Biggers <ebiggers@kernel.org>
+M:	Theodore Y. Ts'o <tytso@mit.edu>
+L:	linux-fscrypt@vger.kernel.org
+Q:	https://patchwork.kernel.org/project/linux-fscrypt/list/
+T:	git git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git fsverity
+S:	Supported
+F:	fs/verity/
+F:	include/linux/fsverity.h
+F:	include/uapi/linux/fsverity.h
+F:	Documentation/filesystems/fsverity.rst
+
 FUJITSU LAPTOP EXTRAS
 M:	Jonathan Woithe <jwoithe@just42.net>
 L:	platform-driver-x86@vger.kernel.org
-- 
2.22.0


* [PATCH v6 03/17] fs-verity: add UAPI header
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add the UAPI header for fs-verity, including two ioctls:

- FS_IOC_ENABLE_VERITY
- FS_IOC_MEASURE_VERITY

These ioctls are documented in the "User API" section of
Documentation/filesystems/fsverity.rst.

Examples of using these ioctls can be found in fsverity-utils
(https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git).

I've also written xfstests that test these ioctls
(https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfstests-dev.git/log/?h=fsverity).

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
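
For illustration only, a minimal userspace sketch of calling
FS_IOC_MEASURE_VERITY.  The ex_ prefixed names are hypothetical local
copies of the definitions added by this patch, renamed so the sketch
builds without the new header installed; real code should include
<linux/fsverity.h> instead.  The 64-byte buffer size is an assumption
chosen to match the root_hash field of fsverity_descriptor.
digest_size is set to the buffer capacity on input and holds the
actual digest size on output.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical local copy of struct fsverity_digest from this patch. */
struct ex_fsverity_digest {
	uint16_t digest_algorithm;
	uint16_t digest_size;	/* input/output */
	uint8_t digest[];
};

#define EX_FS_IOC_MEASURE_VERITY _IOWR('f', 134, struct ex_fsverity_digest)
#define EX_MAX_DIGEST_SIZE	 64	/* assumed upper bound */

/*
 * Retrieve the file measurement of 'path' into 'out'.
 * Returns the digest size in bytes on success, or -errno on failure.
 */
int ex_measure_verity(const char *path, uint8_t *out, size_t out_size,
		      uint16_t *alg)
{
	struct ex_fsverity_digest *d;
	int fd, ret;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	d = calloc(1, sizeof(*d) + EX_MAX_DIGEST_SIZE);
	if (!d) {
		close(fd);
		return -ENOMEM;
	}
	d->digest_size = EX_MAX_DIGEST_SIZE;	/* capacity of d->digest */

	if (ioctl(fd, EX_FS_IOC_MEASURE_VERITY, d) != 0) {
		ret = -errno;		/* save errno before cleanup */
	} else if (d->digest_size > out_size) {
		ret = -EOVERFLOW;
	} else {
		memcpy(out, d->digest, d->digest_size);
		*alg = d->digest_algorithm;
		ret = d->digest_size;
	}
	free(d);
	close(fd);
	return ret;
}
```

The exact error returned for a file without verity enabled, or on a
filesystem without fs-verity support, is described in the "User API"
section of the documentation and is not assumed here.
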
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/fsverity.h        | 39 ++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)
 create mode 100644 include/uapi/linux/fsverity.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index c9558146ac58..21767c81e86d 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -225,6 +225,7 @@ Code  Seq#(hex)	Include File		Comments
 'f'	00-0F	fs/ext4/ext4.h		conflict!
 'f'	00-0F	linux/fs.h		conflict!
 'f'	00-0F	fs/ocfs2/ocfs2_fs.h	conflict!
+'f'	81-8F	linux/fsverity.h
 'g'	00-0F	linux/usb/gadgetfs.h
 'g'	20-2F	linux/usb/g_printer.h
 'h'	00-7F				conflict! Charon filesystem
diff --git a/include/uapi/linux/fsverity.h b/include/uapi/linux/fsverity.h
new file mode 100644
index 000000000000..57d1d7fc0c34
--- /dev/null
+++ b/include/uapi/linux/fsverity.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * fs-verity user API
+ *
+ * These ioctls can be used on filesystems that support fs-verity.  See the
+ * "User API" section of Documentation/filesystems/fsverity.rst.
+ *
+ * Copyright 2019 Google LLC
+ */
+#ifndef _UAPI_LINUX_FSVERITY_H
+#define _UAPI_LINUX_FSVERITY_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define FS_VERITY_HASH_ALG_SHA256	1
+
+struct fsverity_enable_arg {
+	__u32 version;
+	__u32 hash_algorithm;
+	__u32 block_size;
+	__u32 salt_size;
+	__u64 salt_ptr;
+	__u32 sig_size;
+	__u32 __reserved1;
+	__u64 sig_ptr;
+	__u64 __reserved2[11];
+};
+
+struct fsverity_digest {
+	__u16 digest_algorithm;
+	__u16 digest_size; /* input/output */
+	__u8 digest[];
+};
+
+#define FS_IOC_ENABLE_VERITY	_IOW('f', 133, struct fsverity_enable_arg)
+#define FS_IOC_MEASURE_VERITY	_IOWR('f', 134, struct fsverity_digest)
+
+#endif /* _UAPI_LINUX_FSVERITY_H */
-- 
2.22.0


* [PATCH v6 04/17] fs: uapi: define verity bit for FS_IOC_GETFLAGS
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add FS_VERITY_FL to the flags for FS_IOC_GETFLAGS, so that applications
can easily determine whether a file is a verity file at the same time as
they're checking other file flags.  This flag will be gettable only;
FS_IOC_SETFLAGS won't allow setting it, since an ioctl must be used
instead to provide more parameters.

This flag matches the on-disk bit that was already allocated for ext4.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
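
For illustration only, a minimal sketch of how an application might
test the new flag through FS_IOC_GETFLAGS.  ex_is_verity_file() is a
hypothetical helper, and the fallback #define merely mirrors the value
added by this patch for builds against older headers.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef FS_VERITY_FL
#define FS_VERITY_FL 0x00100000	/* value added by this patch */
#endif

/*
 * Returns 1 if 'path' is a verity file, 0 if it is not,
 * or -errno if the file cannot be opened or queried.
 */
int ex_is_verity_file(const char *path)
{
	int fd, ret;
	int flags = 0;	/* the kernel reads and writes an int here */

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0)
		ret = -errno;	/* e.g. filesystem lacks the ioctl */
	else
		ret = (flags & FS_VERITY_FL) ? 1 : 0;

	close(fd);
	return ret;
}
```
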
 include/uapi/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 59c71fa8c553..df261b7e0587 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -306,6 +306,7 @@ struct fscrypt_key {
 #define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define FS_HUGE_FILE_FL			0x00040000 /* Reserved for ext4 */
 #define FS_EXTENT_FL			0x00080000 /* Extents */
+#define FS_VERITY_FL			0x00100000 /* Verity protected inode */
 #define FS_EA_INODE_FL			0x00200000 /* Inode used for large EA */
 #define FS_EOFBLOCKS_FL			0x00400000 /* Reserved for ext4 */
 #define FS_NOCOW_FL			0x00800000 /* Do not cow file */
-- 
2.22.0


* [PATCH v6 05/17] fs-verity: add Kconfig and the helper functions for hashing
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add the beginnings of the fs/verity/ support layer, including the
Kconfig option and various helper functions for hashing.  To start, only
SHA-256 is supported, but other hash algorithms can easily be added.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
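
As a sanity check of the FS_VERITY_MAX_LEVELS comment in
fsverity_private.h below, here is a small userspace sketch that counts
tree levels.  ex_num_levels() is a hypothetical helper, not the
kernel's code; it assumes the tree stops growing once a level fits in
one block.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of Merkle tree levels needed for
 * 'data_size' bytes, given log2(block size) and log2(hashes per block).
 */
unsigned int ex_num_levels(uint64_t data_size, unsigned int log_blocksize,
			   unsigned int log_arity)
{
	uint64_t blocks;
	unsigned int levels = 0;

	assert(log_blocksize < 32 && log_arity >= 1 && log_arity < 32);

	/* Number of data blocks, rounding up. */
	blocks = (data_size >> log_blocksize) +
		 ((data_size & ((1ULL << log_blocksize) - 1)) != 0);

	/* Each level has one hash per block of the level below. */
	while (blocks > 1) {
		blocks = (blocks + (1ULL << log_arity) - 1) >> log_arity;
		levels++;
	}
	return levels;
}
```

With SHA-256 and 4K blocks (log_blocksize = 12, log_arity = 7), a file
of U64_MAX bytes has 2^52 data blocks and needs exactly 8 levels, so 8
levels can cover up to 2^56 data blocks, i.e. 2^68 bytes.
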
 fs/Kconfig                   |   2 +
 fs/Makefile                  |   1 +
 fs/verity/Kconfig            |  38 +++++
 fs/verity/Makefile           |   4 +
 fs/verity/fsverity_private.h |  88 +++++++++++
 fs/verity/hash_algs.c        | 274 +++++++++++++++++++++++++++++++++++
 fs/verity/init.c             |  41 ++++++
 7 files changed, 448 insertions(+)
 create mode 100644 fs/verity/Kconfig
 create mode 100644 fs/verity/Makefile
 create mode 100644 fs/verity/fsverity_private.h
 create mode 100644 fs/verity/hash_algs.c
 create mode 100644 fs/verity/init.c

diff --git a/fs/Kconfig b/fs/Kconfig
index f1046cf6ad85..4b66dafbdc7b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -113,6 +113,8 @@ config MANDATORY_FILE_LOCKING
 
 source "fs/crypto/Kconfig"
 
+source "fs/verity/Kconfig"
+
 source "fs/notify/Kconfig"
 
 source "fs/quota/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index c9aea23aba56..fe7f2c07f482 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
+obj-$(CONFIG_FS_VERITY)		+= verity/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
 obj-$(CONFIG_BINFMT_AOUT)	+= binfmt_aout.o
diff --git a/fs/verity/Kconfig b/fs/verity/Kconfig
new file mode 100644
index 000000000000..c2bca0b01ecf
--- /dev/null
+++ b/fs/verity/Kconfig
@@ -0,0 +1,38 @@
+# SPDX-License-Identifier: GPL-2.0
+
+config FS_VERITY
+	bool "FS Verity (read-only file-based authenticity protection)"
+	select CRYPTO
+	# SHA-256 is selected as it's intended to be the default hash algorithm.
+	# To avoid bloat, other wanted algorithms must be selected explicitly.
+	select CRYPTO_SHA256
+	help
+	  This option enables fs-verity.  fs-verity is the dm-verity
+	  mechanism implemented at the file level.  On supported
+	  filesystems (currently EXT4 and F2FS), userspace can use an
+	  ioctl to enable verity for a file, which causes the filesystem
+	  to build a Merkle tree for the file.  The filesystem will then
+	  transparently verify any data read from the file against the
+	  Merkle tree.  The file is also made read-only.
+
+	  This serves as an integrity check, but the availability of the
+	  Merkle tree root hash also allows efficiently supporting
+	  various use cases where normally the whole file would need to
+	  be hashed at once, such as: (a) auditing (logging the file's
+	  hash), or (b) authenticity verification (comparing the hash
+	  against a known good value, e.g. from a digital signature).
+
+	  fs-verity is especially useful on large files where not all
+	  the contents may actually be needed.  Also, fs-verity verifies
+	  data each time it is paged back in, which provides better
+	  protection against malicious disks vs. an ahead-of-time hash.
+
+	  If unsure, say N.
+
+config FS_VERITY_DEBUG
+	bool "FS Verity debugging"
+	depends on FS_VERITY
+	help
+	  Enable debugging messages related to fs-verity by default.
+
+	  Say N unless you are an fs-verity developer.
diff --git a/fs/verity/Makefile b/fs/verity/Makefile
new file mode 100644
index 000000000000..398f3f85fa18
--- /dev/null
+++ b/fs/verity/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_FS_VERITY) += hash_algs.o \
+			   init.o
diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
new file mode 100644
index 000000000000..9697aaebb5dc
--- /dev/null
+++ b/fs/verity/fsverity_private.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * fs-verity: read-only file-based authenticity protection
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#ifndef _FSVERITY_PRIVATE_H
+#define _FSVERITY_PRIVATE_H
+
+#ifdef CONFIG_FS_VERITY_DEBUG
+#define DEBUG
+#endif
+
+#define pr_fmt(fmt) "fs-verity: " fmt
+
+#include <crypto/sha.h>
+#include <linux/fs.h>
+#include <uapi/linux/fsverity.h>
+
+struct ahash_request;
+
+/*
+ * Implementation limit: maximum depth of the Merkle tree.  For now 8 is plenty;
+ * it's enough for over U64_MAX bytes of data using SHA-256 and 4K blocks.
+ */
+#define FS_VERITY_MAX_LEVELS		8
+
+/*
+ * Largest digest size among all hash algorithms supported by fs-verity.
+ * Currently assumed to be <= size of fsverity_descriptor::root_hash.
+ */
+#define FS_VERITY_MAX_DIGEST_SIZE	SHA256_DIGEST_SIZE
+
+/* A hash algorithm supported by fs-verity */
+struct fsverity_hash_alg {
+	struct crypto_ahash *tfm; /* hash tfm, allocated on demand */
+	const char *name;	  /* crypto API name, e.g. sha256 */
+	unsigned int digest_size; /* digest size in bytes, e.g. 32 for SHA-256 */
+	unsigned int block_size;  /* block size in bytes, e.g. 64 for SHA-256 */
+};
+
+/* Merkle tree parameters: hash algorithm, initial hash state, and topology */
+struct merkle_tree_params {
+	const struct fsverity_hash_alg *hash_alg; /* the hash algorithm */
+	const u8 *hashstate;		/* initial hash state or NULL */
+	unsigned int digest_size;	/* same as hash_alg->digest_size */
+	unsigned int block_size;	/* size of data and tree blocks */
+	unsigned int hashes_per_block;	/* number of hashes per tree block */
+	unsigned int log_blocksize;	/* log2(block_size) */
+	unsigned int log_arity;		/* log2(hashes_per_block) */
+	unsigned int num_levels;	/* number of levels in Merkle tree */
+	u64 tree_size;			/* Merkle tree size in bytes */
+
+	/*
+	 * Starting block index for each tree level, ordered from leaf level (0)
+	 * to root level ('num_levels - 1')
+	 */
+	u64 level_start[FS_VERITY_MAX_LEVELS];
+};
+
+/* hash_algs.c */
+
+extern struct fsverity_hash_alg fsverity_hash_algs[];
+
+const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
+						      unsigned int num);
+const u8 *fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg,
+				      const u8 *salt, size_t salt_size);
+int fsverity_hash_page(const struct merkle_tree_params *params,
+		       const struct inode *inode,
+		       struct ahash_request *req, struct page *page, u8 *out);
+int fsverity_hash_buffer(const struct fsverity_hash_alg *alg,
+			 const void *data, size_t size, u8 *out);
+void __init fsverity_check_hash_algs(void);
+
+/* init.c */
+
+extern void __printf(3, 4) __cold
+fsverity_msg(const struct inode *inode, const char *level,
+	     const char *fmt, ...);
+
+#define fsverity_warn(inode, fmt, ...)		\
+	fsverity_msg((inode), KERN_WARNING, fmt, ##__VA_ARGS__)
+#define fsverity_err(inode, fmt, ...)		\
+	fsverity_msg((inode), KERN_ERR, fmt, ##__VA_ARGS__)
+
+#endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/hash_algs.c b/fs/verity/hash_algs.c
new file mode 100644
index 000000000000..c0457915ca10
--- /dev/null
+++ b/fs/verity/hash_algs.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/hash_algs.c: fs-verity hash algorithms
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <crypto/hash.h>
+#include <linux/scatterlist.h>
+
+/* The hash algorithms supported by fs-verity */
+struct fsverity_hash_alg fsverity_hash_algs[] = {
+	[FS_VERITY_HASH_ALG_SHA256] = {
+		.name = "sha256",
+		.digest_size = SHA256_DIGEST_SIZE,
+		.block_size = SHA256_BLOCK_SIZE,
+	},
+};
+
+/**
+ * fsverity_get_hash_alg() - validate and prepare a hash algorithm
+ * @inode: optional inode for logging purposes
+ * @num: the hash algorithm number
+ *
+ * Get the struct fsverity_hash_alg for the given hash algorithm number, and
+ * ensure it has a hash transform ready to go.  The hash transforms are
+ * allocated on-demand so that we don't waste resources unnecessarily, and
+ * because the crypto modules may be initialized later than fs/verity/.
+ *
+ * Return: pointer to the hash alg on success, else an ERR_PTR()
+ */
+const struct fsverity_hash_alg *fsverity_get_hash_alg(const struct inode *inode,
+						      unsigned int num)
+{
+	struct fsverity_hash_alg *alg;
+	struct crypto_ahash *tfm;
+	int err;
+
+	if (num >= ARRAY_SIZE(fsverity_hash_algs) ||
+	    !fsverity_hash_algs[num].name) {
+		fsverity_warn(inode, "Unknown hash algorithm number: %u", num);
+		return ERR_PTR(-EINVAL);
+	}
+	alg = &fsverity_hash_algs[num];
+
+	/* pairs with cmpxchg() below */
+	tfm = READ_ONCE(alg->tfm);
+	if (likely(tfm != NULL))
+		return alg;
+	/*
+	 * Using the shash API would make things a bit simpler, but the ahash
+	 * API is preferable as it allows the use of crypto accelerators.
+	 */
+	tfm = crypto_alloc_ahash(alg->name, 0, 0);
+	if (IS_ERR(tfm)) {
+		if (PTR_ERR(tfm) == -ENOENT)
+			fsverity_warn(inode,
+				      "Missing crypto API support for hash algorithm \"%s\"",
+				      alg->name);
+		else
+			fsverity_err(inode,
+				     "Error allocating hash algorithm \"%s\": %ld",
+				     alg->name, PTR_ERR(tfm));
+		return ERR_CAST(tfm);
+	}
+
+	err = -EINVAL;
+	if (WARN_ON(alg->digest_size != crypto_ahash_digestsize(tfm)))
+		goto err_free_tfm;
+	if (WARN_ON(alg->block_size != crypto_ahash_blocksize(tfm)))
+		goto err_free_tfm;
+
+	pr_info("%s using implementation \"%s\"\n",
+		alg->name, crypto_ahash_driver_name(tfm));
+
+	/* pairs with READ_ONCE() above */
+	if (cmpxchg(&alg->tfm, NULL, tfm) != NULL)
+		crypto_free_ahash(tfm);
+
+	return alg;
+
+err_free_tfm:
+	crypto_free_ahash(tfm);
+	return ERR_PTR(err);
+}
+
+/**
+ * fsverity_prepare_hash_state() - precompute the initial hash state
+ * @alg: hash algorithm
+ * @salt: a salt which is to be prepended to all data to be hashed
+ * @salt_size: salt size in bytes, possibly 0
+ *
+ * Return: NULL if the salt is empty, otherwise the kmalloc()'ed precomputed
+ *	   initial hash state on success or an ERR_PTR() on failure.
+ */
+const u8 *fsverity_prepare_hash_state(const struct fsverity_hash_alg *alg,
+				      const u8 *salt, size_t salt_size)
+{
+	u8 *hashstate = NULL;
+	struct ahash_request *req = NULL;
+	u8 *padded_salt = NULL;
+	size_t padded_salt_size;
+	struct scatterlist sg;
+	DECLARE_CRYPTO_WAIT(wait);
+	int err;
+
+	if (salt_size == 0)
+		return NULL;
+
+	hashstate = kmalloc(crypto_ahash_statesize(alg->tfm), GFP_KERNEL);
+	if (!hashstate)
+		return ERR_PTR(-ENOMEM);
+
+	req = ahash_request_alloc(alg->tfm, GFP_KERNEL);
+	if (!req) {
+		err = -ENOMEM;
+		goto err_free;
+	}
+
+	/*
+	 * Zero-pad the salt to the next multiple of the input size of the hash
+	 * algorithm's compression function, e.g. 64 bytes for SHA-256 or 128
+	 * bytes for SHA-512.  This ensures that the hash algorithm won't have
+	 * any bytes buffered internally after processing the salt, thus making
+	 * salted hashing just as fast as unsalted hashing.
+	 */
+	padded_salt_size = round_up(salt_size, alg->block_size);
+	padded_salt = kzalloc(padded_salt_size, GFP_KERNEL);
+	if (!padded_salt) {
+		err = -ENOMEM;
+		goto err_free;
+	}
+	memcpy(padded_salt, salt, salt_size);
+
+	sg_init_one(&sg, padded_salt, padded_salt_size);
+	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
+					CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+	ahash_request_set_crypt(req, &sg, NULL, padded_salt_size);
+
+	err = crypto_wait_req(crypto_ahash_init(req), &wait);
+	if (err)
+		goto err_free;
+
+	err = crypto_wait_req(crypto_ahash_update(req), &wait);
+	if (err)
+		goto err_free;
+
+	err = crypto_ahash_export(req, hashstate);
+	if (err)
+		goto err_free;
+out:
+	ahash_request_free(req);
+	kfree(padded_salt);
+	return hashstate;
+
+err_free:
+	kfree(hashstate);
+	hashstate = ERR_PTR(err);
+	goto out;
+}
+
+/**
+ * fsverity_hash_page() - hash a single data or hash page
+ * @params: the Merkle tree's parameters
+ * @inode: inode for which the hashing is being done
+ * @req: preallocated hash request
+ * @page: the page to hash
+ * @out: output digest, size 'params->digest_size' bytes
+ *
+ * Hash a single data or hash block, assuming block_size == PAGE_SIZE.
+ * The hash is salted if a salt is specified in the Merkle tree parameters.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_hash_page(const struct merkle_tree_params *params,
+		       const struct inode *inode,
+		       struct ahash_request *req, struct page *page, u8 *out)
+{
+	struct scatterlist sg;
+	DECLARE_CRYPTO_WAIT(wait);
+	int err;
+
+	if (WARN_ON(params->block_size != PAGE_SIZE))
+		return -EINVAL;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, page, PAGE_SIZE, 0);
+	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
+					CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+	ahash_request_set_crypt(req, &sg, out, PAGE_SIZE);
+
+	if (params->hashstate) {
+		err = crypto_ahash_import(req, params->hashstate);
+		if (err) {
+			fsverity_err(inode,
+				     "Error %d importing hash state", err);
+			return err;
+		}
+		err = crypto_ahash_finup(req);
+	} else {
+		err = crypto_ahash_digest(req);
+	}
+
+	err = crypto_wait_req(err, &wait);
+	if (err)
+		fsverity_err(inode, "Error %d computing page hash", err);
+	return err;
+}
+
+/**
+ * fsverity_hash_buffer() - hash some data
+ * @alg: the hash algorithm to use
+ * @data: the data to hash
+ * @size: size of data to hash, in bytes
+ * @out: output digest, size 'alg->digest_size' bytes
+ *
+ * Hash some data which is located in physically contiguous memory (i.e. memory
+ * allocated by kmalloc(), not by vmalloc()).  No salt is used.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_hash_buffer(const struct fsverity_hash_alg *alg,
+			 const void *data, size_t size, u8 *out)
+{
+	struct ahash_request *req;
+	struct scatterlist sg;
+	DECLARE_CRYPTO_WAIT(wait);
+	int err;
+
+	req = ahash_request_alloc(alg->tfm, GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	sg_init_one(&sg, data, size);
+	ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP |
+					CRYPTO_TFM_REQ_MAY_BACKLOG,
+				   crypto_req_done, &wait);
+	ahash_request_set_crypt(req, &sg, out, size);
+
+	err = crypto_wait_req(crypto_ahash_digest(req), &wait);
+
+	ahash_request_free(req);
+	return err;
+}
+
+void __init fsverity_check_hash_algs(void)
+{
+	size_t i;
+
+	/*
+	 * Sanity check the hash algorithms (could be a build-time check, but
+	 * they're in an array)
+	 */
+	for (i = 0; i < ARRAY_SIZE(fsverity_hash_algs); i++) {
+		const struct fsverity_hash_alg *alg = &fsverity_hash_algs[i];
+
+		if (!alg->name)
+			continue;
+
+		BUG_ON(alg->digest_size > FS_VERITY_MAX_DIGEST_SIZE);
+
+		/*
+		 * For efficiency, the implementation currently assumes the
+		 * digest and block sizes are powers of 2.  This limitation can
+		 * be lifted if the code is updated to handle other values.
+		 */
+		BUG_ON(!is_power_of_2(alg->digest_size));
+		BUG_ON(!is_power_of_2(alg->block_size));
+	}
+}
diff --git a/fs/verity/init.c b/fs/verity/init.c
new file mode 100644
index 000000000000..40076bbe452a
--- /dev/null
+++ b/fs/verity/init.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/init.c: fs-verity module initialization and logging
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <linux/ratelimit.h>
+
+void fsverity_msg(const struct inode *inode, const char *level,
+		  const char *fmt, ...)
+{
+	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+	struct va_format vaf;
+	va_list args;
+
+	if (!__ratelimit(&rs))
+		return;
+
+	va_start(args, fmt);
+	vaf.fmt = fmt;
+	vaf.va = &args;
+	if (inode)
+		printk("%sfs-verity (%s, inode %lu): %pV\n",
+		       level, inode->i_sb->s_id, inode->i_ino, &vaf);
+	else
+		printk("%sfs-verity: %pV\n", level, &vaf);
+	va_end(args);
+}
+
+static int __init fsverity_init(void)
+{
+	fsverity_check_hash_algs();
+
+	pr_debug("Initialized fs-verity\n");
+	return 0;
+}
+late_initcall(fsverity_init)
-- 
2.22.0
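
Illustration (not part of the patch): the salt handling in
fsverity_prepare_hash_state() above zero-pads the salt to a multiple of the
hash's block size before hashing it.  A minimal userspace sketch of just that
padding step follows, assuming a power-of-2 block size as enforced by
fsverity_check_hash_algs().  The names round_up_pow2() and pad_salt() are
hypothetical, and calloc() stands in for kzalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to a multiple of align; align must be a power of 2. */
static size_t round_up_pow2(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: return a zero-padded copy of the salt whose length
 * is a multiple of block_size, or NULL if the salt is empty or allocation
 * fails.  The padded length is stored in *padded_size.  Caller frees.
 */
unsigned char *pad_salt(const unsigned char *salt, size_t salt_size,
			size_t block_size, size_t *padded_size)
{
	unsigned char *padded;

	*padded_size = round_up_pow2(salt_size, block_size);
	if (*padded_size == 0)
		return NULL;
	padded = calloc(1, *padded_size);
	if (!padded)
		return NULL;
	memcpy(padded, salt, salt_size);
	return padded;
}
```

With SHA-256 (64-byte block size), a 3-byte salt is padded to 64 bytes, so
the compression function consumes it in exactly one step and nothing is left
buffered in the exported state.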

* [PATCH v6 06/17] fs-verity: add inode and superblock fields
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Analogous to fs/crypto/, add fields to the VFS inode and superblock for
use by the fs/verity/ support layer:

- ->s_vop: points to the fsverity_operations if the filesystem supports
  fs-verity, otherwise is NULL.

- ->i_verity_info: points to cached fs-verity information for the inode
  after someone opens it, otherwise is NULL.

- S_VERITY: bit in ->i_flags that identifies verity inodes, even when
  they haven't been opened yet and thus still have NULL ->i_verity_info.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
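Illustration (not part of the patch): the distinction drawn in the commit
message between S_VERITY and ->i_verity_info can be sketched in userspace as
below.  The flag values are copied from the diff; struct mock_inode and
verity_info_pending() are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define S_ENCRYPTED	16384	/* values as in the include/linux/fs.h hunk */
#define S_CASEFOLD	32768
#define S_VERITY	65536

struct fsverity_info;		/* opaque here, as in the patch */

/* Hypothetical stand-in for the relevant part of struct inode. */
struct mock_inode {
	unsigned int i_flags;
	struct fsverity_info *i_verity_info;
};

#define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)

/* True if this is a verity inode whose cached info is not loaded yet. */
bool verity_info_pending(const struct mock_inode *inode)
{
	assert(inode != NULL);
	return IS_VERITY(inode) && inode->i_verity_info == NULL;
}
```

The point of keeping both is that the flag identifies a verity inode cheaply,
while the pointer stays NULL until the first open.
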
 include/linux/fs.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d..a80a192cdcf2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -64,6 +64,8 @@ struct workqueue_struct;
 struct iov_iter;
 struct fscrypt_info;
 struct fscrypt_operations;
+struct fsverity_info;
+struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_description;
 
@@ -723,6 +725,10 @@ struct inode {
 	struct fscrypt_info	*i_crypt_info;
 #endif
 
+#ifdef CONFIG_FS_VERITY
+	struct fsverity_info	*i_verity_info;
+#endif
+
 	void			*i_private; /* fs or device private pointer */
 } __randomize_layout;
 
@@ -1429,6 +1435,9 @@ struct super_block {
 	const struct xattr_handler **s_xattr;
 #ifdef CONFIG_FS_ENCRYPTION
 	const struct fscrypt_operations	*s_cop;
+#endif
+#ifdef CONFIG_FS_VERITY
+	const struct fsverity_operations *s_vop;
 #endif
 	struct hlist_bl_head	s_roots;	/* alternate root dentries for NFS */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
@@ -1964,6 +1973,7 @@ struct super_operations {
 #endif
 #define S_ENCRYPTED	16384	/* Encrypted file (using fs/crypto/) */
 #define S_CASEFOLD	32768	/* Casefolded file */
+#define S_VERITY	65536	/* Verity file (using fs/verity/) */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2005,6 +2015,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
 #define IS_ENCRYPTED(inode)	((inode)->i_flags & S_ENCRYPTED)
 #define IS_CASEFOLDED(inode)	((inode)->i_flags & S_CASEFOLD)
+#define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
-- 
2.22.0

* [PATCH v6 07/17] fs-verity: add the hook for file ->open()
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add the fsverity_file_open() function, which prepares an fs-verity file
to be read from.  If not already done, it loads the fs-verity descriptor
from the filesystem and sets up an fsverity_info structure for the inode
which describes the Merkle tree and contains the file measurement.  It
also denies all attempts to open verity files for writing.

This commit also begins the include/linux/fsverity.h header, which
declares the interface between fs/verity/ and filesystems.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
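Illustration (not part of the patch): the Merkle tree layout that the
fsverity_info describes can be sketched in userspace as below.  It follows
the same two passes as fsverity_init_merkle_tree_params() in this patch, but
struct tree_layout and compute_tree_layout() are hypothetical names, and the
sketch does not guard against overflow for file sizes near U64_MAX.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEVELS 8	/* mirrors FS_VERITY_MAX_LEVELS */

struct tree_layout {
	unsigned int num_levels;
	uint64_t level_start[MAX_LEVELS]; /* first block index of each level */
	uint64_t tree_blocks;		  /* total number of tree blocks */
};

/*
 * Compute the layout for file_size bytes with 2^log_blocksize-byte blocks
 * and 2^log_arity hashes per block.  Returns 0, or -1 if the tree would
 * need more than MAX_LEVELS levels.
 */
int compute_tree_layout(struct tree_layout *t, uint64_t file_size,
			unsigned int log_blocksize, unsigned int log_arity)
{
	uint64_t block_size = UINT64_C(1) << log_blocksize;
	uint64_t arity = UINT64_C(1) << log_arity;
	uint64_t blocks, offset = 0;
	int level;

	assert(log_arity > 0 && log_arity < log_blocksize);
	t->num_levels = 0;

	/* Walk up from the data: each level hashes the blocks below it. */
	blocks = (file_size + block_size - 1) >> log_blocksize;
	while (blocks > 1) {
		if (t->num_levels >= MAX_LEVELS)
			return -1;
		blocks = (blocks + arity - 1) >> log_arity;
		/* temporarily holds the level's size in blocks */
		t->level_start[t->num_levels++] = blocks;
	}

	/* The root level is stored first, so assign offsets top-down. */
	for (level = (int)t->num_levels - 1; level >= 0; level--) {
		blocks = t->level_start[level];
		t->level_start[level] = offset;
		offset += blocks;
	}
	t->tree_blocks = offset;
	return 0;
}
```

For a 1 GiB file with 4096-byte blocks and SHA-256 (128 hashes per block),
this gives three levels of 1, 16 and 2048 blocks, starting at block indices
0, 1 and 17 respectively, for 2065 tree blocks in total.
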
 fs/verity/Makefile           |   3 +-
 fs/verity/fsverity_private.h |  54 +++++-
 fs/verity/init.c             |   6 +
 fs/verity/open.c             | 318 +++++++++++++++++++++++++++++++++++
 include/linux/fsverity.h     |  71 ++++++++
 5 files changed, 449 insertions(+), 3 deletions(-)
 create mode 100644 fs/verity/open.c
 create mode 100644 include/linux/fsverity.h

diff --git a/fs/verity/Makefile b/fs/verity/Makefile
index 398f3f85fa18..e6a8951c493a 100644
--- a/fs/verity/Makefile
+++ b/fs/verity/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-$(CONFIG_FS_VERITY) += hash_algs.o \
-			   init.o
+			   init.o \
+			   open.o
diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index 9697aaebb5dc..c79746ff335e 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -15,8 +15,7 @@
 #define pr_fmt(fmt) "fs-verity: " fmt
 
 #include <crypto/sha.h>
-#include <linux/fs.h>
-#include <uapi/linux/fsverity.h>
+#include <linux/fsverity.h>
 
 struct ahash_request;
 
@@ -59,6 +58,40 @@ struct merkle_tree_params {
 	u64 level_start[FS_VERITY_MAX_LEVELS];
 };
 
+/**
+ * struct fsverity_info - cached verity metadata for an inode
+ *
+ * When a verity file is first opened, an instance of this struct is allocated
+ * and stored in ->i_verity_info; it remains until the inode is evicted.  It
+ * caches information about the Merkle tree that's needed to efficiently verify
+ * data read from the file.  It also caches the file measurement.  The Merkle
+ * tree pages themselves are not cached here, but the filesystem may cache them.
+ */
+struct fsverity_info {
+	struct merkle_tree_params tree_params;
+	u8 root_hash[FS_VERITY_MAX_DIGEST_SIZE];
+	u8 measurement[FS_VERITY_MAX_DIGEST_SIZE];
+	const struct inode *inode;
+};
+
+/*
+ * Merkle tree properties.  The file measurement is the hash of this structure.
+ */
+struct fsverity_descriptor {
+	__u8 version;		/* must be 1 */
+	__u8 hash_algorithm;	/* Merkle tree hash algorithm */
+	__u8 log_blocksize;	/* log2 of size of data and tree blocks */
+	__u8 salt_size;		/* size of salt in bytes; 0 if none */
+	__le32 sig_size;	/* reserved, must be 0 */
+	__le64 data_size;	/* size of file the Merkle tree is built over */
+	__u8 root_hash[64];	/* Merkle tree root hash */
+	__u8 salt[32];		/* salt prepended to each hashed block */
+	__u8 __reserved[144];	/* must be 0's */
+};
+
+/* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
+#define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
+
 /* hash_algs.c */
 
 extern struct fsverity_hash_alg fsverity_hash_algs[];
@@ -85,4 +118,21 @@ fsverity_msg(const struct inode *inode, const char *level,
 #define fsverity_err(inode, fmt, ...)		\
 	fsverity_msg((inode), KERN_ERR, fmt, ##__VA_ARGS__)
 
+/* open.c */
+
+int fsverity_init_merkle_tree_params(struct merkle_tree_params *params,
+				     const struct inode *inode,
+				     unsigned int hash_algorithm,
+				     unsigned int log_blocksize,
+				     const u8 *salt, size_t salt_size);
+
+struct fsverity_info *fsverity_create_info(const struct inode *inode,
+					   const void *desc, size_t desc_size);
+
+void fsverity_set_info(struct inode *inode, struct fsverity_info *vi);
+
+void fsverity_free_info(struct fsverity_info *vi);
+
+int __init fsverity_init_info_cache(void);
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/init.c b/fs/verity/init.c
index 40076bbe452a..fff1fd634335 100644
--- a/fs/verity/init.c
+++ b/fs/verity/init.c
@@ -33,8 +33,14 @@ void fsverity_msg(const struct inode *inode, const char *level,
 
 static int __init fsverity_init(void)
 {
+	int err;
+
 	fsverity_check_hash_algs();
 
+	err = fsverity_init_info_cache();
+	if (err)
+		return err;
+
 	pr_debug("Initialized fs-verity\n");
 	return 0;
 }
diff --git a/fs/verity/open.c b/fs/verity/open.c
new file mode 100644
index 000000000000..8013f77f907e
--- /dev/null
+++ b/fs/verity/open.c
@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/open.c: opening fs-verity files
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <linux/slab.h>
+
+static struct kmem_cache *fsverity_info_cachep;
+
+/**
+ * fsverity_init_merkle_tree_params() - initialize Merkle tree parameters
+ * @params: the parameters struct to initialize
+ * @inode: the inode for which the Merkle tree is being built
+ * @hash_algorithm: number of hash algorithm to use
+ * @log_blocksize: log base 2 of block size to use
+ * @salt: pointer to salt (optional)
+ * @salt_size: size of salt, possibly 0
+ *
+ * Validate the hash algorithm and block size, then compute the tree topology
+ * (num levels, num blocks in each level, etc.) and initialize @params.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_init_merkle_tree_params(struct merkle_tree_params *params,
+				     const struct inode *inode,
+				     unsigned int hash_algorithm,
+				     unsigned int log_blocksize,
+				     const u8 *salt, size_t salt_size)
+{
+	const struct fsverity_hash_alg *hash_alg;
+	int err;
+	u64 blocks;
+	u64 offset;
+	int level;
+
+	memset(params, 0, sizeof(*params));
+
+	hash_alg = fsverity_get_hash_alg(inode, hash_algorithm);
+	if (IS_ERR(hash_alg))
+		return PTR_ERR(hash_alg);
+	params->hash_alg = hash_alg;
+	params->digest_size = hash_alg->digest_size;
+
+	params->hashstate = fsverity_prepare_hash_state(hash_alg, salt,
+							salt_size);
+	if (IS_ERR(params->hashstate)) {
+		err = PTR_ERR(params->hashstate);
+		params->hashstate = NULL;
+		fsverity_err(inode, "Error %d preparing hash state", err);
+		goto out_err;
+	}
+
+	if (log_blocksize != PAGE_SHIFT) {
+		fsverity_warn(inode, "Unsupported log_blocksize: %u",
+			      log_blocksize);
+		err = -EINVAL;
+		goto out_err;
+	}
+	params->log_blocksize = log_blocksize;
+	params->block_size = 1 << log_blocksize;
+
+	if (WARN_ON(!is_power_of_2(params->digest_size))) {
+		err = -EINVAL;
+		goto out_err;
+	}
+	if (params->block_size < 2 * params->digest_size) {
+		fsverity_warn(inode,
+			      "Merkle tree block size (%u) too small for hash algorithm \"%s\"",
+			      params->block_size, hash_alg->name);
+		err = -EINVAL;
+		goto out_err;
+	}
+	params->log_arity = params->log_blocksize - ilog2(params->digest_size);
+	params->hashes_per_block = 1 << params->log_arity;
+
+	pr_debug("Merkle tree uses %s with %u-byte blocks (%u hashes/block), salt=%*phN\n",
+		 hash_alg->name, params->block_size, params->hashes_per_block,
+		 (int)salt_size, salt);
+
+	/*
+	 * Compute the number of levels in the Merkle tree and create a map from
+	 * level to the starting block of that level.  Level 'num_levels - 1' is
+	 * the root and is stored first.  Level 0 is the level directly "above"
+	 * the data blocks and is stored last.
+	 */
+
+	/* Compute number of levels and the number of blocks in each level */
+	blocks = (inode->i_size + params->block_size - 1) >> log_blocksize;
+	pr_debug("Data is %lld bytes (%llu blocks)\n", inode->i_size, blocks);
+	while (blocks > 1) {
+		if (params->num_levels >= FS_VERITY_MAX_LEVELS) {
+			fsverity_err(inode, "Too many levels in Merkle tree");
+			err = -EINVAL;
+			goto out_err;
+		}
+		blocks = (blocks + params->hashes_per_block - 1) >>
+			 params->log_arity;
+		/* temporarily using level_start[] to store blocks in level */
+		params->level_start[params->num_levels++] = blocks;
+	}
+
+	/* Compute the starting block of each level */
+	offset = 0;
+	for (level = (int)params->num_levels - 1; level >= 0; level--) {
+		blocks = params->level_start[level];
+		params->level_start[level] = offset;
+		pr_debug("Level %d is %llu blocks starting at index %llu\n",
+			 level, blocks, offset);
+		offset += blocks;
+	}
+
+	params->tree_size = offset << log_blocksize;
+	return 0;
+
+out_err:
+	kfree(params->hashstate);
+	memset(params, 0, sizeof(*params));
+	return err;
+}
+
+/* Compute the file measurement by hashing the fsverity_descriptor. */
+static int compute_file_measurement(const struct fsverity_hash_alg *hash_alg,
+				    const struct fsverity_descriptor *desc,
+				    u8 *measurement)
+{
+	return fsverity_hash_buffer(hash_alg, desc, sizeof(*desc), measurement);
+}
+
+/*
+ * Validate the given fsverity_descriptor and create a new fsverity_info from
+ * it.
+ */
+struct fsverity_info *fsverity_create_info(const struct inode *inode,
+					   const void *_desc, size_t desc_size)
+{
+	const struct fsverity_descriptor *desc = _desc;
+	struct fsverity_info *vi;
+	int err;
+
+	if (desc_size < sizeof(*desc)) {
+		fsverity_err(inode, "Unrecognized descriptor size: %zu bytes",
+			     desc_size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (desc->version != 1) {
+		fsverity_err(inode, "Unrecognized descriptor version: %u",
+			     desc->version);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (desc->sig_size ||
+	    memchr_inv(desc->__reserved, 0, sizeof(desc->__reserved))) {
+		fsverity_err(inode, "Reserved bits set in descriptor");
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (desc->salt_size > sizeof(desc->salt)) {
+		fsverity_err(inode, "Invalid salt_size: %u", desc->salt_size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (le64_to_cpu(desc->data_size) != inode->i_size) {
+		fsverity_err(inode,
+			     "Wrong data_size: %llu (desc) != %lld (inode)",
+			     le64_to_cpu(desc->data_size), inode->i_size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	vi = kmem_cache_zalloc(fsverity_info_cachep, GFP_KERNEL);
+	if (!vi)
+		return ERR_PTR(-ENOMEM);
+	vi->inode = inode;
+
+	err = fsverity_init_merkle_tree_params(&vi->tree_params, inode,
+					       desc->hash_algorithm,
+					       desc->log_blocksize,
+					       desc->salt, desc->salt_size);
+	if (err) {
+		fsverity_err(inode,
+			     "Error %d initializing Merkle tree parameters",
+			     err);
+		goto out;
+	}
+
+	memcpy(vi->root_hash, desc->root_hash, vi->tree_params.digest_size);
+
+	err = compute_file_measurement(vi->tree_params.hash_alg, desc,
+				       vi->measurement);
+	if (err) {
+		fsverity_err(inode, "Error %d computing file measurement", err);
+		goto out;
+	}
+	pr_debug("Computed file measurement: %s:%*phN\n",
+		 vi->tree_params.hash_alg->name,
+		 vi->tree_params.digest_size, vi->measurement);
+out:
+	if (err) {
+		fsverity_free_info(vi);
+		vi = ERR_PTR(err);
+	}
+	return vi;
+}
+
+void fsverity_set_info(struct inode *inode, struct fsverity_info *vi)
+{
+	/*
+	 * Multiple processes may race to set ->i_verity_info, so use cmpxchg.
+	 * This pairs with the READ_ONCE() in fsverity_get_info().
+	 */
+	if (cmpxchg(&inode->i_verity_info, NULL, vi) != NULL)
+		fsverity_free_info(vi);
+}
+
+void fsverity_free_info(struct fsverity_info *vi)
+{
+	if (!vi)
+		return;
+	kfree(vi->tree_params.hashstate);
+	kmem_cache_free(fsverity_info_cachep, vi);
+}
+
+/* Ensure the inode has an ->i_verity_info */
+static int ensure_verity_info(struct inode *inode)
+{
+	struct fsverity_info *vi = fsverity_get_info(inode);
+	struct fsverity_descriptor *desc;
+	int res;
+
+	if (vi)
+		return 0;
+
+	res = inode->i_sb->s_vop->get_verity_descriptor(inode, NULL, 0);
+	if (res < 0) {
+		fsverity_err(inode,
+			     "Error %d getting verity descriptor size", res);
+		return res;
+	}
+	if (res > FS_VERITY_MAX_DESCRIPTOR_SIZE) {
+		fsverity_err(inode, "Verity descriptor is too large (%d bytes)",
+			     res);
+		return -EMSGSIZE;
+	}
+	desc = kmalloc(res, GFP_KERNEL);
+	if (!desc)
+		return -ENOMEM;
+	res = inode->i_sb->s_vop->get_verity_descriptor(inode, desc, res);
+	if (res < 0) {
+		fsverity_err(inode, "Error %d reading verity descriptor", res);
+		goto out_free_desc;
+	}
+
+	vi = fsverity_create_info(inode, desc, res);
+	if (IS_ERR(vi)) {
+		res = PTR_ERR(vi);
+		goto out_free_desc;
+	}
+
+	fsverity_set_info(inode, vi);
+	res = 0;
+out_free_desc:
+	kfree(desc);
+	return res;
+}
+
+/**
+ * fsverity_file_open() - prepare to open a verity file
+ * @inode: the inode being opened
+ * @filp: the struct file being set up
+ *
+ * When opening a verity file, deny the open if it is for writing.  Otherwise,
+ * set up the inode's ->i_verity_info if not already done.
+ *
+ * When combined with fscrypt, this must be called after fscrypt_file_open().
+ * Otherwise, we won't have the key set up to decrypt the verity metadata.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_file_open(struct inode *inode, struct file *filp)
+{
+	if (!IS_VERITY(inode))
+		return 0;
+
+	if (filp->f_mode & FMODE_WRITE) {
+		pr_debug("Denying opening verity file (ino %lu) for write\n",
+			 inode->i_ino);
+		return -EPERM;
+	}
+
+	return ensure_verity_info(inode);
+}
+EXPORT_SYMBOL_GPL(fsverity_file_open);
+
+/**
+ * fsverity_cleanup_inode() - free the inode's verity info, if present
+ *
+ * Filesystems must call this on inode eviction to free ->i_verity_info.
+ */
+void fsverity_cleanup_inode(struct inode *inode)
+{
+	fsverity_free_info(inode->i_verity_info);
+	inode->i_verity_info = NULL;
+}
+EXPORT_SYMBOL_GPL(fsverity_cleanup_inode);
+
+int __init fsverity_init_info_cache(void)
+{
+	fsverity_info_cachep = KMEM_CACHE_USERCOPY(fsverity_info,
+						   SLAB_RECLAIM_ACCOUNT,
+						   measurement);
+	if (!fsverity_info_cachep)
+		return -ENOMEM;
+	return 0;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
new file mode 100644
index 000000000000..09b04dab6452
--- /dev/null
+++ b/include/linux/fsverity.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * fs-verity: read-only file-based authenticity protection
+ *
+ * This header declares the interface between the fs/verity/ support layer and
+ * filesystems that support fs-verity.
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#ifndef _LINUX_FSVERITY_H
+#define _LINUX_FSVERITY_H
+
+#include <linux/fs.h>
+#include <uapi/linux/fsverity.h>
+
+/* Verity operations for filesystems */
+struct fsverity_operations {
+
+	/**
+	 * Get the verity descriptor of the given inode.
+	 *
+	 * @inode: an inode with the S_VERITY flag set
+	 * @buf: buffer in which to place the verity descriptor
+	 * @bufsize: size of @buf, or 0 to retrieve the size only
+	 *
+	 * If bufsize == 0, then the size of the verity descriptor is returned.
+	 * Otherwise the verity descriptor is written to 'buf' and its actual
+	 * size is returned; -ERANGE is returned if it's too large.  This may be
+	 * called by multiple processes concurrently on the same inode.
+	 *
+	 * Return: the size on success, -errno on failure
+	 */
+	int (*get_verity_descriptor)(struct inode *inode, void *buf,
+				     size_t bufsize);
+};
+
+#ifdef CONFIG_FS_VERITY
+
+static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
+{
+	/* pairs with the cmpxchg() in fsverity_set_info() */
+	return READ_ONCE(inode->i_verity_info);
+}
+
+/* open.c */
+
+extern int fsverity_file_open(struct inode *inode, struct file *filp);
+extern void fsverity_cleanup_inode(struct inode *inode);
+
+#else /* !CONFIG_FS_VERITY */
+
+static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
+{
+	return NULL;
+}
+
+/* open.c */
+
+static inline int fsverity_file_open(struct inode *inode, struct file *filp)
+{
+	return IS_VERITY(inode) ? -EOPNOTSUPP : 0;
+}
+
+static inline void fsverity_cleanup_inode(struct inode *inode)
+{
+}
+
+#endif	/* !CONFIG_FS_VERITY */
+
+#endif	/* _LINUX_FSVERITY_H */
-- 
2.22.0
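
Illustration (not part of the patch): the size-then-read contract of
->get_verity_descriptor(), as used by ensure_verity_info() above, can be
sketched in userspace as below.  struct mock_store, mock_get_descriptor()
and read_descriptor() are hypothetical; malloc() stands in for kmalloc().
Rejecting an empty descriptor up front is a simplification of this sketch
only, to avoid malloc(0).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DESCRIPTOR_SIZE 16384 /* mirrors FS_VERITY_MAX_DESCRIPTOR_SIZE */

/* Hypothetical backing store standing in for the filesystem's copy. */
struct mock_store {
	const void *data;
	size_t size;
};

/* Same contract as the callback: bufsize == 0 only queries the size. */
int mock_get_descriptor(const struct mock_store *s, void *buf, size_t bufsize)
{
	if (bufsize == 0)
		return (int)s->size;
	if (s->size > bufsize)
		return -ERANGE;
	memcpy(buf, s->data, s->size);
	return (int)s->size;
}

/* Caller side: query the size, bound it, allocate, then read. */
int read_descriptor(const struct mock_store *s, void **out)
{
	void *buf;
	int res;

	assert(s != NULL && out != NULL);

	res = mock_get_descriptor(s, NULL, 0);
	if (res < 0)
		return res;
	if (res == 0)
		return -EINVAL;	/* sketch-only simplification */
	if (res > MAX_DESCRIPTOR_SIZE)
		return -EMSGSIZE;
	buf = malloc((size_t)res);
	if (!buf)
		return -ENOMEM;
	res = mock_get_descriptor(s, buf, (size_t)res);
	if (res < 0) {
		free(buf);
		return res;
	}
	*out = buf;	/* caller frees */
	return res;
}
```

Bounding the size before allocating is what keeps a corrupted or malicious
on-disk size from driving an arbitrarily large allocation.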

* [PATCH v6 08/17] fs-verity: add the hook for file ->setattr()
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add a function fsverity_prepare_setattr() which filesystems that support
fs-verity must call to deny truncates of verity files.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
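Illustration (not part of the patch): the check performed by
fsverity_prepare_setattr() can be sketched in userspace as below.  The
structs and mock_prepare_setattr() are hypothetical stand-ins; the ATTR_*
values are meant to mirror include/linux/fs.h, and S_VERITY is the value
added earlier in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define S_VERITY	65536		/* from earlier in this series */
#define ATTR_MODE	(1 << 0)	/* mirrors include/linux/fs.h */
#define ATTR_SIZE	(1 << 3)

/* Hypothetical stand-ins for struct inode and struct iattr. */
struct mock_inode { unsigned int i_flags; };
struct mock_iattr { unsigned int ia_valid; };

/* Deny any size change on a verity inode; let other changes through. */
int mock_prepare_setattr(const struct mock_inode *inode,
			 const struct mock_iattr *attr)
{
	assert(inode != NULL && attr != NULL);

	if ((inode->i_flags & S_VERITY) && (attr->ia_valid & ATTR_SIZE))
		return -EPERM;
	return 0;
}
```

Only ATTR_SIZE is refused, so attribute changes such as chmod on a verity
file still go through this check unaffected.
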
 fs/verity/open.c         | 21 +++++++++++++++++++++
 include/linux/fsverity.h |  7 +++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/verity/open.c b/fs/verity/open.c
index 8013f77f907e..2cb2fe8082bf 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -295,6 +295,27 @@ int fsverity_file_open(struct inode *inode, struct file *filp)
 }
 EXPORT_SYMBOL_GPL(fsverity_file_open);
 
+/**
+ * fsverity_prepare_setattr() - prepare to change a verity inode's attributes
+ * @dentry: dentry through which the inode is being changed
+ * @attr: attributes to change
+ *
+ * Verity files are immutable, so deny truncates.  This isn't covered by the
+ * open-time check because sys_truncate() takes a path, not a file descriptor.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	if (IS_VERITY(d_inode(dentry)) && (attr->ia_valid & ATTR_SIZE)) {
+		pr_debug("Denying truncate of verity file (ino %lu)\n",
+			 d_inode(dentry)->i_ino);
+		return -EPERM;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fsverity_prepare_setattr);
+
 /**
  * fsverity_cleanup_inode() - free the inode's verity info, if present
  *
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 09b04dab6452..cbd0f84e1620 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -46,6 +46,7 @@ static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
 /* open.c */
 
 extern int fsverity_file_open(struct inode *inode, struct file *filp);
+extern int fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr);
 extern void fsverity_cleanup_inode(struct inode *inode);
 
 #else /* !CONFIG_FS_VERITY */
@@ -62,6 +63,12 @@ static inline int fsverity_file_open(struct inode *inode, struct file *filp)
 	return IS_VERITY(inode) ? -EOPNOTSUPP : 0;
 }
 
+static inline int fsverity_prepare_setattr(struct dentry *dentry,
+					   struct iattr *attr)
+{
+	return IS_VERITY(d_inode(dentry)) ? -EOPNOTSUPP : 0;
+}
+
 static inline void fsverity_cleanup_inode(struct inode *inode)
 {
 }
-- 
2.22.0
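
A minimal userspace sketch of the check this patch adds, for readers who want to see where it would sit in a filesystem. It is only a model: the model_* types and the MODEL_ATTR_* bit values are hypothetical stand-ins invented here, not kernel definitions, and the assumption is that a filesystem calls the check at the top of its ->setattr() before applying any change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's ATTR_* bits; values are arbitrary. */
#define MODEL_ATTR_MODE (1u << 0)
#define MODEL_ATTR_SIZE (1u << 1)

/* Hypothetical stand-ins for struct inode and struct iattr. */
struct model_inode {
	bool is_verity;		/* models IS_VERITY(inode) */
	long size;
	unsigned int mode;
};

struct model_iattr {
	unsigned int ia_valid;	/* which of the fields below are being changed */
	long ia_size;
	unsigned int ia_mode;
};

/* Mirrors the patch: a size change on a verity file is refused with -EPERM. */
static int model_prepare_setattr(const struct model_inode *inode,
				 const struct model_iattr *attr)
{
	if (inode->is_verity && (attr->ia_valid & MODEL_ATTR_SIZE))
		return -EPERM;
	return 0;
}

/* A hypothetical filesystem ->setattr(): run the verity check first. */
static int model_fs_setattr(struct model_inode *inode,
			    const struct model_iattr *attr)
{
	int err;

	assert(inode && attr);
	err = model_prepare_setattr(inode, attr);
	if (err)
		return err;
	if (attr->ia_valid & MODEL_ATTR_SIZE)
		inode->size = attr->ia_size;
	if (attr->ia_valid & MODEL_ATTR_MODE)
		inode->mode = attr->ia_mode;
	return 0;
}
```

In this model a truncate of a verity inode fails with -EPERM and leaves the size untouched, while a mode-only change still goes through, which matches the patch testing only ATTR_SIZE.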

^ permalink raw reply related

* [PATCH v6 09/17] fs-verity: add data verification hooks for ->readpages()
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add functions that verify data pages that have been read from a
fs-verity file, against that file's Merkle tree.  These will be called
from filesystems' ->readpage() and ->readpages() methods.

Since data verification can block, a workqueue is provided for these
methods to enqueue verification work from their bio completion callback.

See the "Verifying data" section of
Documentation/filesystems/fsverity.rst for more information.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/verity/Makefile           |   3 +-
 fs/verity/fsverity_private.h |   5 +
 fs/verity/init.c             |   8 +
 fs/verity/open.c             |   6 +
 fs/verity/verify.c           | 275 +++++++++++++++++++++++++++++++++++
 include/linux/fsverity.h     |  56 +++++++
 6 files changed, 352 insertions(+), 1 deletion(-)
 create mode 100644 fs/verity/verify.c

diff --git a/fs/verity/Makefile b/fs/verity/Makefile
index e6a8951c493a..7fa628cd5eba 100644
--- a/fs/verity/Makefile
+++ b/fs/verity/Makefile
@@ -2,4 +2,5 @@
 
 obj-$(CONFIG_FS_VERITY) += hash_algs.o \
 			   init.o \
-			   open.o
+			   open.o \
+			   verify.o
diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index c79746ff335e..eaa2b3b93bbf 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -134,5 +134,10 @@ void fsverity_set_info(struct inode *inode, struct fsverity_info *vi);
 void fsverity_free_info(struct fsverity_info *vi);
 
 int __init fsverity_init_info_cache(void);
+void __init fsverity_exit_info_cache(void);
+
+/* verify.c */
+
+int __init fsverity_init_workqueue(void);
 
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/init.c b/fs/verity/init.c
index fff1fd634335..b593805aafcc 100644
--- a/fs/verity/init.c
+++ b/fs/verity/init.c
@@ -41,7 +41,15 @@ static int __init fsverity_init(void)
 	if (err)
 		return err;
 
+	err = fsverity_init_workqueue();
+	if (err)
+		goto err_exit_info_cache;
+
 	pr_debug("Initialized fs-verity\n");
 	return 0;
+
+err_exit_info_cache:
+	fsverity_exit_info_cache();
+	return err;
 }
 late_initcall(fsverity_init)
diff --git a/fs/verity/open.c b/fs/verity/open.c
index 2cb2fe8082bf..3636a1ed8e2c 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -337,3 +337,9 @@ int __init fsverity_init_info_cache(void)
 		return -ENOMEM;
 	return 0;
 }
+
+void __init fsverity_exit_info_cache(void)
+{
+	kmem_cache_destroy(fsverity_info_cachep);
+	fsverity_info_cachep = NULL;
+}
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
new file mode 100644
index 000000000000..62ab8f6a8ea1
--- /dev/null
+++ b/fs/verity/verify.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/verify.c: data verification functions, i.e. hooks for ->readpages()
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <crypto/hash.h>
+#include <linux/bio.h>
+#include <linux/ratelimit.h>
+
+static struct workqueue_struct *fsverity_read_workqueue;
+
+/**
+ * hash_at_level() - compute the location of the block's hash at the given level
+ *
+ * @params:	(in) the Merkle tree parameters
+ * @dindex:	(in) the index of the data block being verified
+ * @level:	(in) the level of hash we want (0 is leaf level)
+ * @hindex:	(out) the index of the hash block containing the wanted hash
+ * @hoffset:	(out) the byte offset to the wanted hash within the hash block
+ */
+static void hash_at_level(const struct merkle_tree_params *params,
+			  pgoff_t dindex, unsigned int level, pgoff_t *hindex,
+			  unsigned int *hoffset)
+{
+	pgoff_t position;
+
+	/* Offset of the hash within the level's region, in hashes */
+	position = dindex >> (level * params->log_arity);
+
+	/* Index of the hash block in the tree overall */
+	*hindex = params->level_start[level] + (position >> params->log_arity);
+
+	/* Offset of the wanted hash (in bytes) within the hash block */
+	*hoffset = (position & ((1 << params->log_arity) - 1)) <<
+		   (params->log_blocksize - params->log_arity);
+}
+
+/* Extract a hash from a hash page */
+static void extract_hash(struct page *hpage, unsigned int hoffset,
+			 unsigned int hsize, u8 *out)
+{
+	void *virt = kmap_atomic(hpage);
+
+	memcpy(out, virt + hoffset, hsize);
+	kunmap_atomic(virt);
+}
+
+static inline int cmp_hashes(const struct fsverity_info *vi,
+			     const u8 *want_hash, const u8 *real_hash,
+			     pgoff_t index, int level)
+{
+	const unsigned int hsize = vi->tree_params.digest_size;
+
+	if (memcmp(want_hash, real_hash, hsize) == 0)
+		return 0;
+
+	fsverity_err(vi->inode,
+		     "FILE CORRUPTED! index=%lu, level=%d, want_hash=%s:%*phN, real_hash=%s:%*phN",
+		     index, level,
+		     vi->tree_params.hash_alg->name, hsize, want_hash,
+		     vi->tree_params.hash_alg->name, hsize, real_hash);
+	return -EBADMSG;
+}
+
+/*
+ * Verify a single data page against the file's Merkle tree.
+ *
+ * In principle, we need to verify the entire path to the root node.  However,
+ * for efficiency the filesystem may cache the hash pages.  Therefore we need
+ * only ascend the tree until an already-verified page is seen, as indicated by
+ * the PageChecked bit being set; then verify the path to that page.
+ *
+ * This code currently only supports the case where the verity block size is
+ * equal to PAGE_SIZE.  Doing otherwise would be possible but tricky, since we
+ * wouldn't be able to use the PageChecked bit.
+ *
+ * Note that multiple processes may race to verify a hash page and mark it
+ * Checked, but it doesn't matter; the result will be the same either way.
+ *
+ * Return: true if the page is valid, else false.
+ */
+static bool verify_page(struct inode *inode, const struct fsverity_info *vi,
+			struct ahash_request *req, struct page *data_page)
+{
+	const struct merkle_tree_params *params = &vi->tree_params;
+	const unsigned int hsize = params->digest_size;
+	const pgoff_t index = data_page->index;
+	int level;
+	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
+	const u8 *want_hash;
+	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
+	struct page *hpages[FS_VERITY_MAX_LEVELS];
+	unsigned int hoffsets[FS_VERITY_MAX_LEVELS];
+	int err;
+
+	if (WARN_ON_ONCE(!PageLocked(data_page) || PageUptodate(data_page)))
+		return false;
+
+	pr_debug_ratelimited("Verifying data page %lu...\n", index);
+
+	/*
+	 * Starting at the leaf level, ascend the tree saving hash pages along
+	 * the way until we find a verified hash page, indicated by PageChecked;
+	 * or until we reach the root.
+	 */
+	for (level = 0; level < params->num_levels; level++) {
+		pgoff_t hindex;
+		unsigned int hoffset;
+		struct page *hpage;
+
+		hash_at_level(params, index, level, &hindex, &hoffset);
+
+		pr_debug_ratelimited("Level %d: hindex=%lu, hoffset=%u\n",
+				     level, hindex, hoffset);
+
+		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
+								  hindex);
+		if (IS_ERR(hpage)) {
+			err = PTR_ERR(hpage);
+			fsverity_err(inode,
+				     "Error %d reading Merkle tree page %lu",
+				     err, hindex);
+			goto out;
+		}
+
+		if (PageChecked(hpage)) {
+			extract_hash(hpage, hoffset, hsize, _want_hash);
+			want_hash = _want_hash;
+			put_page(hpage);
+			pr_debug_ratelimited("Hash page already checked, want %s:%*phN\n",
+					     params->hash_alg->name,
+					     hsize, want_hash);
+			goto descend;
+		}
+		pr_debug_ratelimited("Hash page not yet checked\n");
+		hpages[level] = hpage;
+		hoffsets[level] = hoffset;
+	}
+
+	want_hash = vi->root_hash;
+	pr_debug("Want root hash: %s:%*phN\n",
+		 params->hash_alg->name, hsize, want_hash);
+descend:
+	/* Descend the tree verifying hash pages */
+	for (; level > 0; level--) {
+		struct page *hpage = hpages[level - 1];
+		unsigned int hoffset = hoffsets[level - 1];
+
+		err = fsverity_hash_page(params, inode, req, hpage, real_hash);
+		if (err)
+			goto out;
+		err = cmp_hashes(vi, want_hash, real_hash, index, level - 1);
+		if (err)
+			goto out;
+		SetPageChecked(hpage);
+		extract_hash(hpage, hoffset, hsize, _want_hash);
+		want_hash = _want_hash;
+		put_page(hpage);
+		pr_debug("Verified hash page at level %d, now want %s:%*phN\n",
+			 level - 1, params->hash_alg->name, hsize, want_hash);
+	}
+
+	/* Finally, verify the data page */
+	err = fsverity_hash_page(params, inode, req, data_page, real_hash);
+	if (err)
+		goto out;
+	err = cmp_hashes(vi, want_hash, real_hash, index, -1);
+out:
+	for (; level > 0; level--)
+		put_page(hpages[level - 1]);
+
+	return err == 0;
+}
+
+/**
+ * fsverity_verify_page() - verify a data page
+ *
+ * Verify a page that has just been read from a verity file.  The page must be a
+ * pagecache page that is still locked and not yet uptodate.
+ *
+ * Return: true if the page is valid, else false.
+ */
+bool fsverity_verify_page(struct page *page)
+{
+	struct inode *inode = page->mapping->host;
+	const struct fsverity_info *vi = inode->i_verity_info;
+	struct ahash_request *req;
+	bool valid;
+
+	req = ahash_request_alloc(vi->tree_params.hash_alg->tfm, GFP_NOFS);
+	if (unlikely(!req))
+		return false;
+
+	valid = verify_page(inode, vi, req, page);
+
+	ahash_request_free(req);
+
+	return valid;
+}
+EXPORT_SYMBOL_GPL(fsverity_verify_page);
+
+#ifdef CONFIG_BLOCK
+/**
+ * fsverity_verify_bio() - verify a 'read' bio that has just completed
+ *
+ * Verify a set of pages that have just been read from a verity file.  The pages
+ * must be pagecache pages that are still locked and not yet uptodate.  Pages
+ * that fail verification are set to the Error state.  Verification is skipped
+ * for pages already in the Error state, e.g. due to fscrypt decryption failure.
+ *
+ * This is a helper function for use by the ->readpages() method of filesystems
+ * that issue bios to read data directly into the page cache.  Filesystems that
+ * populate the page cache without issuing bios (e.g. non block-based
+ * filesystems) must instead call fsverity_verify_page() directly on each page.
+ * All filesystems must also call fsverity_verify_page() on holes.
+ */
+void fsverity_verify_bio(struct bio *bio)
+{
+	struct inode *inode = bio_first_page_all(bio)->mapping->host;
+	const struct fsverity_info *vi = inode->i_verity_info;
+	struct ahash_request *req;
+	struct bio_vec *bv;
+	struct bvec_iter_all iter_all;
+
+	req = ahash_request_alloc(vi->tree_params.hash_alg->tfm, GFP_NOFS);
+	if (unlikely(!req)) {
+		bio_for_each_segment_all(bv, bio, iter_all)
+			SetPageError(bv->bv_page);
+		return;
+	}
+
+	bio_for_each_segment_all(bv, bio, iter_all) {
+		struct page *page = bv->bv_page;
+
+		if (!PageError(page) && !verify_page(inode, vi, req, page))
+			SetPageError(page);
+	}
+
+	ahash_request_free(req);
+}
+EXPORT_SYMBOL_GPL(fsverity_verify_bio);
+#endif /* CONFIG_BLOCK */
+
+/**
+ * fsverity_enqueue_verify_work() - enqueue work on the fs-verity workqueue
+ *
+ * Enqueue verification work for asynchronous processing.
+ */
+void fsverity_enqueue_verify_work(struct work_struct *work)
+{
+	queue_work(fsverity_read_workqueue, work);
+}
+EXPORT_SYMBOL_GPL(fsverity_enqueue_verify_work);
+
+int __init fsverity_init_workqueue(void)
+{
+	/*
+	 * Use an unbound workqueue to allow bios to be verified in parallel
+	 * even when they happen to complete on the same CPU.  This sacrifices
+	 * locality, but it's worthwhile since hashing is CPU-intensive.
+	 *
+	 * Also use a high-priority workqueue to prioritize verification work,
+	 * which blocks reads from completing, over regular application tasks.
+	 */
+	fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
+						  WQ_UNBOUND | WQ_HIGHPRI,
+						  num_online_cpus());
+	if (!fsverity_read_workqueue)
+		return -ENOMEM;
+	return 0;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index cbd0f84e1620..95c257cd7ff0 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -33,6 +33,23 @@ struct fsverity_operations {
 	 */
 	int (*get_verity_descriptor)(struct inode *inode, void *buf,
 				     size_t bufsize);
+
+	/**
+	 * Read a Merkle tree page of the given inode.
+	 *
+	 * @inode: the inode
+	 * @index: 0-based index of the page within the Merkle tree
+	 *
+	 * This can be called at any time on an open verity file, as well as
+	 * between ->begin_enable_verity() and ->end_enable_verity().  It may be
+	 * called by multiple processes concurrently, even with the same page.
+	 *
+	 * Note that this must retrieve a *page*, not necessarily a *block*.
+	 *
+	 * Return: the page on success, ERR_PTR() on failure
+	 */
+	struct page *(*read_merkle_tree_page)(struct inode *inode,
+					      pgoff_t index);
 };
 
 #ifdef CONFIG_FS_VERITY
@@ -49,6 +66,12 @@ extern int fsverity_file_open(struct inode *inode, struct file *filp);
 extern int fsverity_prepare_setattr(struct dentry *dentry, struct iattr *attr);
 extern void fsverity_cleanup_inode(struct inode *inode);
 
+/* verify.c */
+
+extern bool fsverity_verify_page(struct page *page);
+extern void fsverity_verify_bio(struct bio *bio);
+extern void fsverity_enqueue_verify_work(struct work_struct *work);
+
 #else /* !CONFIG_FS_VERITY */
 
 static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
@@ -73,6 +96,39 @@ static inline void fsverity_cleanup_inode(struct inode *inode)
 {
 }
 
+/* verify.c */
+
+static inline bool fsverity_verify_page(struct page *page)
+{
+	WARN_ON(1);
+	return false;
+}
+
+static inline void fsverity_verify_bio(struct bio *bio)
+{
+	WARN_ON(1);
+}
+
+static inline void fsverity_enqueue_verify_work(struct work_struct *work)
+{
+	WARN_ON(1);
+}
+
 #endif	/* !CONFIG_FS_VERITY */
 
+/**
+ * fsverity_active() - do reads from the inode need to go through fs-verity?
+ *
+ * This checks whether ->i_verity_info has been set.
+ *
+ * Filesystems call this from ->readpages() to check whether the pages need to
+ * be verified or not.  Don't use IS_VERITY() for this purpose; it's subject to
+ * a race condition where the file is being read concurrently with
+ * FS_IOC_ENABLE_VERITY completing.  (S_VERITY is set before ->i_verity_info.)
+ */
+static inline bool fsverity_active(const struct inode *inode)
+{
+	return fsverity_get_info(inode) != NULL;
+}
+
 #endif	/* _LINUX_FSVERITY_H */
-- 
2.22.0
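
The index arithmetic in hash_at_level() is easier to follow with concrete numbers, so here is a userspace sketch of the same computation on plain integers. The struct and function names are hypothetical stand-ins, and the level_start[] values used in the note below are an illustrative assumption about the tree layout rather than something taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_LEVELS 8	/* arbitrary bound for this sketch */

/* Hypothetical subset of the Merkle tree parameters used by hash_at_level(). */
struct model_tree_params {
	unsigned int log_blocksize;	/* log2(hash block size in bytes) */
	unsigned int log_arity;		/* log2(hashes per hash block) */
	unsigned int num_levels;
	uint64_t level_start[MODEL_MAX_LEVELS];	/* first block of each level */
};

/* Same arithmetic as hash_at_level() in the patch, on plain integers. */
static void model_hash_at_level(const struct model_tree_params *params,
				uint64_t dindex, unsigned int level,
				uint64_t *hindex, unsigned int *hoffset)
{
	uint64_t position;

	assert(level < params->num_levels);

	/* Offset of the hash within the level's region, in hashes */
	position = dindex >> (level * params->log_arity);

	/* Index of the hash block in the tree overall */
	*hindex = params->level_start[level] + (position >> params->log_arity);

	/* Byte offset of the wanted hash within that hash block */
	*hoffset = (unsigned int)(position & ((1u << params->log_arity) - 1)) <<
		   (params->log_blocksize - params->log_arity);
}
```

Assuming 4096-byte blocks and 32-byte digests (log_blocksize = 12, log_arity = 7), three levels, and level_start = {3, 1, 0}, data block 12345 maps to hash block 99 at byte offset 1824 on level 0, hash block 1 at offset 3072 on level 1, and hash block 0 at offset 0 on level 2.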

^ permalink raw reply related

* [PATCH v6 10/17] fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add a function for filesystems to call to implement the
FS_IOC_ENABLE_VERITY ioctl.  This ioctl enables fs-verity on a file.

See the "FS_IOC_ENABLE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.

Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/verity/Makefile       |   3 +-
 fs/verity/enable.c       | 341 +++++++++++++++++++++++++++++++++++++++
 include/linux/fsverity.h |  64 ++++++++
 3 files changed, 407 insertions(+), 1 deletion(-)
 create mode 100644 fs/verity/enable.c

diff --git a/fs/verity/Makefile b/fs/verity/Makefile
index 7fa628cd5eba..04b37475fd28 100644
--- a/fs/verity/Makefile
+++ b/fs/verity/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-$(CONFIG_FS_VERITY) += hash_algs.o \
+obj-$(CONFIG_FS_VERITY) += enable.o \
+			   hash_algs.o \
 			   init.o \
 			   open.o \
 			   verify.o
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
new file mode 100644
index 000000000000..782b2911463e
--- /dev/null
+++ b/fs/verity/enable.c
@@ -0,0 +1,341 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/enable.c: ioctl to enable verity on a file
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <crypto/hash.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/sched/signal.h>
+#include <linux/uaccess.h>
+
+static int build_merkle_tree_level(struct inode *inode, unsigned int level,
+				   u64 num_blocks_to_hash,
+				   const struct merkle_tree_params *params,
+				   u8 *pending_hashes,
+				   struct ahash_request *req)
+{
+	const struct fsverity_operations *vops = inode->i_sb->s_vop;
+	unsigned int pending_size = 0;
+	u64 dst_block_num;
+	u64 i;
+	int err;
+
+	if (WARN_ON(params->block_size != PAGE_SIZE)) /* checked earlier too */
+		return -EINVAL;
+
+	if (level < params->num_levels) {
+		dst_block_num = params->level_start[level];
+	} else {
+		if (WARN_ON(num_blocks_to_hash != 1))
+			return -EINVAL;
+		dst_block_num = 0; /* unused */
+	}
+
+	for (i = 0; i < num_blocks_to_hash; i++) {
+		struct page *src_page;
+
+		if ((pgoff_t)i % 10000 == 0 || i + 1 == num_blocks_to_hash)
+			pr_debug("Hashing block %llu of %llu for level %u\n",
+				 i + 1, num_blocks_to_hash, level);
+
+		if (level == 0)
+			/* Leaf: hashing a data block */
+			src_page = read_mapping_page(inode->i_mapping, i, NULL);
+		else
+			/* Non-leaf: hashing hash block from level below */
+			src_page = vops->read_merkle_tree_page(inode,
+					params->level_start[level - 1] + i);
+		if (IS_ERR(src_page)) {
+			err = PTR_ERR(src_page);
+			fsverity_err(inode,
+				     "Error %d reading source page %llu for level %u",
+				     err, i, level);
+			return err;
+		}
+
+		err = fsverity_hash_page(params, inode, req, src_page,
+					 &pending_hashes[pending_size]);
+		put_page(src_page);
+		if (err)
+			return err;
+		pending_size += params->digest_size;
+
+		if (level == params->num_levels) /* Root hash? */
+			return 0;
+
+		if (pending_size + params->digest_size > params->block_size ||
+		    i + 1 == num_blocks_to_hash) {
+			/* Flush the pending hash block */
+			memset(&pending_hashes[pending_size], 0,
+			       params->block_size - pending_size);
+			err = vops->write_merkle_tree_block(inode,
+					pending_hashes,
+					dst_block_num,
+					params->log_blocksize);
+			if (err) {
+				fsverity_err(inode,
+					     "Error %d writing Merkle tree block %llu",
+					     err, dst_block_num);
+				return err;
+			}
+			dst_block_num++;
+			pending_size = 0;
+		}
+
+		if (fatal_signal_pending(current))
+			return -EINTR;
+		cond_resched();
+	}
+	return 0;
+}
+
+/*
+ * Build the Merkle tree for the given inode using the given parameters, and
+ * return the root hash in @root_hash.
+ *
+ * The tree is written to a filesystem-specific location as determined by the
+ * ->write_merkle_tree_block() method.  However, the blocks that comprise the
+ * tree are the same for all filesystems.
+ */
+static int build_merkle_tree(struct inode *inode,
+			     const struct merkle_tree_params *params,
+			     u8 *root_hash)
+{
+	u8 *pending_hashes;
+	struct ahash_request *req;
+	u64 blocks;
+	unsigned int level;
+	int err = -ENOMEM;
+
+	if (inode->i_size == 0) {
+		/* Empty file is a special case; root hash is all 0's */
+		memset(root_hash, 0, params->digest_size);
+		return 0;
+	}
+
+	pending_hashes = kmalloc(params->block_size, GFP_KERNEL);
+	req = ahash_request_alloc(params->hash_alg->tfm, GFP_KERNEL);
+	if (!pending_hashes || !req)
+		goto out;
+
+	/*
+	 * Build each level of the Merkle tree, starting at the leaf level
+	 * (level 0) and ascending to the root node (level 'num_levels - 1').
+	 * Then at the end (level 'num_levels'), calculate the root hash.
+	 */
+	blocks = (inode->i_size + params->block_size - 1) >>
+		 params->log_blocksize;
+	for (level = 0; level <= params->num_levels; level++) {
+		err = build_merkle_tree_level(inode, level, blocks, params,
+					      pending_hashes, req);
+		if (err)
+			goto out;
+		blocks = (blocks + params->hashes_per_block - 1) >>
+			 params->log_arity;
+	}
+	memcpy(root_hash, pending_hashes, params->digest_size);
+	err = 0;
+out:
+	kfree(pending_hashes);
+	ahash_request_free(req);
+	return err;
+}
+
+static int enable_verity(struct file *filp,
+			 const struct fsverity_enable_arg *arg)
+{
+	struct inode *inode = file_inode(filp);
+	const struct fsverity_operations *vops = inode->i_sb->s_vop;
+	struct merkle_tree_params params = { };
+	struct fsverity_descriptor *desc;
+	size_t desc_size = sizeof(*desc);
+	struct fsverity_info *vi;
+	int err;
+
+	/* Start initializing the fsverity_descriptor */
+	desc = kzalloc(desc_size, GFP_KERNEL);
+	if (!desc)
+		return -ENOMEM;
+	desc->version = 1;
+	desc->hash_algorithm = arg->hash_algorithm;
+	desc->log_blocksize = ilog2(arg->block_size);
+
+	/* Get the salt if the user provided one */
+	if (arg->salt_size &&
+	    copy_from_user(desc->salt,
+			   (const u8 __user *)(uintptr_t)arg->salt_ptr,
+			   arg->salt_size)) {
+		err = -EFAULT;
+		goto out;
+	}
+	desc->salt_size = arg->salt_size;
+
+	desc->data_size = cpu_to_le64(inode->i_size);
+
+	pr_debug("Building Merkle tree...\n");
+
+	/* Prepare the Merkle tree parameters */
+	err = fsverity_init_merkle_tree_params(&params, inode,
+					       arg->hash_algorithm,
+					       desc->log_blocksize,
+					       desc->salt, desc->salt_size);
+	if (err)
+		goto out;
+
+	/* Tell the filesystem that verity is being enabled on the file */
+	err = vops->begin_enable_verity(filp);
+	if (err)
+		goto out;
+
+	/* Build the Merkle tree */
+	BUILD_BUG_ON(sizeof(desc->root_hash) < FS_VERITY_MAX_DIGEST_SIZE);
+	err = build_merkle_tree(inode, &params, desc->root_hash);
+	if (err) {
+		fsverity_err(inode, "Error %d building Merkle tree", err);
+		goto rollback;
+	}
+	pr_debug("Done building Merkle tree.  Root hash is %s:%*phN\n",
+		 params.hash_alg->name, params.digest_size, desc->root_hash);
+
+	/*
+	 * Create the fsverity_info.  Don't bother trying to save work by
+	 * reusing the merkle_tree_params from above.  Instead, just create the
+	 * fsverity_info from the fsverity_descriptor as if it were just loaded
+	 * from disk.  This is simpler, and it serves as an extra check that the
+	 * metadata we're writing is valid before actually enabling verity.
+	 */
+	vi = fsverity_create_info(inode, desc, desc_size);
+	if (IS_ERR(vi)) {
+		err = PTR_ERR(vi);
+		goto rollback;
+	}
+
+	/* Tell the filesystem to finish enabling verity on the file */
+	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size);
+	if (err) {
+		fsverity_err(inode, "%ps() failed with err %d",
+			     vops->end_enable_verity, err);
+		fsverity_free_info(vi);
+	} else if (WARN_ON(!IS_VERITY(inode))) {
+		err = -EINVAL;
+		fsverity_free_info(vi);
+	} else {
+		/* Successfully enabled verity */
+
+		/*
+		 * Readers can start using ->i_verity_info immediately, so it
+		 * can't be rolled back once set.  So don't set it until just
+		 * after the filesystem has successfully enabled verity.
+		 */
+		fsverity_set_info(inode, vi);
+	}
+out:
+	kfree(params.hashstate);
+	kfree(desc);
+	return err;
+
+rollback:
+	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size);
+	goto out;
+}
+
+/**
+ * fsverity_ioctl_enable() - enable verity on a file
+ *
+ * Enable fs-verity on a file.  See the "FS_IOC_ENABLE_VERITY" section of
+ * Documentation/filesystems/fsverity.rst for the documentation.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_ioctl_enable(struct file *filp, const void __user *uarg)
+{
+	struct inode *inode = file_inode(filp);
+	struct fsverity_enable_arg arg;
+	int err;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	if (arg.version != 1)
+		return -EINVAL;
+
+	if (arg.__reserved1 ||
+	    memchr_inv(arg.__reserved2, 0, sizeof(arg.__reserved2)))
+		return -EINVAL;
+
+	if (arg.block_size != PAGE_SIZE)
+		return -EINVAL;
+
+	if (arg.salt_size > FIELD_SIZEOF(struct fsverity_descriptor, salt))
+		return -EMSGSIZE;
+
+	if (arg.sig_size)
+		return -EINVAL;
+
+	/*
+	 * Require a regular file with write access.  But the actual fd must
+	 * still be readonly so that we can lock out all writers.  This is
+	 * needed to guarantee that no writable fds exist to the file once it
+	 * has verity enabled, and to stabilize the data being hashed.
+	 */
+
+	err = inode_permission(inode, MAY_WRITE);
+	if (err)
+		return err;
+
+	if (IS_APPEND(inode))
+		return -EPERM;
+
+	if (S_ISDIR(inode->i_mode))
+		return -EISDIR;
+
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	err = mnt_want_write_file(filp);
+	if (err) /* -EROFS */
+		return err;
+
+	err = deny_write_access(filp);
+	if (err) /* -ETXTBSY */
+		goto out_drop_write;
+
+	inode_lock(inode);
+
+	if (IS_VERITY(inode)) {
+		err = -EEXIST;
+		goto out_unlock;
+	}
+
+	err = enable_verity(filp, &arg);
+	if (err)
+		goto out_unlock;
+
+	/*
+	 * Some pages of the file may have been evicted from pagecache after
+	 * being used in the Merkle tree construction, then read into pagecache
+	 * again by another process reading from the file concurrently.  Since
+	 * these pages didn't undergo verification against the file measurement
+	 * which fs-verity now claims to be enforcing, we have to wipe the
+	 * pagecache to ensure that all future reads are verified.
+	 */
+	filemap_write_and_wait(inode->i_mapping);
+	invalidate_inode_pages2(inode->i_mapping);
+
+	/*
+	 * allow_write_access() is needed to pair with deny_write_access().
+	 * Regardless, the filesystem won't allow writing to verity files.
+	 */
+out_unlock:
+	inode_unlock(inode);
+	allow_write_access(filp);
+out_drop_write:
+	mnt_drop_write_file(filp);
+	return err;
+}
+EXPORT_SYMBOL_GPL(fsverity_ioctl_enable);
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 95c257cd7ff0..b0b1854a9450 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -17,6 +17,42 @@
 /* Verity operations for filesystems */
 struct fsverity_operations {
 
+	/**
+	 * Begin enabling verity on the given file.
+	 *
+	 * @filp: a readonly file descriptor for the file
+	 *
+	 * The filesystem must do any needed filesystem-specific preparations
+	 * for enabling verity, e.g. evicting inline data.
+	 *
+	 * i_rwsem is held for write.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*begin_enable_verity)(struct file *filp);
+
+	/**
+	 * End enabling verity on the given file.
+	 *
+	 * @filp: a readonly file descriptor for the file
+	 * @desc: the verity descriptor to write, or NULL on failure
+	 * @desc_size: size of verity descriptor, or 0 on failure
+	 * @merkle_tree_size: total bytes the Merkle tree took up
+	 *
+	 * If desc == NULL, then enabling verity failed and the filesystem must
+	 * only do any necessary cleanups.  Else, it must also store the given
+	 * verity descriptor to a fs-specific location associated with the inode
+	 * and do any fs-specific actions needed to mark the inode as a verity
+	 * inode, e.g. setting a bit in the on-disk inode.  The filesystem is
+	 * also responsible for setting the S_VERITY flag in the VFS inode.
+	 *
+	 * i_rwsem is held for write.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*end_enable_verity)(struct file *filp, const void *desc,
+				 size_t desc_size, u64 merkle_tree_size);
+
 	/**
 	 * Get the verity descriptor of the given inode.
 	 *
@@ -50,6 +86,22 @@ struct fsverity_operations {
 	 */
 	struct page *(*read_merkle_tree_page)(struct inode *inode,
 					      pgoff_t index);
+
+	/**
+	 * Write a Merkle tree block to the given inode.
+	 *
+	 * @inode: the inode for which the Merkle tree is being built
+	 * @buf: block to write
+	 * @index: 0-based index of the block within the Merkle tree
+	 * @log_blocksize: log base 2 of the Merkle tree block size
+	 *
+	 * This is only called between ->begin_enable_verity() and
+	 * ->end_enable_verity().  i_rwsem is held for write.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
+				       u64 index, int log_blocksize);
 };
 
 #ifdef CONFIG_FS_VERITY
@@ -60,6 +112,10 @@ static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
 	return READ_ONCE(inode->i_verity_info);
 }
 
+/* enable.c */
+
+extern int fsverity_ioctl_enable(struct file *filp, const void __user *arg);
+
 /* open.c */
 
 extern int fsverity_file_open(struct inode *inode, struct file *filp);
@@ -79,6 +135,14 @@ static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
 	return NULL;
 }
 
+/* enable.c */
+
+static inline int fsverity_ioctl_enable(struct file *filp,
+					const void __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+
 /* open.c */
 
 static inline int fsverity_file_open(struct inode *inode, struct file *filp)
-- 
2.22.0
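
To make the per-level block counts in build_merkle_tree() concrete, here is a userspace sketch that applies the same rounding-up division repeatedly. The names are hypothetical, and stopping once a single block remains is an assumption inferred from the WARN_ON(num_blocks_to_hash != 1) at level 'num_levels'; the real num_levels comes from fsverity_init_merkle_tree_params(), which is not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_LEVELS 8	/* arbitrary bound for this sketch */

/* Hypothetical summary of how many hash blocks each tree level needs. */
struct model_tree_shape {
	unsigned int num_levels;
	uint64_t blocks_in_level[MODEL_MAX_LEVELS];	/* [0] is the leaf level */
	uint64_t tree_blocks;				/* sum over all levels */
};

static struct model_tree_shape model_tree_shape(uint64_t file_size,
						unsigned int log_blocksize,
						unsigned int log_arity)
{
	struct model_tree_shape shape = { 0 };
	const uint64_t block_size = 1ULL << log_blocksize;
	const uint64_t hashes_per_block = 1ULL << log_arity;
	/* Number of data blocks, rounding up (same as the patch) */
	uint64_t blocks = (file_size + block_size - 1) >> log_blocksize;

	assert(log_arity > 0 && log_arity < log_blocksize);

	/* Each level holds one hash per block of the level below it. */
	while (blocks > 1) {
		assert(shape.num_levels < MODEL_MAX_LEVELS);
		blocks = (blocks + hashes_per_block - 1) >> log_arity;
		shape.blocks_in_level[shape.num_levels++] = blocks;
		shape.tree_blocks += blocks;
	}
	return shape;
}
```

With 4096-byte blocks and 128 hashes per block, a file of 81920000 bytes (20000 data blocks) needs 157, 2 and 1 hash blocks on levels 0 to 2, for 160 tree blocks in total. In this model a file that fits in one block has no tree levels at all, so its root hash would simply be the hash of that one data block.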

^ permalink raw reply related

* [PATCH v6 11/17] fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
From: Eric Biggers @ 2019-07-01 15:32 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190701153237.1777-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add a function for filesystems to call to implement the
FS_IOC_MEASURE_VERITY ioctl.  This ioctl retrieves the file measurement
that fs-verity calculated for the given file and is enforcing for reads;
i.e., reads that don't match this hash will fail.  This ioctl can be
used for authentication or logging of file measurements in userspace.

See the "FS_IOC_MEASURE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/verity/Makefile       |  1 +
 fs/verity/measure.c      | 57 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fsverity.h | 11 ++++++++
 3 files changed, 69 insertions(+)
 create mode 100644 fs/verity/measure.c

diff --git a/fs/verity/Makefile b/fs/verity/Makefile
index 04b37475fd28..6f7675ae0a31 100644
--- a/fs/verity/Makefile
+++ b/fs/verity/Makefile
@@ -3,5 +3,6 @@
 obj-$(CONFIG_FS_VERITY) += enable.o \
 			   hash_algs.o \
 			   init.o \
+			   measure.o \
 			   open.o \
 			   verify.o
diff --git a/fs/verity/measure.c b/fs/verity/measure.c
new file mode 100644
index 000000000000..05049b68c745
--- /dev/null
+++ b/fs/verity/measure.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fs/verity/measure.c: ioctl to get a verity file's measurement
+ *
+ * Copyright 2019 Google LLC
+ */
+
+#include "fsverity_private.h"
+
+#include <linux/uaccess.h>
+
+/**
+ * fsverity_ioctl_measure() - get a verity file's measurement
+ *
+ * Retrieve the file measurement that the kernel is enforcing for reads from a
+ * verity file.  See the "FS_IOC_MEASURE_VERITY" section of
+ * Documentation/filesystems/fsverity.rst for the documentation.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_ioctl_measure(struct file *filp, void __user *_uarg)
+{
+	const struct inode *inode = file_inode(filp);
+	struct fsverity_digest __user *uarg = _uarg;
+	const struct fsverity_info *vi;
+	const struct fsverity_hash_alg *hash_alg;
+	struct fsverity_digest arg;
+
+	vi = fsverity_get_info(inode);
+	if (!vi)
+		return -ENODATA; /* not a verity file */
+	hash_alg = vi->tree_params.hash_alg;
+
+	/*
+	 * The user specifies the digest_size their buffer has space for; we can
+	 * return the digest if it fits in the available space.  We write back
+	 * the actual size, which may be shorter than the user-specified size.
+	 */
+
+	if (get_user(arg.digest_size, &uarg->digest_size))
+		return -EFAULT;
+	if (arg.digest_size < hash_alg->digest_size)
+		return -EOVERFLOW;
+
+	memset(&arg, 0, sizeof(arg));
+	arg.digest_algorithm = hash_alg - fsverity_hash_algs;
+	arg.digest_size = hash_alg->digest_size;
+
+	if (copy_to_user(uarg, &arg, sizeof(arg)))
+		return -EFAULT;
+
+	if (copy_to_user(uarg->digest, vi->measurement, hash_alg->digest_size))
+		return -EFAULT;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(fsverity_ioctl_measure);
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index b0b1854a9450..9ebb97c174c7 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -116,6 +116,10 @@ static inline struct fsverity_info *fsverity_get_info(const struct inode *inode)
 
 extern int fsverity_ioctl_enable(struct file *filp, const void __user *arg);
 
+/* measure.c */
+
+extern int fsverity_ioctl_measure(struct file *filp, void __user *arg);
+
 /* open.c */
 
 extern int fsverity_file_open(struct inode *inode, struct file *filp);
@@ -143,6 +147,13 @@ static inline int fsverity_ioctl_enable(struct file *filp,
 	return -EOPNOTSUPP;
 }
 
+/* measure.c */
+
+static inline int fsverity_ioctl_measure(struct file *filp, void __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+
 /* open.c */
 
 static inline int fsverity_file_open(struct inode *inode, struct file *filp)
-- 
2.22.0
