* [PATCH 02/17] fsinfo: Add fsinfo() syscall to query filesystem information [ver #20]
From: David Howells @ 2020-07-24 13:35 UTC (permalink / raw)
To: torvalds, viro
Cc: linux-api, dhowells, raven, mszeredi, christian, jannh,
darrick.wong, kzak, jlayton, linux-api, linux-fsdevel,
linux-security-module, linux-kernel
In-Reply-To: <159559768062.2144584.13583793543173131929.stgit@warthog.procyon.org.uk>
Add a system call to allow filesystem information to be queried. A request
value can be given to indicate the desired attribute. Support is provided
for enumerating multi-value attributes.
===============
NEW SYSTEM CALL
===============
The new system call looks like:
	int ret = fsinfo(int dfd,
			 const char *pathname,
			 const struct fsinfo_params *params,
			 size_t params_size,
			 void *result_buffer,
			 size_t result_buf_size);
The params parameter optionally points to a block of parameters:
	struct fsinfo_params {
		__u64	resolve_flags;
		__u32	at_flags;
		__u32	flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
	};
If params is NULL, the default is that params->request is
FSINFO_ATTR_STATFS and all the other fields are 0. params_size indicates
the size of the parameter struct. If the parameter block is shorter than
the kernel expects, the missing fields are treated as 0; if it is longer,
an error is returned unless the excess bytes are all zeros.
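The size rule above can be modelled in userspace. What follows is only a
sketch under stated assumptions: model_copy_params() is a hypothetical helper
that is not part of this patch, and the -E2BIG result for a non-zero excess
reflects the behaviour of copy_struct_from_user(), which the syscall uses to
fetch the block.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the parameter block (stand-ins for __u64/__u32). */
struct fsinfo_params {
	uint64_t resolve_flags;
	uint32_t at_flags;
	uint32_t flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/*
 * Hypothetical model of the size handling: the receiver expects ksize
 * bytes, the caller supplied usize bytes.
 */
static int model_copy_params(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t n = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst && src);

	/* Longer block: every byte beyond ksize must be zero. */
	for (i = ksize; i < usize; i++)
		if (s[i] != 0)
			return -E2BIG;

	/* Shorter block: the part the caller did not supply reads as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, n);
	return 0;
}
```

A caller built against an older, shorter struct therefore keeps working, and a
caller built against a newer, longer struct is only rejected if it actually
sets a field the kernel does not know about.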
The object to be queried is specified as follows - part of params->flags
indicates the type of reference:

 (1) FSINFO_FLAGS_QUERY_PATH - dfd, pathname and at_flags indicate a
     filesystem object to query.

     There is no separate system call providing an analogue of lstat() -
     AT_SYMLINK_NOFOLLOW should be set in at_flags instead.
     AT_NO_AUTOMOUNT can also be used to allow an automount point to be
     queried without triggering it.

     RESOLVE_* flags can also be set in resolve_flags to further restrict
     the pathwalk.

 (2) FSINFO_FLAGS_QUERY_FD - dfd indicates a file descriptor pointing to
     the filesystem object to query. pathname should be NULL.

 (3) FSINFO_FLAGS_QUERY_MOUNT - pathname indicates the numeric ID of the
     mountpoint to query as a string. dfd is used to constrain which
     mounts can be accessed. If dfd is AT_FDCWD, the mount must be within
     the subtree rooted at chroot, otherwise the mount must be within the
     subtree rooted at the directory specified by dfd.

 (4) In the future FSINFO_FLAGS_QUERY_FSCONTEXT will be added - dfd will
     indicate a context handle fd obtained from fsopen() or fspick(),
     allowing that to be queried before the target superblock is attached
     to the filesystem or even created.
params->request indicates the attribute/attributes to be queried. This can
be one of:
	FSINFO_ATTR_STATFS			- statfs-style info
	FSINFO_ATTR_IDS				- Filesystem IDs
	FSINFO_ATTR_LIMITS			- Filesystem limits
	FSINFO_ATTR_SUPPORTS			- Support for statx, ioctl, etc.
	FSINFO_ATTR_TIMESTAMP_INFO		- Inode timestamp info
	FSINFO_ATTR_VOLUME_ID			- Volume ID (string)
	FSINFO_ATTR_VOLUME_UUID			- Volume UUID
	FSINFO_ATTR_VOLUME_NAME			- Volume name (string)
	FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO	- Information about attr Nth
	FSINFO_ATTR_FSINFO_ATTRIBUTES		- List of supported attrs
Some attributes (such as the servers backing a network filesystem) can have
multiple values. These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
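The enumeration described above might be driven by a loop like the following.
This is a sketch, not code from this patch: fsinfo_fn, count_nth_values() and
stub_three_values() are hypothetical names, the callback is assumed to return
the value size or a negative errno (a real userspace wrapper would report
errors through errno instead), and only Nth is walked, with Mth held at 0.
The caller passes a pre-filled parameter block so that no flag or attribute
values have to be assumed here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the parameter block (stand-ins for __u64/__u32). */
struct fsinfo_params {
	uint64_t resolve_flags;
	uint32_t at_flags;
	uint32_t flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
};

/* Hypothetical backend: same arguments as fsinfo(), returns size or -errno. */
typedef long (*fsinfo_fn)(int dfd, const char *pathname,
			  const struct fsinfo_params *params,
			  size_t params_size,
			  void *result_buffer, size_t result_buf_size);

/* Count the values of an attribute by bumping Nth until -ENODATA. */
static long count_nth_values(fsinfo_fn query, int dfd, const char *pathname,
			     struct fsinfo_params params)
{
	long count = 0;

	assert(query != NULL);
	params.Mth = 0;
	for (;;) {
		long ret;

		params.Nth = (uint32_t)count;
		/* NULL buffer with size 0: ask for the size only. */
		ret = query(dfd, pathname, &params, sizeof(params), NULL, 0);
		if (ret == -ENODATA)
			return count;	/* ran off the end of the list */
		if (ret < 0)
			return ret;	/* some other failure */
		count++;
	}
}

/* Stub backend for illustration: pretends the attribute has three values. */
static long stub_three_values(int dfd, const char *pathname,
			      const struct fsinfo_params *params,
			      size_t params_size,
			      void *result_buffer, size_t result_buf_size)
{
	(void)dfd; (void)pathname; (void)params_size;
	(void)result_buffer; (void)result_buf_size;
	if (params->Mth != 0)
		return -ENODATA;
	return params->Nth < 3 ? 16 : -ENODATA;
}
```

Passing the backend as a function pointer keeps the loop testable against a
stub and lets the same code be pointed at a real syscall wrapper later.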
result_buffer and result_buf_size point to the reply buffer. The buffer is
filled up to the specified size, even if this means truncating the reply.
The size of the full reply is returned, irrespective of the amount of data
that was copied. In future versions, this will allow extra fields to be
tacked on to the end of the reply, but anyone not expecting them will only
get the subset they're expecting. If either result_buffer or
result_buf_size is 0, no copy will take place and the data size will be
returned.
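The copy-out rule can be illustrated with a small userspace model. This is a
sketch only: model_copy_out() is a hypothetical helper, not the kernel code.
The clear_tail argument models the zero-filling of unused buffer space that
the syscall's kerneldoc in this patch describes for structure-type values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the reply handling: copy at most result_buf_size
 * bytes of the value, optionally zero the unused tail of the buffer, and
 * always report the full size of the value.
 */
static size_t model_copy_out(void *result_buffer, size_t result_buf_size,
			     const void *value, size_t value_size,
			     int clear_tail)
{
	size_t n = value_size < result_buf_size ? value_size : result_buf_size;

	assert(value != NULL);

	/* Truncating copy; skipped entirely for a size-only query. */
	if (result_buffer && n > 0)
		memcpy(result_buffer, value, n);

	/* Zero whatever part of the caller's buffer the value did not fill. */
	if (result_buffer && clear_tail && result_buf_size > n)
		memset((char *)result_buffer + n, 0, result_buf_size - n);

	return value_size;
}
```

A caller can therefore query with a NULL buffer and a size of 0 to learn the
size, allocate that much, and query again, or pass a fixed-size struct and
compare the return value with sizeof() to see whether the kernel supplied
fewer or more fields than expected.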
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd32.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
fs/Kconfig | 7
fs/Makefile | 1
fs/fsinfo.c | 596 +++++++++++++++++++++++++
include/linux/fs.h | 4
include/linux/fsinfo.h | 74 +++
include/linux/syscalls.h | 4
include/uapi/asm-generic/unistd.h | 4
include/uapi/linux/fsinfo.h | 189 ++++++++
kernel/sys_ni.c | 1
samples/vfs/Makefile | 2
samples/vfs/test-fsinfo.c | 646 +++++++++++++++++++++++++++
28 files changed, 1544 insertions(+), 2 deletions(-)
create mode 100644 fs/fsinfo.c
create mode 100644 include/linux/fsinfo.h
create mode 100644 include/uapi/linux/fsinfo.h
create mode 100644 samples/vfs/test-fsinfo.c
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index b6cf8403da35..984abd1ac058 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
550 common watch_mount sys_watch_mount
+551 common fsinfo sys_fsinfo
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 27cc1f53f4a0..bd791f91f5bb 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 4f9cf98cdf0f..bd78eb2c487a 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_watch_mount 440
__SYSCALL(__NR_watch_mount, sys_watch_mount)
+#define __NR_fsinfo 441
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index fc6d87903781..09d144487b7d 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index c671aa0e4d25..1bdc26af3c54 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 65cc53f129ef..fb8543122904 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7f034a239930..b8362bd6bd4a 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
440 n32 watch_mount sys_watch_mount
+441 n32 fsinfo sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index d39b90de3642..60ca4091d378 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
440 n64 watch_mount sys_watch_mount
+441 n64 fsinfo sys_fsinfo
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 09f426cb45b1..07aea9379ca0 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
440 o32 watch_mount sys_watch_mount
+441 o32 fsinfo sys_fsinfo
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 52ff3454baa1..f8060767f11a 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 10b7ed3c7a1b..3036bf1336d2 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -529,3 +529,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 86f317bf52df..c0a111fdb3ce 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo sys_fsinfo
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0bb0f0b372c7..03b55c32441f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 369ab65c1e9a..a0144db9fb8c 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e760ba92c58d..edf90a2be0b9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
440 i386 watch_mount sys_watch_mount
+441 i386 fsinfo sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5b58621d4f75..ab0eda639d67 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5b28ee39f70f..979013890caf 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common watch_mount sys_watch_mount
+441 common fsinfo sys_fsinfo
diff --git a/fs/Kconfig b/fs/Kconfig
index 1a55e56d5c54..df76451ab49a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -15,6 +15,13 @@ config VALIDATE_FS_PARSER
Enable this to perform validation of the parameter description for a
filesystem when it is registered.
+config FSINFO
+ bool "Enable the fsinfo() system call"
+ help
+ Enable the file system information querying system call to allow
+ comprehensive information to be retrieved about a filesystem,
+ superblock or mount object.
+
if BLOCK
config FS_IOMAP
diff --git a/fs/Makefile b/fs/Makefile
index dd0d87e2ef19..93a7f8047585 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_COREDUMP) += coredump.o
obj-$(CONFIG_SYSCTL) += drop_caches.o
obj-$(CONFIG_FHANDLE) += fhandle.o
+obj-$(CONFIG_FSINFO) += fsinfo.o
obj-y += iomap/
obj-y += quota/
diff --git a/fs/fsinfo.c b/fs/fsinfo.c
new file mode 100644
index 000000000000..7d9c73e9cbde
--- /dev/null
+++ b/fs/fsinfo.c
@@ -0,0 +1,596 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/statfs.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/fsinfo.h>
+#include <uapi/linux/mount.h>
+#include "internal.h"
+
+/**
+ * fsinfo_opaque - Store opaque blob as an fsinfo attribute value.
+ * @s: The blob to store (may be NULL)
+ * @ctx: The parameter context
+ * @len: The length of the blob
+ */
+int fsinfo_opaque(const void *s, struct fsinfo_context *ctx, unsigned int len)
+{
+ void *p = ctx->buffer;
+ int ret = 0;
+
+ if (s) {
+ if (!ctx->want_size_only)
+ memcpy(p, s, len);
+ ret = len;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(fsinfo_opaque);
+
+/**
+ * fsinfo_string - Store a NUL-terminated string as an fsinfo attribute value.
+ * @s: The string to store (may be NULL)
+ * @ctx: The parameter context
+ */
+int fsinfo_string(const char *s, struct fsinfo_context *ctx)
+{
+ if (!s)
+ return 1;
+ return fsinfo_opaque(s, ctx, min_t(size_t, strlen(s) + 1, ctx->buf_size));
+}
+EXPORT_SYMBOL(fsinfo_string);
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_statfs *p = ctx->buffer;
+ struct kstatfs buf;
+ int ret;
+
+ ret = vfs_statfs(path, &buf);
+ if (ret < 0)
+ return ret;
+
+ p->f_blocks.lo = buf.f_blocks;
+ p->f_bfree.lo = buf.f_bfree;
+ p->f_bavail.lo = buf.f_bavail;
+ p->f_files.lo = buf.f_files;
+ p->f_ffree.lo = buf.f_ffree;
+ p->f_favail.lo = buf.f_ffree;
+ p->f_bsize = buf.f_bsize;
+ p->f_frsize = buf.f_frsize;
+ return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_ids *p = ctx->buffer;
+ struct super_block *sb;
+ struct kstatfs buf;
+ int ret;
+
+ ret = vfs_statfs(path, &buf);
+ if (ret < 0 && ret != -ENOSYS)
+ return ret;
+ if (ret == 0)
+ memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+
+ sb = path->dentry->d_sb;
+ p->f_fstype = sb->s_magic;
+ p->f_dev_major = MAJOR(sb->s_dev);
+ p->f_dev_minor = MINOR(sb->s_dev);
+ p->f_sb_id = sb->s_unique_id;
+ strlcpy(p->f_fs_name, sb->s_type->name, sizeof(p->f_fs_name));
+ return sizeof(*p);
+}
+
+int fsinfo_generic_limits(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_limits *p = ctx->buffer;
+ struct super_block *sb = path->dentry->d_sb;
+
+ p->max_file_size.hi = 0;
+ p->max_file_size.lo = sb->s_maxbytes;
+ p->max_ino.hi = 0;
+ p->max_ino.lo = UINT_MAX;
+ p->max_hard_links = sb->s_max_links;
+ p->max_uid = UINT_MAX;
+ p->max_gid = UINT_MAX;
+ p->max_projid = UINT_MAX;
+ p->max_filename_len = NAME_MAX;
+ p->max_symlink_len = PATH_MAX;
+ p->max_xattr_name_len = XATTR_NAME_MAX;
+ p->max_xattr_body_len = XATTR_SIZE_MAX;
+ p->max_dev_major = 0xffffff;
+ p->max_dev_minor = 0xff;
+ return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_limits);
+
+int fsinfo_generic_supports(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_supports *p = ctx->buffer;
+ struct super_block *sb = path->dentry->d_sb;
+
+ p->stx_mask = STATX_BASIC_STATS;
+ if (sb->s_d_op && sb->s_d_op->d_automount)
+ p->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+ return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_supports);
+
+static const struct fsinfo_timestamp_info fsinfo_default_timestamp_info = {
+ .atime = {
+ .minimum = S64_MIN,
+ .maximum = S64_MAX,
+ .gran_mantissa = 1,
+ .gran_exponent = 0,
+ },
+ .mtime = {
+ .minimum = S64_MIN,
+ .maximum = S64_MAX,
+ .gran_mantissa = 1,
+ .gran_exponent = 0,
+ },
+ .ctime = {
+ .minimum = S64_MIN,
+ .maximum = S64_MAX,
+ .gran_mantissa = 1,
+ .gran_exponent = 0,
+ },
+ .btime = {
+ .minimum = S64_MIN,
+ .maximum = S64_MAX,
+ .gran_mantissa = 1,
+ .gran_exponent = 0,
+ },
+};
+
+int fsinfo_generic_timestamp_info(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_timestamp_info *p = ctx->buffer;
+ struct super_block *sb = path->dentry->d_sb;
+ s8 exponent;
+
+ *p = fsinfo_default_timestamp_info;
+
+ if (sb->s_time_gran < 1000000000) {
+ if (sb->s_time_gran < 1000)
+ exponent = -9;
+ else if (sb->s_time_gran < 1000000)
+ exponent = -6;
+ else
+ exponent = -3;
+
+ p->atime.gran_exponent = exponent;
+ p->mtime.gran_exponent = exponent;
+ p->ctime.gran_exponent = exponent;
+ p->btime.gran_exponent = exponent;
+ }
+
+ return sizeof(*p);
+}
+EXPORT_SYMBOL(fsinfo_generic_timestamp_info);
+
+static int fsinfo_generic_volume_uuid(struct path *path, struct fsinfo_context *ctx)
+{
+ struct fsinfo_volume_uuid *p = ctx->buffer;
+ struct super_block *sb = path->dentry->d_sb;
+
+ memcpy(p, &sb->s_uuid, sizeof(*p));
+ return sizeof(*p);
+}
+
+static int fsinfo_generic_volume_id(struct path *path, struct fsinfo_context *ctx)
+{
+ return fsinfo_string(path->dentry->d_sb->s_id, ctx);
+}
+
+static const struct fsinfo_attribute fsinfo_common_attributes[] = {
+ FSINFO_VSTRUCT (FSINFO_ATTR_STATFS, fsinfo_generic_statfs),
+ FSINFO_VSTRUCT (FSINFO_ATTR_IDS, fsinfo_generic_ids),
+ FSINFO_VSTRUCT (FSINFO_ATTR_LIMITS, fsinfo_generic_limits),
+ FSINFO_VSTRUCT (FSINFO_ATTR_SUPPORTS, fsinfo_generic_supports),
+ FSINFO_VSTRUCT (FSINFO_ATTR_TIMESTAMP_INFO, fsinfo_generic_timestamp_info),
+ FSINFO_STRING (FSINFO_ATTR_VOLUME_ID, fsinfo_generic_volume_id),
+ FSINFO_VSTRUCT (FSINFO_ATTR_VOLUME_UUID, fsinfo_generic_volume_uuid),
+
+ FSINFO_LIST (FSINFO_ATTR_FSINFO_ATTRIBUTES, (void *)123UL),
+ FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, (void *)123UL),
+ {}
+};
+
+/*
+ * Determine an attribute's minimum buffer size and, if the buffer is large
+ * enough, get the attribute value.
+ */
+static int fsinfo_get_this_attribute(struct path *path,
+ struct fsinfo_context *ctx,
+ const struct fsinfo_attribute *attr)
+{
+ int buf_size;
+
+ if (ctx->Nth != 0 && !(attr->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)))
+ return -ENODATA;
+ if (ctx->Mth != 0 && !(attr->flags & FSINFO_FLAGS_NM))
+ return -ENODATA;
+
+ switch (attr->type) {
+ case FSINFO_TYPE_VSTRUCT:
+ ctx->clear_tail = true;
+ buf_size = attr->size;
+ break;
+ case FSINFO_TYPE_STRING:
+ case FSINFO_TYPE_OPAQUE:
+ case FSINFO_TYPE_LIST:
+ buf_size = 4096;
+ break;
+ default:
+ return -ENOPKG;
+ }
+
+ if (ctx->buf_size < buf_size)
+ return buf_size;
+
+ return attr->get(path, ctx);
+}
+
+static void fsinfo_attributes_insert(struct fsinfo_context *ctx,
+ const struct fsinfo_attribute *attr)
+{
+ __u32 *p = ctx->buffer;
+ unsigned int i;
+
+ if (ctx->usage >= ctx->buf_size ||
+ ctx->buf_size - ctx->usage < sizeof(__u32)) {
+ ctx->usage += sizeof(__u32);
+ return;
+ }
+
+ for (i = 0; i < ctx->usage / sizeof(__u32); i++)
+ if (p[i] == attr->attr_id)
+ return;
+
+ p[i] = attr->attr_id;
+ ctx->usage += sizeof(__u32);
+}
+
+static int fsinfo_list_attributes(struct path *path,
+ struct fsinfo_context *ctx,
+ const struct fsinfo_attribute *attributes)
+{
+ const struct fsinfo_attribute *a;
+
+ for (a = attributes; a->get; a++)
+ fsinfo_attributes_insert(ctx, a);
+ return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+static int fsinfo_get_attribute_info(struct path *path,
+ struct fsinfo_context *ctx,
+ const struct fsinfo_attribute *attributes)
+{
+ const struct fsinfo_attribute *a;
+ struct fsinfo_attribute_info *p = ctx->buffer;
+
+ if (!ctx->buf_size)
+ return sizeof(*p);
+
+ for (a = attributes; a->get; a++) {
+ if (a->attr_id == ctx->Nth) {
+ p->attr_id = a->attr_id;
+ p->type = a->type;
+ p->flags = a->flags;
+ p->size = a->size;
+ p->size = a->size;
+ return sizeof(*p);
+ }
+ }
+ return -EOPNOTSUPP; /* We want to go through all the lists */
+}
+
+/**
+ * fsinfo_get_attribute - Look up and handle an attribute
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ * @attributes: List of attributes to search.
+ *
+ * Look through a list of attributes for one that matches the requested
+ * attribute then call the handler for it.
+ */
+int fsinfo_get_attribute(struct path *path, struct fsinfo_context *ctx,
+ const struct fsinfo_attribute *attributes)
+{
+ const struct fsinfo_attribute *a;
+
+ switch (ctx->requested_attr) {
+ case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+ return fsinfo_get_attribute_info(path, ctx, attributes);
+ case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+ return fsinfo_list_attributes(path, ctx, attributes);
+ default:
+ for (a = attributes; a->get; a++)
+ if (a->attr_id == ctx->requested_attr)
+ return fsinfo_get_this_attribute(path, ctx, a);
+ return -EOPNOTSUPP;
+ }
+}
+EXPORT_SYMBOL(fsinfo_get_attribute);
+
+/**
+ * fsinfo_call - Handle an fsinfo attribute generically
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ */
+static int fsinfo_call(struct path *path, struct fsinfo_context *ctx)
+{
+ int ret;
+
+ if (path->dentry->d_sb->s_op->fsinfo) {
+ ret = path->dentry->d_sb->s_op->fsinfo(path, ctx);
+ if (ret != -EOPNOTSUPP)
+ return ret;
+ }
+ ret = fsinfo_get_attribute(path, ctx, fsinfo_common_attributes);
+ if (ret != -EOPNOTSUPP)
+ return ret;
+
+ switch (ctx->requested_attr) {
+ case FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO:
+ return -ENODATA;
+ case FSINFO_ATTR_FSINFO_ATTRIBUTES:
+ return ctx->usage;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+/**
+ * vfs_fsinfo - Retrieve filesystem information
+ * @path: The object to query
+ * @ctx: Parameters to define a request and place to store result
+ *
+ * Get an attribute on a filesystem or an object within a filesystem. The
+ * filesystem attribute to be queried is indicated by @ctx->requested_attr, and
+ * if it's a multi-valued attribute, the particular value is selected by
+ * @ctx->Nth and then @ctx->Mth.
+ *
+ * For common attributes, a value may be fabricated if it is not supported by
+ * the filesystem.
+ *
+ * On success, the size of the attribute's value is returned (0 is a valid
+ * size). A buffer will have been allocated and will be pointed to by
+ * @ctx->buffer. The caller must free this with kvfree().
+ *
+ * Errors can also be returned: -ENOMEM if a buffer cannot be allocated, -EPERM
+ * or -EACCES if permission is denied by the LSM, -EOPNOTSUPP if an attribute
+ * doesn't exist for the specified object or -ENODATA if the attribute exists,
+ * but the Nth,Mth value does not exist. -EMSGSIZE indicates that the value is
+ * unmanageable internally and -ENOPKG indicates other internal failure.
+ *
+ * Errors such as -EIO may also come from attempts to access media or servers
+ * to obtain the requested information if it's not immediately to hand.
+ *
+ * [*] Note that the caller may set @ctx->want_size_only if it only wants the
+ * size of the value and not the data. If this is set, a buffer may not be
+ * allocated under some circumstances. This is intended for size query by
+ * userspace.
+ *
+ * [*] Note that @ctx->clear_tail will be returned set if the data should be
+ * padded out with zeros when writing it to userspace.
+ */
+static int vfs_fsinfo(struct path *path, struct fsinfo_context *ctx)
+{
+ struct dentry *dentry = path->dentry;
+ int ret;
+
+ ret = security_sb_statfs(dentry);
+ if (ret)
+ return ret;
+
+ /* Call the handler to find out the buffer size required. */
+ ctx->buf_size = 0;
+ ret = fsinfo_call(path, ctx);
+ if (ret < 0 || ctx->want_size_only)
+ return ret;
+ ctx->buf_size = ret;
+
+ do {
+ /* Allocate a buffer of the requested size. */
+ if (ctx->buf_size > INT_MAX)
+ return -EMSGSIZE;
+ ctx->buffer = kvzalloc(ctx->buf_size, GFP_KERNEL);
+ if (!ctx->buffer)
+ return -ENOMEM;
+
+ ctx->usage = 0;
+ ctx->skip = 0;
+ ret = fsinfo_call(path, ctx);
+ if (IS_ERR_VALUE((long)ret))
+ return ret;
+ if ((unsigned int)ret <= ctx->buf_size)
+ return ret; /* It fitted */
+
+ /* We need to resize the buffer */
+ ctx->buf_size = roundup(ret, PAGE_SIZE);
+ kvfree(ctx->buffer);
+ ctx->buffer = NULL;
+ } while (!signal_pending(current));
+
+ return -ERESTARTSYS;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *pathname,
+ const struct fsinfo_params *up,
+ struct fsinfo_context *ctx)
+{
+ struct path path;
+ unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ int ret = -EINVAL;
+
+ if (up->resolve_flags & ~VALID_RESOLVE_FLAGS)
+ return -EINVAL;
+ if (up->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+ AT_EMPTY_PATH))
+ return -EINVAL;
+
+ if (up->resolve_flags & RESOLVE_NO_XDEV)
+ lookup_flags |= LOOKUP_NO_XDEV;
+ if (up->resolve_flags & RESOLVE_NO_MAGICLINKS)
+ lookup_flags |= LOOKUP_NO_MAGICLINKS;
+ if (up->resolve_flags & RESOLVE_NO_SYMLINKS)
+ lookup_flags |= LOOKUP_NO_SYMLINKS;
+ if (up->resolve_flags & RESOLVE_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
+ if (up->resolve_flags & RESOLVE_IN_ROOT)
+ lookup_flags |= LOOKUP_IN_ROOT;
+ if (up->at_flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (up->at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (up->at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+ ret = user_path_at(dfd, pathname, lookup_flags, &path);
+ if (ret)
+ goto out;
+
+ ret = vfs_fsinfo(&path, ctx);
+ path_put(&path);
+ if (retry_estale(ret, lookup_flags)) {
+ lookup_flags |= LOOKUP_REVAL;
+ goto retry;
+ }
+out:
+ return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_context *ctx)
+{
+ struct fd f = fdget_raw(fd);
+ int ret = -EBADF;
+
+ if (f.file) {
+ ret = vfs_fsinfo(&f.file->f_path, ctx);
+ fdput(f);
+ }
+ return ret;
+}
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @pathname: Filesystem to query or NULL.
+ * @params: Parameters to define request (NULL: FSINFO_ATTR_STATFS).
+ * @params_size: Size of parameter buffer.
+ * @result_buffer: Result buffer.
+ * @result_buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem. The filesystem attribute to be queried is
+ * indicated by @params->request, and some of the attributes can have multiple
+ * values, indexed by @params->Nth and @params->Mth. If @params is NULL,
+ * then the 0th fsinfo_attr_statfs attribute is queried. If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned. If
+ * @result_buf_size is 0 or @result_buffer is NULL, only the size is returned.
+ * If the size of the value is larger than @result_buf_size, it will be
+ * truncated by the copy. If the size of the value is smaller than
+ * @result_buf_size then the excess buffer space will be cleared. The full
+ * size of the value will be returned, irrespective of how much data is
+ * actually placed in the buffer.
+ */
+SYSCALL_DEFINE6(fsinfo,
+ int, dfd,
+ const char __user *, pathname,
+ const struct fsinfo_params __user *, params,
+ size_t, params_size,
+ void __user *, result_buffer,
+ size_t, result_buf_size)
+{
+ struct fsinfo_context ctx;
+ struct fsinfo_params user_params;
+ unsigned int result_size;
+ void *r;
+ int ret;
+
+ if ((!params && params_size) ||
+ ( params && !params_size) ||
+ (!result_buffer && result_buf_size) ||
+ ( result_buffer && !result_buf_size))
+ return -EINVAL;
+ if (result_buf_size > UINT_MAX)
+ return -EOVERFLOW;
+
+ memset(&ctx, 0, sizeof(ctx));
+ ctx.requested_attr = FSINFO_ATTR_STATFS;
+ ctx.flags = FSINFO_FLAGS_QUERY_PATH;
+ ctx.want_size_only = (result_buf_size == 0);
+
+ if (params) {
+ ret = copy_struct_from_user(&user_params, sizeof(user_params),
+ params, params_size);
+ if (ret < 0)
+ return ret;
+ if (user_params.flags & ~FSINFO_FLAGS_QUERY_MASK)
+ return -EINVAL;
+ ctx.flags = user_params.flags;
+ ctx.requested_attr = user_params.request;
+ ctx.Nth = user_params.Nth;
+ ctx.Mth = user_params.Mth;
+ }
+
+ switch (ctx.flags & FSINFO_FLAGS_QUERY_MASK) {
+ case FSINFO_FLAGS_QUERY_PATH:
+ ret = vfs_fsinfo_path(dfd, pathname, &user_params, &ctx);
+ break;
+ case FSINFO_FLAGS_QUERY_FD:
+ if (pathname)
+ return -EINVAL;
+ ret = vfs_fsinfo_fd(dfd, &ctx);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (ret < 0)
+ goto error;
+
+ r = ctx.buffer + ctx.skip;
+ result_size = min_t(size_t, ret, result_buf_size);
+ if (result_size > 0 &&
+ copy_to_user(result_buffer, r, result_size) != 0) {
+ ret = -EFAULT;
+ goto error;
+ }
+
+ /* Clear any part of the buffer that we won't fill if we're putting a
+ * struct in there. Strings, opaque objects and arrays are expected to
+ * be variable length.
+ */
+ if (ctx.clear_tail &&
+ result_buf_size > result_size &&
+ clear_user(result_buffer + result_size,
+ result_buf_size - result_size) != 0) {
+ ret = -EFAULT;
+ goto error;
+ }
+
+error:
+ kvfree(ctx.buffer);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 28a29356eace..3284f497de0a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -68,6 +68,7 @@ struct fsverity_info;
struct fsverity_operations;
struct fs_context;
struct fs_parameter_spec;
+struct fsinfo_context;
extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1963,6 +1964,9 @@ struct super_operations {
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
+#ifdef CONFIG_FSINFO
+ int (*fsinfo)(struct path *, struct fsinfo_context *);
+#endif
int (*remount_fs) (struct super_block *, int *, char *);
void (*umount_begin) (struct super_block *);
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..a811d69b02ff
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Filesystem information query
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#ifdef CONFIG_FSINFO
+
+#include <uapi/linux/fsinfo.h>
+
+struct path;
+
+#define FSINFO_NORMAL_ATTR_MAX_SIZE 4096
+
+struct fsinfo_context {
+ __u32 flags; /* [in] FSINFO_FLAGS_* */
+ __u32 requested_attr; /* [in] What is being asked for */
+ __u32 Nth; /* [in] Instance of it (some may have multiple) */
+ __u32 Mth; /* [in] Subinstance */
+ bool want_size_only; /* [in] Just want to know the size, not the data */
+ bool clear_tail; /* [out] T if tail of buffer should be cleared */
+ unsigned int skip; /* [out] Number of bytes to skip in buffer */
+ unsigned int usage; /* [tmp] Amount of buffer used (if large) */
+ unsigned int buf_size; /* [tmp] Size of ->buffer[] */
+ void *buffer; /* [out] The reply buffer */
+};
+
+/*
+ * A filesystem information attribute definition.
+ */
+struct fsinfo_attribute {
+ unsigned int attr_id; /* The ID of the attribute */
+ enum fsinfo_value_type type:8; /* The type of the attribute's value(s) */
+ unsigned int flags:8;
+ unsigned int size:16; /* - Value size (FSINFO_STRUCT/LIST) */
+ int (*get)(struct path *path, struct fsinfo_context *params);
+};
+
+#define __FSINFO(A, T, S, G, F) \
+ { .attr_id = A, .type = T, .flags = F, .size = S, .get = G }
+
+#define _FSINFO(A, T, S, G) __FSINFO(A, T, S, G, 0)
+#define _FSINFO_N(A, T, S, G) __FSINFO(A, T, S, G, FSINFO_FLAGS_N)
+#define _FSINFO_NM(A, T, S, G) __FSINFO(A, T, S, G, FSINFO_FLAGS_NM)
+
+#define _FSINFO_VSTRUCT(A,S,G) _FSINFO (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_N(A,S,G) _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+#define _FSINFO_VSTRUCT_NM(A,S,G) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G)
+
+#define FSINFO_VSTRUCT(A,G) _FSINFO_VSTRUCT (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_N(A,G) _FSINFO_VSTRUCT_N (A, A##__STRUCT, G)
+#define FSINFO_VSTRUCT_NM(A,G) _FSINFO_VSTRUCT_NM(A, A##__STRUCT, G)
+#define FSINFO_STRING(A,G) _FSINFO (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_N(A,G) _FSINFO_N (A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_STRING_NM(A,G) _FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G)
+#define FSINFO_OPAQUE(A,G) _FSINFO (A, FSINFO_TYPE_OPAQUE, 0, G)
+#define FSINFO_LIST(A,G) _FSINFO (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+#define FSINFO_LIST_N(A,G) _FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G)
+
+extern int fsinfo_opaque(const void *, struct fsinfo_context *, unsigned int);
+extern int fsinfo_string(const char *, struct fsinfo_context *);
+extern int fsinfo_generic_timestamp_info(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_supports(struct path *, struct fsinfo_context *);
+extern int fsinfo_generic_limits(struct path *, struct fsinfo_context *);
+extern int fsinfo_get_attribute(struct path *, struct fsinfo_context *,
+ const struct fsinfo_attribute *);
+
+#endif /* CONFIG_FSINFO */
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 88d03fd627ab..e31ad49af4c3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -47,6 +47,7 @@ struct stat64;
struct statfs;
struct statfs64;
struct statx;
+struct fsinfo_params;
struct __sysctl_args;
struct sysinfo;
struct timespec;
@@ -1007,6 +1008,9 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
asmlinkage long sys_watch_mount(int dfd, const char __user *path,
unsigned int at_flags, int watch_fd, int watch_id);
+asmlinkage long sys_fsinfo(int dfd, const char __user *pathname,
+ const struct fsinfo_params __user *params, size_t params_size,
+ void __user *result_buffer, size_t result_buf_size);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index fcdca8c7d30a..801e6baebd50 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_watch_mount 440
__SYSCALL(__NR_watch_mount, sys_watch_mount)
+#define __NR_fsinfo 442
+__SYSCALL(__NR_fsinfo, sys_fsinfo)
#undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 443
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..65892239ba86
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,189 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/openat2.h>
+
+/*
+ * The filesystem attributes that can be requested. Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+#define FSINFO_ATTR_STATFS 0x00 /* statfs()-style state */
+#define FSINFO_ATTR_IDS 0x01 /* Filesystem IDs */
+#define FSINFO_ATTR_LIMITS 0x02 /* Filesystem limits */
+#define FSINFO_ATTR_SUPPORTS 0x03 /* What's supported in statx, iocflags, ... */
+#define FSINFO_ATTR_TIMESTAMP_INFO 0x04 /* Inode timestamp info */
+#define FSINFO_ATTR_VOLUME_ID 0x05 /* Volume ID (string) */
+#define FSINFO_ATTR_VOLUME_UUID 0x06 /* Volume UUID (LE uuid) */
+#define FSINFO_ATTR_VOLUME_NAME 0x07 /* Volume name (string) */
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO 0x100 /* Information about attr N (for path) */
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES 0x101 /* List of supported attrs (for path) */
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that FSINFO_ATTR_STATFS instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+ __u64 resolve_flags; /* RESOLVE_* flags */
+ __u32 at_flags; /* AT_* flags */
+ __u32 flags; /* Flags controlling fsinfo() specifically */
+#define FSINFO_FLAGS_QUERY_MASK 0x0007 /* What object should fsinfo() query? */
+#define FSINFO_FLAGS_QUERY_PATH 0x0000 /* - path, specified by dirfd,pathname,AT_EMPTY_PATH */
+#define FSINFO_FLAGS_QUERY_FD 0x0001 /* - fd specified by dirfd */
+ __u32 request; /* ID of requested attribute */
+ __u32 Nth; /* Instance of it (some may have multiple) */
+ __u32 Mth; /* Subinstance of Nth instance */
+};
+
+enum fsinfo_value_type {
+ FSINFO_TYPE_VSTRUCT = 0, /* Version-lengthed struct (up to 4096 bytes) */
+ FSINFO_TYPE_STRING = 1, /* NUL-term var-length string (up to 4095 chars) */
+ FSINFO_TYPE_OPAQUE = 2, /* Opaque blob (unlimited size) */
+ FSINFO_TYPE_LIST = 3, /* List of ints/structs (unlimited size) */
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO).
+ *
+ * This gives information about the attributes supported by fsinfo for the
+ * given path.
+ */
+struct fsinfo_attribute_info {
+ unsigned int attr_id; /* The ID of the attribute */
+ enum fsinfo_value_type type; /* The type of the attribute's value(s) */
+ unsigned int flags;
+#define FSINFO_FLAGS_N 0x01 /* - Attr has a set of values */
+#define FSINFO_FLAGS_NM 0x02 /* - Attr has a set of sets of values */
+ unsigned int size; /* - Value size (FSINFO_STRUCT/FSINFO_LIST) */
+};
+
+#define FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO__STRUCT struct fsinfo_attribute_info
+#define FSINFO_ATTR_FSINFO_ATTRIBUTES__STRUCT __u32
+
+struct fsinfo_u128 {
+#if defined(__BYTE_ORDER) ? __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
+ __u64 hi;
+ __u64 lo;
+#elif defined(__BYTE_ORDER) ? __BYTE_ORDER == __LITTLE_ENDIAN : defined(__LITTLE_ENDIAN)
+ __u64 lo;
+ __u64 hi;
+#endif
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_STATFS).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+ struct fsinfo_u128 f_blocks; /* Total number of blocks in fs */
+ struct fsinfo_u128 f_bfree; /* Total number of free blocks */
+ struct fsinfo_u128 f_bavail; /* Number of free blocks available to ordinary user */
+ struct fsinfo_u128 f_files; /* Total number of file nodes in fs */
+ struct fsinfo_u128 f_ffree; /* Number of free file nodes */
+ struct fsinfo_u128 f_favail; /* Number of file nodes available to ordinary user */
+ __u64 f_bsize; /* Optimal block size */
+ __u64 f_frsize; /* Fragment size */
+};
+
+#define FSINFO_ATTR_STATFS__STRUCT struct fsinfo_statfs
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_IDS).
+ *
+ * List of basic identifiers as is normally found in statfs().
+ */
+struct fsinfo_ids {
+ char f_fs_name[15 + 1]; /* Filesystem name */
+ __u64 f_fsid; /* Short 64-bit Filesystem ID (as statfs) */
+ __u64 f_sb_id; /* Internal superblock ID for sbnotify()/mntnotify() */
+ __u32 f_fstype; /* Filesystem type from linux/magic.h [uncond] */
+ __u32 f_dev_major; /* As st_dev_* from struct statx [uncond] */
+ __u32 f_dev_minor;
+ __u32 __padding[1];
+};
+
+#define FSINFO_ATTR_IDS__STRUCT struct fsinfo_ids
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_LIMITS).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+ struct fsinfo_u128 max_file_size; /* Maximum file size */
+ struct fsinfo_u128 max_ino; /* Maximum inode number */
+ __u64 max_uid; /* Maximum UID supported */
+ __u64 max_gid; /* Maximum GID supported */
+ __u64 max_projid; /* Maximum project ID supported */
+ __u64 max_hard_links; /* Maximum number of hard links on a file */
+ __u64 max_xattr_body_len; /* Maximum xattr content length */
+ __u32 max_xattr_name_len; /* Maximum xattr name length */
+ __u32 max_filename_len; /* Maximum filename length */
+ __u32 max_symlink_len; /* Maximum symlink content length */
+ __u32 max_dev_major; /* Maximum device major representable */
+ __u32 max_dev_minor; /* Maximum device minor representable */
+ __u32 __padding[1];
+};
+
+#define FSINFO_ATTR_LIMITS__STRUCT struct fsinfo_limits
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_SUPPORTS).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+ __u64 stx_attributes; /* What statx::stx_attributes are supported */
+ __u32 stx_mask; /* What statx::stx_mask bits are supported */
+ __u32 fs_ioc_getflags; /* What FS_IOC_GETFLAGS may return */
+ __u32 fs_ioc_setflags_set; /* What FS_IOC_SETFLAGS may set */
+ __u32 fs_ioc_setflags_clear; /* What FS_IOC_SETFLAGS may clear */
+ __u32 fs_ioc_fsgetxattr_xflags; /* What FS_IOC_FSGETXATTR[A] may return in fsx_xflags */
+ __u32 fs_ioc_fssetxattr_xflags_set; /* What FS_IOC_FSSETXATTR may set in fsx_xflags */
+ __u32 fs_ioc_fssetxattr_xflags_clear; /* What FS_IOC_FSSETXATTR may clear in fsx_xflags */
+ __u32 win_file_attrs; /* What DOS/Windows FILE_* attributes are supported */
+};
+
+#define FSINFO_ATTR_SUPPORTS__STRUCT struct fsinfo_supports
+
+struct fsinfo_timestamp_one {
+ __s64 minimum; /* Minimum timestamp value in seconds */
+ __s64 maximum; /* Maximum timestamp value in seconds */
+ __u16 gran_mantissa; /* Granularity(secs) = mant * 10^exp */
+ __s8 gran_exponent;
+ __u8 __padding[5];
+};
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_TIMESTAMP_INFO).
+ */
+struct fsinfo_timestamp_info {
+ struct fsinfo_timestamp_one atime; /* Access time */
+ struct fsinfo_timestamp_one mtime; /* Modification time */
+ struct fsinfo_timestamp_one ctime; /* Change time */
+ struct fsinfo_timestamp_one btime; /* Birth/creation time */
+};
+
+#define FSINFO_ATTR_TIMESTAMP_INFO__STRUCT struct fsinfo_timestamp_info
+
+/*
+ * Information struct for fsinfo(FSINFO_ATTR_VOLUME_UUID).
+ */
+struct fsinfo_volume_uuid {
+ __u8 uuid[16];
+};
+
+#define FSINFO_ATTR_VOLUME_UUID__STRUCT struct fsinfo_volume_uuid
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3e1c5c9d2efe..f72a9e4ddc9a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -51,6 +51,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
COND_SYSCALL(io_uring_setup);
COND_SYSCALL(io_uring_enter);
COND_SYSCALL(io_uring_register);
+COND_SYSCALL(fsinfo);
/* fs/xattr.c */
diff --git a/samples/vfs/Makefile b/samples/vfs/Makefile
index 00b6824f9237..d63af5106fc2 100644
--- a/samples/vfs/Makefile
+++ b/samples/vfs/Makefile
@@ -1,5 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
-userprogs := test-fsmount test-statx
+userprogs := test-fsinfo test-fsmount test-statx
always-y := $(userprogs)
userccflags += -I usr/include
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
new file mode 100644
index 000000000000..934b25399ffe
--- /dev/null
+++ b/samples/vfs/test-fsinfo.c
@@ -0,0 +1,646 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2020 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+#include <arpa/inet.h>
+
+#ifndef __NR_fsinfo
+#define __NR_fsinfo -1
+#endif
+
+static bool debug = 0;
+static bool list_last;
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename,
+ struct fsinfo_params *params, size_t params_size,
+ void *result_buffer, size_t result_buf_size)
+{
+ return syscall(__NR_fsinfo, dfd, filename,
+ params, params_size,
+ result_buffer, result_buf_size);
+}
+
+struct fsinfo_attribute {
+ unsigned int attr_id;
+ enum fsinfo_value_type type;
+ unsigned int size;
+ const char *name;
+ void (*dump)(void *reply, unsigned int size);
+};
+
+static const struct fsinfo_attribute fsinfo_attributes[];
+
+static ssize_t get_fsinfo(const char *, const char *, struct fsinfo_params *, void **);
+
+static void dump_hex(FILE *f, unsigned char *data, int from, int to)
+{
+ unsigned offset, col = 0;
+ bool print_offset = true;
+
+ for (offset = from; offset < to; offset++) {
+ if (print_offset) {
+ fprintf(f, "%04x: ", offset);
+ print_offset = 0;
+ }
+ fprintf(f, "%02x", data[offset]);
+ col++;
+ if ((col & 3) == 0) {
+ if ((col & 15) == 0) {
+ fprintf(f, "\n");
+ print_offset = 1;
+ } else {
+ fprintf(f, " ");
+ }
+ }
+ }
+
+ if (!print_offset)
+ fprintf(f, "\n");
+}
+
+static void dump_attribute_info(void *reply, unsigned int size)
+{
+ struct fsinfo_attribute_info *attr_info = reply;
+ const struct fsinfo_attribute *attr;
+ char type[32], val_size[32];
+
+ switch (attr_info->type) {
+ case FSINFO_TYPE_VSTRUCT: strcpy(type, "V-STRUCT"); break;
+ case FSINFO_TYPE_STRING: strcpy(type, "STRING"); break;
+ case FSINFO_TYPE_OPAQUE: strcpy(type, "OPAQUE"); break;
+ case FSINFO_TYPE_LIST: strcpy(type, "LIST"); break;
+ default:
+ sprintf(type, "type-%x", attr_info->type);
+ break;
+ }
+
+ if (attr_info->flags & FSINFO_FLAGS_N)
+ strcat(type, " x N");
+ else if (attr_info->flags & FSINFO_FLAGS_NM)
+ strcat(type, " x NM");
+
+ for (attr = fsinfo_attributes; attr->name; attr++)
+ if (attr->attr_id == attr_info->attr_id)
+ break;
+
+ if (attr_info->size)
+ sprintf(val_size, "%u", attr_info->size);
+ else
+ strcpy(val_size, "-");
+
+ printf("%8x %-12s %08x %5s %s\n",
+ attr_info->attr_id,
+ type,
+ attr_info->flags,
+ val_size,
+ attr->name ? attr->name : "");
+}
+
+static void dump_fsinfo_generic_statfs(void *reply, unsigned int size)
+{
+ struct fsinfo_statfs *f = reply;
+
+ printf("\n");
+ printf("\tblocks : n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_blocks.lo,
+ (unsigned long long)f->f_bfree.lo,
+ (unsigned long long)f->f_bavail.lo);
+
+ printf("\tfiles : n=%llu fr=%llu av=%llu\n",
+ (unsigned long long)f->f_files.lo,
+ (unsigned long long)f->f_ffree.lo,
+ (unsigned long long)f->f_favail.lo);
+ printf("\tbsize : %llu\n",
+ (unsigned long long)f->f_bsize);
+ printf("\tfrsize : %llu\n",
+ (unsigned long long)f->f_frsize);
+}
+
+static void dump_fsinfo_generic_ids(void *reply, unsigned int size)
+{
+ struct fsinfo_ids *f = reply;
+
+ printf("\n");
+ printf("\tdev : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+ printf("\tfs : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+ printf("\tfsid : %llx\n", (unsigned long long)f->f_fsid);
+ printf("\tsbid : %llx\n", (unsigned long long)f->f_sb_id);
+}
+
+static void dump_fsinfo_generic_limits(void *reply, unsigned int size)
+{
+ struct fsinfo_limits *f = reply;
+
+ printf("\n");
+ printf("\tmax file size: %llx%016llx\n",
+ (unsigned long long)f->max_file_size.hi,
+ (unsigned long long)f->max_file_size.lo);
+ printf("\tmax ino : %llx%016llx\n",
+ (unsigned long long)f->max_ino.hi,
+ (unsigned long long)f->max_ino.lo);
+ printf("\tmax ids : u=%llx g=%llx p=%llx\n",
+ (unsigned long long)f->max_uid,
+ (unsigned long long)f->max_gid,
+ (unsigned long long)f->max_projid);
+ printf("\tmax dev : maj=%x min=%x\n",
+ f->max_dev_major, f->max_dev_minor);
+ printf("\tmax links : %llx\n",
+ (unsigned long long)f->max_hard_links);
+ printf("\tmax xattr : n=%x b=%llx\n",
+ f->max_xattr_name_len,
+ (unsigned long long)f->max_xattr_body_len);
+ printf("\tmax len : file=%x sym=%x\n",
+ f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_fsinfo_generic_supports(void *reply, unsigned int size)
+{
+ struct fsinfo_supports *f = reply;
+
+ printf("\n");
+ printf("\tstx_attr : %llx\n", (unsigned long long)f->stx_attributes);
+ printf("\tstx_mask : %x\n", f->stx_mask);
+ printf("\tfs_ioc_*flags: get=%x set=%x clr=%x\n",
+ f->fs_ioc_getflags, f->fs_ioc_setflags_set, f->fs_ioc_setflags_clear);
+ printf("\tfs_ioc_*xattr: fsx_xflags: get=%x set=%x clr=%x\n",
+ f->fs_ioc_fsgetxattr_xflags,
+ f->fs_ioc_fssetxattr_xflags_set,
+ f->fs_ioc_fssetxattr_xflags_clear);
+ printf("\twin_fattrs : %x\n", f->win_file_attrs);
+}
+
+static void print_time(struct fsinfo_timestamp_one *t, char stamp)
+{
+ printf("\t%ctime : gran=%uE%d range=%llx-%llx\n",
+ stamp,
+ t->gran_mantissa, t->gran_exponent,
+ (long long)t->minimum, (long long)t->maximum);
+}
+
+static void dump_fsinfo_generic_timestamp_info(void *reply, unsigned int size)
+{
+ struct fsinfo_timestamp_info *f = reply;
+
+ printf("\n");
+ print_time(&f->atime, 'a');
+ print_time(&f->mtime, 'm');
+ print_time(&f->ctime, 'c');
+ print_time(&f->btime, 'b');
+}
+
+static void dump_fsinfo_generic_volume_uuid(void *reply, unsigned int size)
+{
+ struct fsinfo_volume_uuid *f = reply;
+
+ printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+ "-%02x%02x%02x%02x%02x%02x\n",
+ f->uuid[ 0], f->uuid[ 1],
+ f->uuid[ 2], f->uuid[ 3],
+ f->uuid[ 4], f->uuid[ 5],
+ f->uuid[ 6], f->uuid[ 7],
+ f->uuid[ 8], f->uuid[ 9],
+ f->uuid[10], f->uuid[11],
+ f->uuid[12], f->uuid[13],
+ f->uuid[14], f->uuid[15]);
+}
+
+static void dump_string(void *reply, unsigned int size)
+{
+ char *s = reply, *p;
+ bool nl = false, last_nl = false;
+
+ p = s;
+ if (size >= 4096) {
+ size = 4096;
+ p[4092] = '.';
+ p[4093] = '.';
+ p[4094] = '.';
+ p[4095] = 0;
+ } else {
+ p[size] = 0;
+ }
+
+ for (p = s; *p; p++) {
+ if (*p == '\n') {
+ last_nl = nl = true;
+ continue;
+ }
+ last_nl = false;
+ if (!isprint(*p) && *p != '\t')
+ *p = '?';
+ }
+
+ if (nl)
+ putchar('\n');
+ printf("%s", s);
+ if (!last_nl)
+ putchar('\n');
+}
+
+#define dump_fsinfo_meta_attribute_info (void *)0x123
+#define dump_fsinfo_meta_attributes (void *)0x123
+
+/*
+ *
+ */
+#define __FSINFO(A, T, S, G, F, N) \
+ { .attr_id = A, .type = T, .size = S, .name = N, .dump = dump_##G }
+
+#define _FSINFO(A,T,S,G,N) __FSINFO(A, T, S, G, 0, N)
+#define _FSINFO_N(A,T,S,G,N) __FSINFO(A, T, S, G, FSINFO_FLAGS_N, N)
+#define _FSINFO_NM(A,T,S,G,N) __FSINFO(A, T, S, G, FSINFO_FLAGS_NM, N)
+
+#define _FSINFO_VSTRUCT(A,S,G,N) _FSINFO (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_N(A,S,G,N) _FSINFO_N (A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+#define _FSINFO_VSTRUCT_NM(A,S,G,N) _FSINFO_NM(A, FSINFO_TYPE_VSTRUCT, sizeof(S), G, N)
+
+#define FSINFO_VSTRUCT(A,G) _FSINFO_VSTRUCT (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_N(A,G) _FSINFO_VSTRUCT_N (A, A##__STRUCT, G, #A)
+#define FSINFO_VSTRUCT_NM(A,G) _FSINFO_VSTRUCT_NM(A, A##__STRUCT, G, #A)
+#define FSINFO_STRING(A,G) _FSINFO (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_N(A,G) _FSINFO_N (A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_STRING_NM(A,G) _FSINFO_NM(A, FSINFO_TYPE_STRING, 0, G, #A)
+#define FSINFO_OPAQUE(A,G) _FSINFO (A, FSINFO_TYPE_OPAQUE, 0, G, #A)
+#define FSINFO_LIST(A,G) _FSINFO (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+#define FSINFO_LIST_N(A,G) _FSINFO_N (A, FSINFO_TYPE_LIST, sizeof(A##__STRUCT), G, #A)
+
+static const struct fsinfo_attribute fsinfo_attributes[] = {
+ FSINFO_VSTRUCT (FSINFO_ATTR_STATFS, fsinfo_generic_statfs),
+ FSINFO_VSTRUCT (FSINFO_ATTR_IDS, fsinfo_generic_ids),
+ FSINFO_VSTRUCT (FSINFO_ATTR_LIMITS, fsinfo_generic_limits),
+ FSINFO_VSTRUCT (FSINFO_ATTR_SUPPORTS, fsinfo_generic_supports),
+ FSINFO_VSTRUCT (FSINFO_ATTR_TIMESTAMP_INFO, fsinfo_generic_timestamp_info),
+ FSINFO_STRING (FSINFO_ATTR_VOLUME_ID, string),
+ FSINFO_VSTRUCT (FSINFO_ATTR_VOLUME_UUID, fsinfo_generic_volume_uuid),
+ FSINFO_STRING (FSINFO_ATTR_VOLUME_NAME, string),
+ FSINFO_VSTRUCT_N(FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO, fsinfo_meta_attribute_info),
+ FSINFO_LIST (FSINFO_ATTR_FSINFO_ATTRIBUTES, fsinfo_meta_attributes),
+ {}
+};
+
+static __attribute__((noreturn))
+void bad_value(const char *what,
+ struct fsinfo_params *params,
+ const struct fsinfo_attribute *attr,
+ const struct fsinfo_attribute_info *attr_info,
+ void *reply, unsigned int size)
+{
+ printf("\n");
+ fprintf(stderr, "%s %s{%u}{%u} t=%x f=%x s=%x\n",
+ what, attr->name, params->Nth, params->Mth,
+ attr_info->type, attr_info->flags, attr_info->size);
+ fprintf(stderr, "size=%u\n", size);
+ dump_hex(stderr, reply, 0, size);
+ exit(1);
+}
+
+static void dump_value(unsigned int attr_id,
+ const struct fsinfo_attribute *attr,
+ const struct fsinfo_attribute_info *attr_info,
+ void *reply, unsigned int size)
+{
+ if (!attr || !attr->dump) {
+ printf("<no dumper>\n");
+ return;
+ }
+
+ if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+ printf("<short data %u/%u>\n", size, attr->size);
+ return;
+ }
+
+ attr->dump(reply, size);
+}
+
+static void dump_list(unsigned int attr_id,
+ const struct fsinfo_attribute *attr,
+ const struct fsinfo_attribute_info *attr_info,
+ void *reply, unsigned int size)
+{
+ size_t elem_size = attr_info->size;
+ unsigned int ix = 0;
+
+ printf("\n");
+ if (!attr || !attr->dump) {
+ printf("<no dumper>\n");
+ return;
+ }
+
+ if (attr->type == FSINFO_TYPE_VSTRUCT && size < attr->size) {
+ printf("<short data %u/%u>\n", size, attr->size);
+ return;
+ }
+
+ list_last = false;
+ while (size >= elem_size) {
+ printf("\t[%02x] ", ix);
+ if (size == elem_size)
+ list_last = true;
+ attr->dump(reply, size);
+ reply += elem_size;
+ size -= elem_size;
+ ix++;
+ }
+}
+
+/*
+ * Call fsinfo, expanding the buffer as necessary.
+ */
+static ssize_t get_fsinfo(const char *file, const char *name,
+ struct fsinfo_params *params, void **_r)
+{
+ ssize_t ret;
+ size_t buf_size = 4096;
+ void *r;
+
+ for (;;) {
+ r = malloc(buf_size);
+ if (!r) {
+ perror("malloc");
+ exit(1);
+ }
+ memset(r, 0xbd, buf_size);
+
+ errno = 0;
+ ret = fsinfo(AT_FDCWD, file, params, sizeof(*params), r, buf_size - 1);
+ if (ret == -1)
+ goto error;
+
+ if (ret <= buf_size - 1)
+ break;
+ buf_size = (ret + 4096 - 1) & ~(4096 - 1);
+ }
+
+ if (debug)
+ printf("fsinfo(%s,%s,%u,%u) = %zd\n",
+ file, name, params->Nth, params->Mth, ret);
+
+ ((char *)r)[ret] = 0;
+ *_r = r;
+ return ret;
+
+error:
+ *_r = NULL;
+ free(r);
+ if (debug)
+ printf("fsinfo(%s,%s,%u,%u) = %m\n",
+ file, name, params->Nth, params->Mth);
+ return ret;
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params,
+ const struct fsinfo_attribute_info *attr_info, bool raw)
+{
+ const struct fsinfo_attribute *attr;
+ const char *name;
+ size_t size = 4096;
+ char namebuf[32];
+ void *r;
+
+ for (attr = fsinfo_attributes; attr->name; attr++) {
+ if (attr->attr_id == params->request) {
+ name = attr->name;
+ if (strncmp(name, "fsinfo_generic_", 15) == 0)
+ name += 15;
+ goto found;
+ }
+ }
+
+ sprintf(namebuf, "<unknown-%x>", params->request);
+ name = namebuf;
+ attr = NULL;
+
+found:
+ size = get_fsinfo(file, name, params, &r);
+
+ if (size == -1) {
+ if (errno == ENODATA) {
+ if (!(attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) &&
+ params->Nth == 0 && params->Mth == 0)
+ bad_value("Unexpected ENODATA",
+ params, attr, attr_info, r, size);
+ free(r);
+ return (params->Mth == 0) ? 2 : 1;
+ }
+ if (errno == EOPNOTSUPP) {
+ if (params->Nth > 0 || params->Mth > 0)
+ bad_value("Should return ENODATA",
+ params, attr, attr_info, r, size);
+ //printf("\e[33m%s\e[m: <not supported>\n",
+ // fsinfo_attr_names[attr]);
+ free(r);
+ return 2;
+ }
+ perror(file);
+ exit(1);
+ }
+
+ if (raw) {
+ if (size > 4096)
+ size = 4096;
+ dump_hex(stdout, r, 0, size);
+ free(r);
+ return 0;
+ }
+
+ switch (attr_info->flags & (FSINFO_FLAGS_N | FSINFO_FLAGS_NM)) {
+ case 0:
+ printf("\e[33m%s\e[m: ", name);
+ break;
+ case FSINFO_FLAGS_N:
+ printf("\e[33m%s{%u}\e[m: ", name, params->Nth);
+ break;
+ case FSINFO_FLAGS_NM:
+ printf("\e[33m%s{%u,%u}\e[m: ", name, params->Nth, params->Mth);
+ break;
+ }
+
+ switch (attr_info->type) {
+ case FSINFO_TYPE_STRING:
+ if (size == 0 || ((char *)r)[size - 1] != 0)
+ bad_value("Unterminated string",
+ params, attr, attr_info, r, size);
+ case FSINFO_TYPE_VSTRUCT:
+ case FSINFO_TYPE_OPAQUE:
+ dump_value(params->request, attr, attr_info, r, size);
+ free(r);
+ return 0;
+
+ case FSINFO_TYPE_LIST:
+ dump_list(params->request, attr, attr_info, r, size);
+ free(r);
+ return 0;
+
+ default:
+ bad_value("Fishy type", params, attr, attr_info, r, size);
+ }
+}
+
+static int cmp_u32(const void *a, const void *b)
+{
+ return *(const int *)a - *(const int *)b;
+}
+
+/*
+ *
+ */
+int main(int argc, char **argv)
+{
+ struct fsinfo_attribute_info attr_info;
+ struct fsinfo_params params = {
+ .at_flags = AT_SYMLINK_NOFOLLOW,
+ .flags = FSINFO_FLAGS_QUERY_PATH,
+ };
+ unsigned int *attrs, ret, nr, i;
+ bool meta = false;
+ int raw = 0, opt, Nth, Mth;
+
+ while ((opt = getopt(argc, argv, "Madlr")) != -1) {
+ switch (opt) {
+ case 'M':
+ meta = true;
+ continue;
+ case 'a':
+ params.at_flags |= AT_NO_AUTOMOUNT;
+ params.flags = FSINFO_FLAGS_QUERY_PATH;
+ continue;
+ case 'd':
+ debug = true;
+ continue;
+ case 'l':
+ params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+ params.flags = FSINFO_FLAGS_QUERY_PATH;
+ continue;
+ case 'r':
+ raw = 1;
+ continue;
+ }
+ break;
+ }
+
+ argc -= optind;
+ argv += optind;
+
+ if (argc != 1) {
+ printf("Format: test-fsinfo [-Madlr] <path>\n");
+ exit(2);
+ }
+
+ /* Retrieve a list of supported attribute IDs */
+ params.request = FSINFO_ATTR_FSINFO_ATTRIBUTES;
+ params.Nth = 0;
+ params.Mth = 0;
+ ret = get_fsinfo(argv[0], "attributes", &params, (void **)&attrs);
+ if (ret == -1) {
+ fprintf(stderr, "Unable to get attribute list: %m\n");
+ exit(1);
+ }
+
+ if (ret % sizeof(attrs[0])) {
+ fprintf(stderr, "Bad length of attribute list (0x%x)\n", ret);
+ exit(2);
+ }
+
+ nr = ret / sizeof(attrs[0]);
+ qsort(attrs, nr, sizeof(attrs[0]), cmp_u32);
+
+ if (meta) {
+ printf("ATTR ID TYPE FLAGS SIZE NAME\n");
+ printf("======== ============ ======== ===== =========\n");
+ for (i = 0; i < nr; i++) {
+ params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+ params.Nth = attrs[i];
+ params.Mth = 0;
+ ret = fsinfo(AT_FDCWD, argv[0],
+ &params, sizeof(params),
+ &attr_info, sizeof(attr_info));
+ if (ret == -1) {
+ fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+ exit(1);
+ }
+
+ dump_attribute_info(&attr_info, ret);
+ }
+ exit(0);
+ }
+
+ for (i = 0; i < nr; i++) {
+ params.request = FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO;
+ params.Nth = attrs[i];
+ params.Mth = 0;
+ ret = fsinfo(AT_FDCWD, argv[0],
+ &params, sizeof(params),
+ &attr_info, sizeof(attr_info));
+ if (ret == -1) {
+ fprintf(stderr, "Can't get info for attribute %x: %m\n", attrs[i]);
+ exit(1);
+ }
+
+ if (attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTE_INFO ||
+ attrs[i] == FSINFO_ATTR_FSINFO_ATTRIBUTES)
+ continue;
+
+ if (attrs[i] != attr_info.attr_id) {
+ fprintf(stderr, "ID for %03x returned %03x\n",
+ attrs[i], attr_info.attr_id);
+ break;
+ }
+ Nth = 0;
+ do {
+ Mth = 0;
+ do {
+ params.request = attrs[i];
+ params.Nth = Nth;
+ params.Mth = Mth;
+
+ switch (try_one(argv[0], &params, &attr_info, raw)) {
+ case 0:
+ continue;
+ case 1:
+ goto done_M;
+ case 2:
+ goto done_N;
+ }
+ } while (++Mth < 100);
+
+ done_M:
+ if (Mth >= 100) {
+ fprintf(stderr, "Fishy: Mth %x[%u][%u]\n", attrs[i], Nth, Mth);
+ break;
+ }
+
+ } while (++Nth < 100);
+
+ done_N:
+ if (Nth >= 100) {
+ fprintf(stderr, "Fishy: Nth %x[%u]\n", attrs[i], Nth);
+ break;
+ }
+ }
+
+ return 0;
+}
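As an aside on struct fsinfo_timestamp_one in the uapi header of this patch:
the header comment gives the granularity as mant * 10^exp seconds. The
following is only a sketch of how userspace might turn that pair into
nanoseconds. The helper name granularity_ns() is hypothetical and is not part
of the patch, and the accepted exponent range is an assumption chosen so that
the result fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: convert a
 * (gran_mantissa, gran_exponent) pair into nanoseconds.
 * Granularity in seconds is mantissa * 10^exponent, so in
 * nanoseconds it is mantissa * 10^(exponent + 9).
 */
static uint64_t granularity_ns(uint16_t mantissa, int8_t exponent)
{
	uint64_t ns = mantissa;
	int shift = exponent + 9;

	/*
	 * Assumption of this sketch: reject sub-nanosecond granularity and
	 * anything that would overflow (65535 * 10^14 still fits in 64 bits).
	 */
	assert(shift >= 0 && shift <= 14);

	while (shift-- > 0)
		ns *= 10;
	return ns;
}
```

With this, a mantissa of 1 and an exponent of -9 describes nanosecond
resolution, and a mantissa of 2 with an exponent of 0 describes a two-second
resolution.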
* [PATCH 01/17] fsinfo: Introduce a non-repeating system-unique superblock ID [ver #20]
From: David Howells @ 2020-07-24 13:34 UTC (permalink / raw)
To: torvalds, viro
Cc: dhowells, raven, mszeredi, christian, jannh, darrick.wong, kzak,
jlayton, linux-api, linux-fsdevel, linux-security-module,
linux-kernel
In-Reply-To: <159559768062.2144584.13583793543173131929.stgit@warthog.procyon.org.uk>
Introduce an (effectively) non-repeating system-unique superblock ID that
can be used to determine that two objects are in the same superblock
without needing to worry about the ID changing in the meantime (as is
possible with device IDs).
The counter could also be used to tag other features, such as mount
objects.
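For illustration only, the idea can be modelled in userspace with C11
atomics. This is a sketch, not the kernel code: the patch itself uses
atomic64_inc_return() on vfs_unique_counter, whereas the names
unique_counter and alloc_unique_id() below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-in for the kernel counter; zero-initialised. */
static atomic_uint_fast64_t unique_counter;

/*
 * Hand out the next ID.  Like an increment-and-return, the first caller
 * gets 1, so 0 is never issued and can serve as "no ID" in this sketch.
 */
static uint64_t alloc_unique_id(void)
{
	uint64_t id = atomic_fetch_add(&unique_counter, 1) + 1;

	assert(id != 0);	/* would only trip after 2^64 allocations */
	return id;
}
```

The "effectively non-repeating" property follows from the width of the
counter: even at 10^9 allocations per second, a 64-bit counter takes roughly
584 years to wrap.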
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/internal.h | 1 +
fs/super.c | 2 ++
include/linux/fs.h | 3 +++
3 files changed, 6 insertions(+)
diff --git a/fs/internal.h b/fs/internal.h
index 9b863a7bd708..ea60d864a8cb 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -103,6 +103,7 @@ extern struct file *alloc_empty_file_noaccount(int, const struct cred *);
/*
* super.c
*/
+extern atomic64_t vfs_unique_counter;
extern int reconfigure_super(struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct super_block *user_get_super(dev_t);
diff --git a/fs/super.c b/fs/super.c
index 904459b35119..21ae8afeba3a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -44,6 +44,7 @@ static int thaw_super_locked(struct super_block *sb);
static LIST_HEAD(super_blocks);
static DEFINE_SPINLOCK(sb_lock);
+atomic64_t vfs_unique_counter; /* Unique identifier counter */
static char *sb_writers_name[SB_FREEZE_LEVELS] = {
"sb_writers",
@@ -273,6 +274,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
goto fail;
if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
goto fail;
+ s->s_unique_id = atomic64_inc_return(&vfs_unique_counter);
return s;
fail:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5abba86107d..28a29356eace 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1564,6 +1564,9 @@ struct super_block {
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
+
+ /* Superblock information */
+ u64 s_unique_id;
} __randomize_layout;
/* Helper functions so that in most cases filesystems will
* [PATCH 00/17] VFS: Filesystem information [ver #20]
From: David Howells @ 2020-07-24 13:34 UTC (permalink / raw)
To: torvalds, viro
Cc: Theodore Ts'o, Eric Biggers, linux-api, Darrick J. Wong,
Jeff Layton, Carlos Maiolino, Jeff Layton, Andreas Dilger,
linux-ext4, dhowells, raven, mszeredi, christian, jannh,
darrick.wong, kzak, jlayton, linux-api, linux-fsdevel,
linux-security-module, linux-kernel
Here's a set of patches that adds a system call, fsinfo(), that allows
information about the VFS, mount topology, superblock and files to be
retrieved.
The patchset is based on top of the notifications patchset and allows event
counters implemented in the latter to be retrieved to allow overruns to be
efficiently managed.
=======
THE WHY
=======
Why do we want this?
Using /proc/mounts (or similar) has problems:
(1) Reading from it holds a global lock (namespace_sem) that prevents
mounting and unmounting. Lots of data is encoded and mangled into
text whilst the lock is held, including superblock option strings and
mount point paths. This causes performance problems when there are a
lot of mount objects in a system.
(2) Even though namespace_sem is held during a read, reading the whole
file isn't necessarily atomic with respect to mount-type operations.
If a read isn't satisfied in one go, then it may return to userspace
briefly and then continue reading some way into the file. But changes
can occur in the interval that may then go unseen.
(3) Determining what has changed means parsing and comparing consecutive
outputs of /proc/mounts.
(4) Querying a specific mount or superblock means searching through
/proc/mounts and searching by path or mount ID - but we might have an
fd we want to query.
(5) Whilst you can poll() it for events, it only tells you that something
changed in the namespace, not what or whether you can even see the
change.
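To make point (4) concrete, here is a minimal sketch of what a lookup by
mount point looks like today, using the existing getmntent() interface from
<mntent.h>. The helper name find_mount_fstype() is hypothetical. The sketch
assumes that entries appearing later in the file overmount earlier ones, so
it has to read the whole stream and keep the last match.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan a mounts-format stream for the entry mounted
 * on 'dir' and copy its filesystem type into 'fstype'.
 * Returns 0 if an entry was found, -1 otherwise.
 */
static int find_mount_fstype(FILE *mounts, const char *dir,
			     char *fstype, size_t len)
{
	struct mntent *m;
	int ret = -1;

	assert(mounts && dir && fstype && len > 0);

	/* No early exit: a later line may overmount an earlier one. */
	while ((m = getmntent(mounts)) != NULL) {
		if (strcmp(m->mnt_dir, dir) != 0)
			continue;
		snprintf(fstype, len, "%s", m->mnt_type);
		ret = 0;
	}
	return ret;
}
```

A caller would typically obtain the stream with setmntent("/proc/mounts",
"r") and release it with endmntent(). Note that this only works from a path;
there is no way to start from a file descriptor, which is the gap described
above.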
To fix the notification issues, the preceding notifications patchset added
mount watch notifications whereby you can watch for notifications in a
specific mount subtree. The notification messages include the ID(s) of the
affected mounts.
To support notifications, however, we need to be able to handle overruns in
the notification queue. I added a number of event counters to struct
super_block and struct mount to allow you to pin down the changes, but
there needs to be a way to retrieve them. Exposing them through /proc
would require adding yet another /proc/mounts-type file. We could add
per-mount directories full of attributes in sysfs, but that has issues also
(see below).
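As a rough illustration of how userspace might use such counters after an
overrun, here is a minimal sketch. It assumes the application caches the last
counter value it saw for each mount and has some way to fetch the current
value (for instance through fsinfo()). The structure and function names are
hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace cache entry; the field names are illustrative. */
struct mount_cache_entry {
	uint32_t mnt_id;	/* ID of the mount being tracked */
	uint32_t seen_changes;	/* Counter value when the mount was last scanned */
};

/*
 * Compare a freshly retrieved counter against the cached one.  Returns 1 and
 * updates the cache if the mount changed since the last scan, 0 otherwise.
 * Only (in)equality is tested, so counter wraparound needs no special care.
 */
static int mount_needs_rescan(struct mount_cache_entry *e, uint32_t current)
{
	assert(e != NULL);
	if (e->seen_changes == current)
		return 0;
	e->seen_changes = current;
	return 1;
}
```

After an overrun, only the mounts for which this returns 1 would need to be
rescanned, rather than the whole tree.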
Adding an extensible system call interface for retrieving filesystem
information also allows other things to be exposed:
(1) Jeff Layton's error handling changes need a way to allow error event
information to be retrieved.
(2) Bits in masks returned by things like statx() and FS_IOC_GETFLAGS are
actually 3-state { Set, Unset, Not supported }. It could be useful to
provide a way to expose information like this[*].
(3) Limits of the numerical metadata values in a filesystem[*].
(4) Filesystem capability information[*]. Filesystems don't all have the
same capabilities, and even different instances may have different
capabilities, particularly with network filesystems where the set of
capabilities may be server-dependent. Capabilities might even vary at file
granularity - though possibly such information should be conveyed
through statx() instead.
(5) ID mapping/shifting tables in use for a superblock.
(6) Filesystem-specific information. I need something for AFS so that I
can do pioctl()-emulation, thereby allowing me to implement certain of
the AFS command line utilities that query state of a particular file.
This could also have application for other filesystems, such as NFS,
CIFS and ext4.
[*] In a lot of cases these are probably invariant and can be memcpy'd
from static data.
There's a further consideration: I want to make it possible to have
fsconfig(fd, FSCONFIG_CMD_CREATE) be intercepted by a container manager
such that the manager can supervise a mount attempted inside the container.
The manager would be given an fd pointing to the fs_context struct and
would then need some way to query it (fsinfo()) and modify it (fsconfig()).
This could also be used to arbitrate user-requested mounts when containers
are not in play.
================
DESIGN DECISIONS
================
(1) Information is partitioned into sets of attributes.
(2) Attribute IDs are integers as they're fast to compare.
(3) Attribute values are typed (struct, list of structs, string, opaque
blob). The type is fixed for a particular attribute.
(4) For structure types, the length is also a version. New fields can be
tacked onto the end.
(5) When copying a versioned struct to userspace, the core handles a
version mismatch by truncating or zero-padding the data as necessary.
This is transparent to the filesystem.
(6) The core handles all the buffering and buffer resizing.
(7) The filesystem never gets any access to the userspace parameter buffer
or result buffer.
(8) "Meta" attributes can describe other attributes.
========
OVERVIEW
========
fsinfo() is a system call that allows information about the filesystem at a
particular path point to be queried as a set of attributes.
Attribute values are of four basic types:
(1) Structure with version-dependent length (the length is the version).
(2) Variable-length string.
(3) List of structures (all the same length).
(4) Opaque blob.
Attributes can have multiple values either as a sequence of values or a
sequence-of-sequences of values and all the values of a particular
attribute must be of the same type. Values can be up to INT_MAX size,
subject to memory availability.
Note that the values of an attribute *are* allowed to vary between dentries
within a single superblock, depending on the specific dentry that you're
looking at, but the values still have to be of the type for that attribute.
I've tried to make the interface as light as possible, so integer attribute
IDs rather than string and the core does all the buffer allocation and
expansion and all the extensibility support work rather than leaving that
to the filesystems. This also means that userspace pointers are not
exposed to the filesystem.
fsinfo() allows a variety of information to be retrieved about a filesystem
and the mount topology:
(1) General superblock attributes:
- Filesystem identifiers (UUID, volume label, device numbers, ...)
- The limits on a filesystem's capabilities
- Information on supported statx fields and attributes and IOC flags.
- A variety of single-bit flags indicating supported capabilities.
- Timestamp resolution and range.
- The amount of space/free space in a filesystem (as statfs()).
- Superblock notification counter.
(2) Filesystem-specific superblock attributes:
- Superblock-level timestamps.
- Cell name, workgroup or other netfs grouping concept.
- Server names and addresses.
(3) VFS information:
- Mount topology information.
- Mount attributes.
- Mount notification counter.
- Mount point path.
(4) Information about what the fsinfo() syscall itself supports, including
the type and struct size of attributes.
The system is extensible:
(1) New attributes can be added. There is no requirement that a
filesystem implement every attribute. A helper function is provided
to scan a list of attributes and a filesystem can have multiple such
lists.
(2) Version length-dependent structure attributes can be made larger and
have additional information tacked on the end, provided it keeps the
layout of the existing fields. If an older process asks for a shorter
structure, it will only be given the bits it asks for. If a newer
process asks for a longer structure on an older kernel, the extra
space will be set to 0. In all cases, the size of the data actually
available is returned.
In essence, the size of a structure is that structure's version: a
smaller size is an earlier version and a later version includes
everything that the earlier version did.
(3) New single-bit capability flags can be added. This is a structure-typed
attribute and, as such, (2) applies. Any bits you wanted but the kernel
doesn't support are automatically set to 0.
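The truncate-or-pad rule in (2) can be sketched in plain C as follows. This is
only a userspace model of the behaviour described, not the kernel code:
copy_versioned() is a hypothetical name, and the real core would copy to
userspace rather than use memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the core's handling of versioned structs: copy a
 * value of kernel_size bytes into a caller buffer of user_size bytes.
 */
static size_t copy_versioned(void *user_buf, size_t user_size,
			     const void *kernel_val, size_t kernel_size)
{
	size_t n = kernel_size < user_size ? kernel_size : user_size;

	assert(user_buf != NULL && kernel_val != NULL);

	/* Older caller, newer kernel: only the requested prefix is copied. */
	memcpy(user_buf, kernel_val, n);

	/* Newer caller, older kernel: zero the fields the kernel lacks. */
	if (user_size > n)
		memset((char *)user_buf + n, 0, user_size - n);

	/* In all cases, report the size of the data actually available. */
	return kernel_size;
}
```

Because the return value is the full size the kernel has, a caller can tell
whether its buffer was truncated or padded by comparing it with the size it
passed in.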
fsinfo() may be called like the following, for example:
struct fsinfo_params params = {
.at_flags = AT_SYMLINK_NOFOLLOW,
.flags = FSINFO_FLAGS_QUERY_PATH,
.request = FSINFO_ATTR_AFS_SERVER_ADDRESSES,
.Nth = 2,
};
struct fsinfo_server_address address;
len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc",
&params, sizeof(params),
&address, sizeof(address));
The above example would query an AFS filesystem to retrieve the address
list for the 3rd server, and:
struct fsinfo_params params = {
.at_flags = AT_SYMLINK_NOFOLLOW,
.flags = FSINFO_FLAGS_QUERY_PATH,
.request = FSINFO_ATTR_NFS_SERVER_NAME,
};
char server_name[256];
len = fsinfo(AT_FDCWD, "/home/dhowells/",
&params, sizeof(params),
&server_name, sizeof(server_name));
would retrieve the name of the NFS server as a string.
In future, I want to make fsinfo() capable of querying a context created by
fsopen() or fspick(), e.g.:
fd = fsopen("ext4", 0);
struct fsinfo_params params = {
.flags = FSINFO_FLAGS_QUERY_FSCONTEXT,
.request = FSINFO_ATTR_CONFIGURATION,
};
char buffer[65536];
fsinfo(fd, NULL, &params, sizeof(params), &buffer, sizeof(buffer));
even if that context doesn't currently have a superblock attached.
The patches can be found here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
on branch:
fsinfo-core
===================
SIGNIFICANT CHANGES
===================
ver #20:
(*) Changed MOUNT_PROPAGATION_SLAVE to MOUNT_PROPAGATION_DEPENDENT and
renamed the fields in the fsinfo_mount_topology struct. The
MOUNT_PROPAGATION_* settings have been turned into an enum and will
also be passed to mount_setattr().
(*) Adjusted the Ext4 patch from feedback and removed the example status
from it.
(*) Dropped the NFS patch.
(*) I've dropped the superblock notifications for now.
ver #19:
(*) Split FSINFO_ATTR_MOUNT_TOPOLOGY from FSINFO_ATTR_MOUNT_INFO. The
latter requires no locking as it looks no further than the mount
object it's dealing with. The topology attribute, however, has to
take the namespace lock. That said, the info attribute includes a
counter that indicates how many times a mount object's position in the
topology has changed.
(*) A bit of patch rearrangement to put the mount topology-exposing
attributes into one patch.
(*) Pass both AT_* and RESOLVE_* flags to fsinfo() as suggested by Linus,
rather than adding missing RESOLVE_* flags.
ver #18:
(*) Moved the mount and superblock notification patches into a different
branch.
(*) Made superblock configuration (->show_opts), bindmount path
(->show_path) and filesystem statistics (->show_stats) available as
the CONFIGURATION, MOUNT_PATH and FS_STATISTICS attributes.
(*) Made mountpoint device name available, filtered through the superblock
(->show_devname), as the SOURCE attribute.
(*) Made the mountpoint available as a full path as well as a relative
one.
(*) Added more event counters to MOUNT_INFO, including a subtree
notification counter, to make it easier to clean up after a
notification overrun.
(*) Made the event counter value returned by MOUNT_CHILDREN the sum of the
five event counters.
(*) Added a mount uniquifier and added that to the MOUNT_CHILDREN entries
also so that mount ID reuse can be detected.
(*) Merged the SB_NOTIFICATION attribute into the MOUNT_INFO attribute to
avoid duplicate information.
(*) Switched to using the RESOLVE_* flags rather than AT_* flags for
pathwalk control. Added more RESOLVE_* flags.
(*) Used a lock instead of RCU to enumerate children for the
MOUNT_CHILDREN attribute for safety. This is probably worth
revisiting at a later date, however.
David
---
David Howells (14):
fsinfo: Introduce a non-repeating system-unique superblock ID
fsinfo: Add fsinfo() syscall to query filesystem information
fsinfo: Provide a bitmap of the features a filesystem supports
fsinfo: Allow retrieval of superblock devname, options and stats
fsinfo: Allow fsinfo() to look up a mount object by ID
fsinfo: Add a uniquifier ID to struct mount
fsinfo: Allow mount information to be queried
fsinfo: Allow mount topology and propagation info to be retrieved
fsinfo: Provide notification overrun handling support
fsinfo: sample: Mount listing program
fsinfo: Add API documentation
fsinfo: Add support for AFS
fsinfo: Add support to ext4
fsinfo: Add an attribute that lists all the visible mounts in a namespace
Jeff Layton (3):
errseq: add a new errseq_scrape function
vfs: allow fsinfo to fetch the current state of s_wb_err
samples: add error state information to test-fsinfo.c
Documentation/filesystems/fsinfo.rst | 574 +++++++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/Kconfig | 7 +
fs/Makefile | 1 +
fs/afs/internal.h | 1 +
fs/afs/super.c | 216 ++++-
fs/d_path.c | 2 +-
fs/ext4/Makefile | 1 +
fs/ext4/ext4.h | 6 +
fs/ext4/fsinfo.c | 97 +++
fs/ext4/super.c | 3 +
fs/fsinfo.c | 748 +++++++++++++++++
fs/internal.h | 15 +
fs/mount.h | 3 +
fs/mount_notify.c | 2 +
fs/namespace.c | 427 +++++++++-
include/linux/errseq.h | 1 +
include/linux/fs.h | 4 +
include/linux/fsinfo.h | 112 +++
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fsinfo.h | 345 ++++++++
include/uapi/linux/mount.h | 13 +-
kernel/sys_ni.c | 1 +
lib/errseq.c | 33 +-
samples/vfs/Makefile | 6 +-
samples/vfs/test-fsinfo.c | 881 ++++++++++++++++++++
samples/vfs/test-mntinfo.c | 277 ++++++
44 files changed, 3791 insertions(+), 11 deletions(-)
create mode 100644 Documentation/filesystems/fsinfo.rst
create mode 100644 fs/ext4/fsinfo.c
create mode 100644 fs/fsinfo.c
create mode 100644 include/linux/fsinfo.h
create mode 100644 include/uapi/linux/fsinfo.h
create mode 100644 samples/vfs/test-fsinfo.c
create mode 100644 samples/vfs/test-mntinfo.c
* [PATCH 4/4] watch_queue: sample: Display mount tree change notifications
From: David Howells @ 2020-07-24 13:11 UTC (permalink / raw)
To: viro
Cc: dhowells, torvalds, casey, sds, nicolas.dichtel, raven, christian,
jlayton, kzak, mszeredi, linux-api, linux-fsdevel,
linux-security-module, linux-kernel
In-Reply-To: <159559628247.2141315.2107013106060144287.stgit@warthog.procyon.org.uk>
This is run like:
./watch_test
and watches "/" for changes to the mount topology and the attributes of
individual mount objects.
# mount -t tmpfs none /mnt
# mount -o remount,ro /mnt
# mount -o remount,rw /mnt
producing:
# ./watch_test
read() = 16
NOTIFY[000]: ty=000002 sy=00 i=02000010
MOUNT 00000060 change=0[new_mount] aux=416
read() = 16
NOTIFY[000]: ty=000002 sy=04 i=02010010
MOUNT 000001a0 change=4[setattr] aux=0
read() = 16
NOTIFY[000]: ty=000002 sy=04 i=02010010
MOUNT 000001a0 change=4[setattr] aux=0
Signed-off-by: David Howells <dhowells@redhat.com>
---
samples/watch_queue/watch_test.c | 44 +++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)
diff --git a/samples/watch_queue/watch_test.c b/samples/watch_queue/watch_test.c
index 46e618a897fe..b526de016de4 100644
--- a/samples/watch_queue/watch_test.c
+++ b/samples/watch_queue/watch_test.c
@@ -26,6 +26,9 @@
#ifndef __NR_keyctl
#define __NR_keyctl -1
#endif
+#ifndef __NR_watch_mount
+#define __NR_watch_mount -1
+#endif
#define BUF_SIZE 256
@@ -58,6 +61,32 @@ static void saw_key_change(struct watch_notification *n, size_t len)
k->key_id, n->subtype, key_subtypes[n->subtype], k->aux);
}
+static const char *mount_subtypes[256] = {
+ [NOTIFY_MOUNT_NEW_MOUNT] = "new_mount",
+ [NOTIFY_MOUNT_UNMOUNT] = "unmount",
+ [NOTIFY_MOUNT_EXPIRY] = "expiry",
+ [NOTIFY_MOUNT_READONLY] = "readonly",
+ [NOTIFY_MOUNT_SETATTR] = "setattr",
+ [NOTIFY_MOUNT_MOVE_FROM] = "move_from",
+ [NOTIFY_MOUNT_MOVE_TO] = "move_to",
+};
+
+static void saw_mount_change(struct watch_notification *n, size_t len)
+{
+ struct mount_notification *m = (struct mount_notification *)n;
+
+ if (len != sizeof(struct mount_notification))
+ return;
+
+ printf("MOUNT %08x change=%u[%s] aux=%u ctr=%x,%x actr=%x\n",
+ m->triggered_on, n->subtype, mount_subtypes[n->subtype],
+ m->auxiliary_mount,
+ m->topology_changes,
+ m->attr_changes,
+ m->aux_topology_changes);
+
+}
+
/*
* Consume and display events.
*/
@@ -134,6 +163,9 @@ static void consumer(int fd)
default:
printf("other type\n");
break;
+ case WATCH_TYPE_MOUNT_NOTIFY:
+ saw_mount_change(&n.n, len);
+ break;
}
p += len;
@@ -142,12 +174,17 @@ static void consumer(int fd)
}
static struct watch_notification_filter filter = {
- .nr_filters = 1,
+ .nr_filters = 2,
.filters = {
[0] = {
.type = WATCH_TYPE_KEY_NOTIFY,
.subtype_filter[0] = UINT_MAX,
},
+ [1] = {
+ .type = WATCH_TYPE_MOUNT_NOTIFY,
+ // Reject move-from notifications
+ .subtype_filter[0] = UINT_MAX & ~(1 << NOTIFY_MOUNT_MOVE_FROM),
+ },
},
};
@@ -181,6 +218,11 @@ int main(int argc, char **argv)
exit(1);
}
+ if (syscall(__NR_watch_mount, AT_FDCWD, "/", 0, fd, 0xde) == -1) {
+ perror("watch_mount");
+ exit(1);
+ }
+
consumer(fd);
exit(0);
}
* [PATCH 3/4] watch_queue: Implement mount topology and attribute change notifications
From: David Howells @ 2020-07-24 13:11 UTC (permalink / raw)
To: viro
Cc: dhowells, torvalds, casey, sds, nicolas.dichtel, raven, christian,
jlayton, kzak, mszeredi, linux-api, linux-fsdevel,
linux-security-module, linux-kernel
In-Reply-To: <159559628247.2141315.2107013106060144287.stgit@warthog.procyon.org.uk>
Add a mount notification facility whereby notifications about changes in
mount topology and configuration can be received. Note that this only
covers vfsmount topology changes and not superblock events. A separate
facility will be added for that.
Every mount is given a change counter that counts the number of topological
rearrangements in which it is involved and the number of attribute changes
it undergoes. This allows notification loss to be dealt with. Later
patches will provide a way to quickly retrieve this value, along with
information about topology and parameters for the superblock.
Firstly, a watch queue needs to be created:
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);
then a notification can be set up to report notifications via that queue:
struct watch_notification_filter filter = {
.nr_filters = 1,
.filters = {
[0] = {
.type = WATCH_TYPE_MOUNT_NOTIFY,
.subtype_filter[0] = UINT_MAX,
},
},
};
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
watch_mount(AT_FDCWD, "/", 0, fds[1], 0x02);
In this case, it would let me monitor the mount topology subtree rooted at
"/" for events. Mount notifications propagate up the tree towards the
root, so a watch will catch all of the events happening in the subtree
rooted at the watch.
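The subtype_filter words used in the filter above form a bitmask with one bit
per notification subtype. The sketch below shows one way to build such a mask.
It assumes eight 32-bit words (256 subtypes), which is believed to match the
uapi filter structure but should be checked against the header. The helper
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SUBTYPE_WORDS 8	/* Assumed: 8 x 32 bits = 256 possible subtypes */

/* Accept every subtype by setting all the bits. */
static void filter_accept_all(uint32_t f[SUBTYPE_WORDS])
{
	memset(f, 0xff, SUBTYPE_WORDS * sizeof(f[0]));
}

/* Reject one subtype by clearing its bit. */
static void filter_reject(uint32_t f[SUBTYPE_WORDS], unsigned int subtype)
{
	assert(subtype < SUBTYPE_WORDS * 32);
	f[subtype / 32] &= ~(UINT32_C(1) << (subtype % 32));
}

/* Test whether a subtype would pass the filter. */
static int filter_accepts(const uint32_t f[SUBTYPE_WORDS], unsigned int subtype)
{
	assert(subtype < SUBTYPE_WORDS * 32);
	return (f[subtype / 32] >> (subtype % 32)) & 1;
}
```

With these helpers, filter_accept_all() followed by filter_reject() for a
single subtype gives the same first word as UINT_MAX & ~(1 << subtype).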
After setting the watch, records will be placed into the queue when, for
example, a superblock switches between read-write and read-only. Records
are of the following format:
struct mount_notification {
struct watch_notification watch;
__u32 triggered_on;
__u32 auxiliary_mount;
__u32 topology_changes;
__u32 attr_changes;
__u32 aux_topology_changes;
} *n;
Where:
n->watch.type will be WATCH_TYPE_MOUNT_NOTIFY.
n->watch.subtype will indicate the type of event, such as
NOTIFY_MOUNT_NEW_MOUNT.
n->watch.info & WATCH_INFO_LENGTH will indicate the length of the
record.
n->watch.info & WATCH_INFO_ID will be the fifth argument to
watch_mount(), shifted.
n->watch.info & NOTIFY_MOUNT_IN_SUBTREE if true indicates that the
notification was generated in the mount subtree rooted at the
watch, and not actually in the watch itself.
n->watch.info & NOTIFY_MOUNT_IS_RECURSIVE if true indicates that
the notification was generated by an event (e.g. SETATTR) that was
applied recursively. The notification is only generated for the
object that initially triggered it.
n->watch.info & NOTIFY_MOUNT_IS_NOW_RO will be used for
NOTIFY_MOUNT_READONLY, being set if the mount becomes R/O, and
being cleared otherwise, and for NOTIFY_MOUNT_NEW_MOUNT, being set
if the new mount is readonly.
n->watch.info & NOTIFY_MOUNT_IS_SUBMOUNT if true indicates that the
NOTIFY_MOUNT_NEW_MOUNT notification is in response to a mount
performed by the kernel (e.g. an automount).
n->triggered_on indicates the ID of the mount to which the change
was accounted (e.g. the new parent of a new mount).
n->auxiliary_mount indicates the ID of an additional mount that was
affected (e.g. a new mount itself) or 0.
n->topology_changes provides the value of the topology change
counter of the triggered-on mount at the conclusion of the
operation.
n->attr_changes provides the value of the attribute change counter
of the triggered-on mount at the conclusion of the operation.
n->aux_topology_changes provides the value of the topology change
counter of the auxiliary mount at the conclusion of the operation.
Note that it is permissible for event records to be of variable length -
or, at least, the length may be dependent on the subtype. Note also that
the queue can be shared between multiple notifications of various types.
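To make the layout of the info word concrete, here is a minimal sketch that
splits it into the record length, the watch ID and the remaining bits. The
mask values are local copies that are assumed to match WATCH_INFO_LENGTH and
WATCH_INFO_ID in the uapi header. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the masks; values assumed to match the uapi header. */
#define INFO_LENGTH_MASK	0x0000007fu
#define INFO_ID_MASK		0x0000ff00u
#define INFO_ID_SHIFT		8

struct decoded_info {
	unsigned int length;	/* Record length in bytes */
	unsigned int id;	/* ID given when the watch was set */
	uint32_t other_bits;	/* Everything else, e.g. type-specific flags */
};

static struct decoded_info decode_watch_info(uint32_t info)
{
	struct decoded_info d;

	d.length = info & INFO_LENGTH_MASK;
	d.id = (info & INFO_ID_MASK) >> INFO_ID_SHIFT;
	d.other_bits = info & ~(INFO_LENGTH_MASK | INFO_ID_MASK);

	/* Sanity check: the three fields together cover the whole word. */
	assert((d.length | (d.id << INFO_ID_SHIFT) | d.other_bits) == info);
	return d;
}
```

Under these assumptions, an info word of 0x02010010 decodes to a 16-byte
record with watch ID 0.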
Signed-off-by: David Howells <dhowells@redhat.com>
---
Documentation/watch_queue.rst | 12 +
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd32.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
fs/Kconfig | 9 +
fs/Makefile | 1
fs/mount.h | 21 ++
fs/mount_notify.c | 228 +++++++++++++++++++++++++++
fs/namespace.c | 22 +++
include/linux/dcache.h | 1
include/linux/syscalls.h | 2
include/uapi/asm-generic/unistd.h | 4
include/uapi/linux/watch_queue.h | 36 ++++
kernel/sys_ni.c | 3
28 files changed, 354 insertions(+), 3 deletions(-)
create mode 100644 fs/mount_notify.c
diff --git a/Documentation/watch_queue.rst b/Documentation/watch_queue.rst
index 849fad6893ef..3e647992be31 100644
--- a/Documentation/watch_queue.rst
+++ b/Documentation/watch_queue.rst
@@ -8,6 +8,7 @@ opened by userspace. This can be used in conjunction with::
* Key/keyring notifications
+ * Mount notifications.
The notifications buffers can be enabled by:
@@ -233,6 +234,11 @@ Any particular buffer can be fed from multiple sources. Sources include:
See Documentation/security/keys/core.rst for more information.
+ * WATCH_TYPE_MOUNT_NOTIFY
+
+ Notifications of this type indicate changes to mount attributes and the
+ mount topology within the subtree at the indicated point.
+
Event Filtering
===============
@@ -292,9 +298,10 @@ A buffer is created with something like the following::
pipe2(fds, O_TMPFILE);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256);
-It can then be set to receive keyring change notifications::
+It can then be set to receive notifications::
keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
+ watch_mount(AT_FDCWD, "/", 0, fds[1], 0x02);
The notifications can then be consumed by something like the following::
@@ -331,6 +338,9 @@ The notifications can then be consumed by something like the following::
case WATCH_TYPE_KEY_NOTIFY:
saw_key_change(&n.n);
break;
+ case WATCH_TYPE_MOUNT_NOTIFY:
+ saw_mount_change(&n.n);
+ break;
}
p += len;
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 5ddd128d4b7a..b6cf8403da35 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -478,3 +478,4 @@
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
+550 common watch_mount sys_watch_mount
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..27cc1f53f4a0 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..4f9cf98cdf0f 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_watch_mount 440
+__SYSCALL(__NR_watch_mount, sys_watch_mount)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 49e325b604b3..fc6d87903781 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -359,3 +359,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f71b1bbcc198..c671aa0e4d25 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index edacc4561f2b..65cc53f129ef 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index f777141f5256..7f034a239930 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -377,3 +377,4 @@
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
+440 n32 watch_mount sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index da8c76394e17..d39b90de3642 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -353,3 +353,4 @@
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
+440 n64 watch_mount sys_watch_mount
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 13280625d312..09f426cb45b1 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -426,3 +426,4 @@
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
+440 o32 watch_mount sys_watch_mount
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 5a758fa6ec52..52ff3454baa1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index f833a3190822..10b7ed3c7a1b 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -528,3 +528,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index bfdcb7633957..86f317bf52df 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount sys_watch_mount
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index acc35daa1b79..0bb0f0b372c7 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8004a276cb74..369ab65c1e9a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -484,3 +484,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..e760ba92c58d 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 watch_mount sys_watch_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..5b58621d4f75 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 69d0d73876b3..5b28ee39f70f 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -409,3 +409,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common watch_mount sys_watch_mount
diff --git a/fs/Kconfig b/fs/Kconfig
index a88aa3af73c1..1a55e56d5c54 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -117,6 +117,15 @@ source "fs/verity/Kconfig"
source "fs/notify/Kconfig"
+config MOUNT_NOTIFICATIONS
+ bool "Mount topology change notifications"
+ select WATCH_QUEUE
+ help
+ This option provides support for getting change notifications on the
+ mount tree topology. This makes use of the /dev/watch_queue misc
+ device to handle the notification buffer and provides the
+ mount_notify() system call to enable/disable watchpoints.
+
source "fs/quota/Kconfig"
source "fs/autofs/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 2ce5112b02c8..dd0d87e2ef19 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -22,6 +22,7 @@ obj-y += no-block.o
endif
obj-$(CONFIG_PROC_FS) += proc_namespace.o
+obj-$(CONFIG_MOUNT_NOTIFICATIONS) += mount_notify.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/mount.h b/fs/mount.h
index c7abb7b394d8..1c777f651446 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -4,6 +4,7 @@
#include <linux/poll.h>
#include <linux/ns_common.h>
#include <linux/fs_pin.h>
+#include <linux/watch_queue.h>
struct mnt_namespace {
atomic_t count;
@@ -78,6 +79,12 @@ struct mount {
int mnt_expiry_mark; /* true if marked for expiry */
struct hlist_head mnt_pins;
struct hlist_head mnt_stuck_children;
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+ atomic_t mnt_topology_changes; /* Number of topology changes applied */
+ atomic_t mnt_attr_changes; /* Number of attribute changes applied */
+ atomic_t mnt_subtree_notifications; /* Number of notifications in subtree */
+ struct watch_list *mnt_watchers; /* Watches on dentries within this mount */
+#endif
} __randomize_layout;
#define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -159,3 +166,17 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
}
extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
+
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+extern void notify_mount(struct mount *triggered,
+ struct mount *aux,
+ enum mount_notification_subtype subtype,
+ u32 info_flags);
+#else
+static inline void notify_mount(struct mount *triggered,
+ struct mount *aux,
+ enum mount_notification_subtype subtype,
+ u32 info_flags)
+{
+}
+#endif
diff --git a/fs/mount_notify.c b/fs/mount_notify.c
new file mode 100644
index 000000000000..365aac5fa746
--- /dev/null
+++ b/fs/mount_notify.c
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Provide mount topology/attribute change notifications.
+ *
+ * Copyright (C) 2019 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include "mount.h"
+
+/*
+ * Post mount notifications to all watches going rootwards along the tree.
+ *
+ * Must be called with the mount_lock held.
+ */
+static void post_mount_notification(struct mount *changed,
+ struct mount_notification *notify)
+{
+ const struct cred *cred = current_cred();
+ struct path cursor;
+ struct mount *mnt;
+ unsigned seq;
+
+ seq = 0;
+ rcu_read_lock();
+restart:
+ cursor.mnt = &changed->mnt;
+ cursor.dentry = changed->mnt.mnt_root;
+ mnt = real_mount(cursor.mnt);
+ notify->watch.info &= ~NOTIFY_MOUNT_IN_SUBTREE;
+
+ read_seqbegin_or_lock(&rename_lock, &seq);
+ for (;;) {
+ if (mnt->mnt_watchers &&
+ !hlist_empty(&mnt->mnt_watchers->watchers)) {
+ if (cursor.dentry->d_flags & DCACHE_MOUNT_WATCH)
+ post_watch_notification(mnt->mnt_watchers,
+ &notify->watch, cred,
+ (unsigned long)cursor.dentry);
+ } else {
+ cursor.dentry = mnt->mnt.mnt_root;
+ }
+ notify->watch.info |= NOTIFY_MOUNT_IN_SUBTREE;
+
+ if (cursor.dentry == cursor.mnt->mnt_root ||
+ IS_ROOT(cursor.dentry)) {
+ struct mount *parent = READ_ONCE(mnt->mnt_parent);
+
+ /* Escaped? */
+ if (cursor.dentry != cursor.mnt->mnt_root)
+ break;
+
+ /* Global root? */
+ if (mnt == parent)
+ break;
+
+ cursor.dentry = READ_ONCE(mnt->mnt_mountpoint);
+ mnt = parent;
+ cursor.mnt = &mnt->mnt;
+ atomic_inc(&mnt->mnt_subtree_notifications);
+ } else {
+ cursor.dentry = cursor.dentry->d_parent;
+ }
+ }
+
+ if (need_seqretry(&rename_lock, seq)) {
+ seq = 1;
+ goto restart;
+ }
+
+ done_seqretry(&rename_lock, seq);
+ rcu_read_unlock();
+}
+
+/*
+ * Generate a mount notification.
+ */
+void notify_mount(struct mount *trigger,
+ struct mount *aux,
+ enum mount_notification_subtype subtype,
+ u32 info_flags)
+{
+
+ struct mount_notification n;
+
+ memset(&n, 0, sizeof(n));
+ n.watch.type = WATCH_TYPE_MOUNT_NOTIFY;
+ n.watch.subtype = subtype;
+ n.watch.info = info_flags | watch_sizeof(n);
+ n.triggered_on = trigger->mnt_id;
+
+ switch (subtype) {
+ case NOTIFY_MOUNT_EXPIRY:
+ case NOTIFY_MOUNT_READONLY:
+ case NOTIFY_MOUNT_SETATTR:
+ n.topology_changes = atomic_read(&trigger->mnt_topology_changes);
+ n.attr_changes = atomic_inc_return(&trigger->mnt_attr_changes);
+ break;
+
+ case NOTIFY_MOUNT_NEW_MOUNT:
+ case NOTIFY_MOUNT_UNMOUNT:
+ case NOTIFY_MOUNT_MOVE_FROM:
+ case NOTIFY_MOUNT_MOVE_TO:
+ n.auxiliary_mount = aux->mnt_id;
+ n.attr_changes = atomic_read(&trigger->mnt_attr_changes);
+ n.topology_changes = atomic_inc_return(&trigger->mnt_topology_changes);
+ n.aux_topology_changes = atomic_inc_return(&aux->mnt_topology_changes);
+ break;
+
+ default:
+ BUG();
+ }
+
+ post_mount_notification(trigger, &n);
+}
+
+static void release_mount_watch(struct watch *watch)
+{
+ struct dentry *dentry = (struct dentry *)(unsigned long)watch->id;
+
+ dput(dentry);
+}
+
+/**
+ * sys_watch_mount - Watch for mount topology/attribute changes
+ * @dfd: Base directory to pathwalk from or fd referring to mount.
+ * @filename: Path to mount to place the watch upon
+ * @at_flags: Pathwalk control flags
+ * @watch_fd: The watch queue to send notifications to.
+ * @watch_id: The watch ID to be placed in the notification (-1 to remove watch)
+ */
+SYSCALL_DEFINE5(watch_mount,
+ int, dfd,
+ const char __user *, filename,
+ unsigned int, at_flags,
+ int, watch_fd,
+ int, watch_id)
+{
+ struct watch_queue *wqueue;
+ struct watch_list *wlist = NULL;
+ struct watch *watch = NULL;
+ struct mount *m;
+ struct path path;
+ unsigned int lookup_flags =
+ LOOKUP_DIRECTORY | LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ int ret;
+
+ if (watch_id < -1 || watch_id > 0xff)
+ return -EINVAL;
+ if ((at_flags & ~(AT_NO_AUTOMOUNT | AT_EMPTY_PATH)) != 0)
+ return -EINVAL;
+ if (at_flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (at_flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(dfd, filename, lookup_flags, &path);
+ if (ret)
+ return ret;
+
+ ret = inode_permission(path.dentry->d_inode, MAY_EXEC);
+ if (ret)
+ goto err_path;
+
+ wqueue = get_watch_queue(watch_fd);
+ ret = PTR_ERR_OR_ZERO(wqueue);
+ if (ret)
+ goto err_path;
+ m = real_mount(path.mnt);
+
+ if (watch_id >= 0) {
+ ret = -ENOMEM;
+ if (!READ_ONCE(m->mnt_watchers)) {
+ wlist = kzalloc(sizeof(*wlist), GFP_KERNEL);
+ if (!wlist)
+ goto err_wqueue;
+ init_watch_list(wlist, release_mount_watch);
+ }
+
+ watch = kzalloc(sizeof(*watch), GFP_KERNEL);
+ if (!watch)
+ goto err_wlist;
+
+ init_watch(watch, wqueue);
+ watch->id = (unsigned long)path.dentry;
+ watch->info_id = (u32)watch_id << WATCH_INFO_ID__SHIFT;
+
+ ret = security_watch_mount(watch, &path);
+ if (ret < 0)
+ goto err_watch;
+
+ down_write(&m->mnt.mnt_sb->s_umount);
+ if (!m->mnt_watchers) {
+ m->mnt_watchers = wlist;
+ wlist = NULL;
+ }
+
+ ret = add_watch_to_object(watch, m->mnt_watchers);
+ if (ret == 0) {
+ spin_lock(&path.dentry->d_lock);
+ path.dentry->d_flags |= DCACHE_MOUNT_WATCH;
+ spin_unlock(&path.dentry->d_lock);
+ dget(path.dentry);
+ watch = NULL;
+ }
+ up_write(&m->mnt.mnt_sb->s_umount);
+ } else {
+ down_write(&m->mnt.mnt_sb->s_umount);
+ ret = remove_watch_from_object(m->mnt_watchers, wqueue,
+ (unsigned long)path.dentry,
+ false);
+ up_write(&m->mnt.mnt_sb->s_umount);
+ }
+
+err_watch:
+ kfree(watch);
+err_wlist:
+ kfree(wlist);
+err_wqueue:
+ put_watch_queue(wqueue);
+err_path:
+ path_put(&path);
+ return ret;
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index 4a0f600a3328..73ff5bf0c9af 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -498,6 +498,9 @@ static int mnt_make_readonly(struct mount *mnt)
smp_wmb();
mnt->mnt.mnt_flags &= ~MNT_WRITE_HOLD;
unlock_mount_hash();
+ if (ret == 0)
+ notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY,
+ NOTIFY_MOUNT_IS_NOW_RO);
return ret;
}
@@ -506,6 +509,7 @@ static int __mnt_unmake_readonly(struct mount *mnt)
lock_mount_hash();
mnt->mnt.mnt_flags &= ~MNT_READONLY;
unlock_mount_hash();
+ notify_mount(mnt, NULL, NOTIFY_MOUNT_READONLY, 0);
return 0;
}
@@ -835,6 +839,7 @@ static struct mountpoint *unhash_mnt(struct mount *mnt)
*/
static void umount_mnt(struct mount *mnt)
{
+ notify_mount(mnt->mnt_parent, mnt, NOTIFY_MOUNT_UNMOUNT, 0);
put_mountpoint(unhash_mnt(mnt));
}
@@ -1175,6 +1180,11 @@ static void mntput_no_expire(struct mount *mnt)
mnt->mnt.mnt_flags |= MNT_DOOMED;
rcu_read_unlock();
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+ if (mnt->mnt_watchers)
+ remove_watch_list(mnt->mnt_watchers, mnt->mnt_id);
+#endif
+
list_del(&mnt->mnt_instance);
if (unlikely(!list_empty(&mnt->mnt_mounts))) {
@@ -1503,6 +1513,7 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
p = list_first_entry(&tmp_list, struct mount, mnt_list);
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
+
ns = p->mnt_ns;
if (ns) {
ns->mounts--;
@@ -2137,7 +2148,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
}
if (moving) {
unhash_mnt(source_mnt);
+ notify_mount(source_mnt->mnt_parent, source_mnt,
+ NOTIFY_MOUNT_MOVE_FROM, 0);
attach_mnt(source_mnt, dest_mnt, dest_mp);
+ notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_MOVE_TO, 0);
touch_mnt_namespace(source_mnt->mnt_ns);
} else {
if (source_mnt->mnt_ns) {
@@ -2146,6 +2160,11 @@ static int attach_recursive_mnt(struct mount *source_mnt,
}
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt);
+ notify_mount(dest_mnt, source_mnt, NOTIFY_MOUNT_NEW_MOUNT,
+ (source_mnt->mnt.mnt_sb->s_flags & SB_RDONLY ?
+ NOTIFY_MOUNT_IS_NOW_RO : 0) |
+ (source_mnt->mnt.mnt_sb->s_flags & SB_SUBMOUNT ?
+ NOTIFY_MOUNT_IS_SUBMOUNT : 0));
}
hlist_for_each_entry_safe(child, n, &tree_list, mnt_hash) {
@@ -2522,6 +2541,8 @@ static void set_mount_attributes(struct mount *mnt, unsigned int mnt_flags)
mnt->mnt.mnt_flags = mnt_flags;
touch_mnt_namespace(mnt->mnt_ns);
unlock_mount_hash();
+ notify_mount(mnt, NULL, NOTIFY_MOUNT_SETATTR,
+ (mnt_flags & SB_RDONLY ? NOTIFY_MOUNT_IS_NOW_RO : 0));
}
static void mnt_warn_timestamp_expiry(struct path *mountpoint, struct vfsmount *mnt)
@@ -2992,6 +3013,7 @@ void mark_mounts_for_expiry(struct list_head *mounts)
propagate_mount_busy(mnt, 1))
continue;
list_move(&mnt->mnt_expire, &graveyard);
+ notify_mount(mnt, NULL, NOTIFY_MOUNT_EXPIRY, 0);
}
while (!list_empty(&graveyard)) {
mnt = list_first_entry(&graveyard, struct mount, mnt_expire);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a81f0c3cf352..a94c551c62a3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_MOUNT_WATCH 0x80000000 /* There's a mount watch here */
extern seqlock_t rename_lock;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..88d03fd627ab 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_watch_mount(int dfd, const char __user *path,
+ unsigned int at_flags, int watch_fd, int watch_id);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..fcdca8c7d30a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_watch_mount 440
+__SYSCALL(__NR_watch_mount, sys_watch_mount)
#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
index c3d8320b5d3a..6b6cd2afc590 100644
--- a/include/uapi/linux/watch_queue.h
+++ b/include/uapi/linux/watch_queue.h
@@ -14,7 +14,8 @@
enum watch_notification_type {
WATCH_TYPE_META = 0, /* Special record */
WATCH_TYPE_KEY_NOTIFY = 1, /* Key change event notification */
- WATCH_TYPE__NR = 2
+ WATCH_TYPE_MOUNT_NOTIFY = 2, /* Mount topology change notification */
+ WATCH_TYPE__NR = 3
};
enum watch_meta_notification_subtype {
@@ -101,4 +102,37 @@ struct key_notification {
__u32 aux; /* Per-type auxiliary data */
};
+/*
+ * Type of mount topology change notification.
+ */
+enum mount_notification_subtype {
+ NOTIFY_MOUNT_NEW_MOUNT = 0, /* New mount added */
+ NOTIFY_MOUNT_UNMOUNT = 1, /* Mount removed manually */
+ NOTIFY_MOUNT_EXPIRY = 2, /* Automount expired */
+ NOTIFY_MOUNT_READONLY = 3, /* Mount R/O state changed */
+ NOTIFY_MOUNT_SETATTR = 4, /* Mount attributes changed */
+ NOTIFY_MOUNT_MOVE_FROM = 5, /* Mount moved from here */
+ NOTIFY_MOUNT_MOVE_TO = 6, /* Mount moved to here (compare op_id) */
+};
+
+#define NOTIFY_MOUNT_IN_SUBTREE WATCH_INFO_FLAG_0 /* Event not actually at watched dentry */
+#define NOTIFY_MOUNT_IS_RECURSIVE WATCH_INFO_FLAG_1 /* Change applied recursively */
+#define NOTIFY_MOUNT_IS_NOW_RO WATCH_INFO_FLAG_2 /* Mount changed to R/O */
+#define NOTIFY_MOUNT_IS_SUBMOUNT WATCH_INFO_FLAG_3 /* New mount is submount */
+
+/*
+ * Mount topology/configuration change notification record.
+ * - watch.type = WATCH_TYPE_MOUNT_NOTIFY
+ * - watch.subtype = enum mount_notification_subtype
+ */
+struct mount_notification {
+ struct watch_notification watch; /* WATCH_TYPE_MOUNT_NOTIFY */
+ __u32 triggered_on; /* The mount that triggered the notification */
+ __u32 auxiliary_mount; /* Added/moved/removed mount or 0 */
+ __u32 topology_changes; /* trigger: Number of topology changes applied */
+ __u32 attr_changes; /* trigger: Number of attribute changes applied */
+ __u32 aux_topology_changes; /* aux: Number of topology changes applied */
+ __u32 __padding;
+};
+
#endif /* _UAPI_LINUX_WATCH_QUEUE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..3e1c5c9d2efe 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -85,6 +85,9 @@ COND_SYSCALL(ioprio_get);
/* fs/locks.c */
COND_SYSCALL(flock);
+/* fs/mount_notify.c */
+COND_SYSCALL(watch_mount);
+
/* fs/namei.c */
/* fs/namespace.c */
* [PATCH 2/4] watch_queue: Add security hooks to rule on setting mount watches
From: David Howells @ 2020-07-24 13:11 UTC (permalink / raw)
To: viro
Cc: James Morris, Casey Schaufler, Stephen Smalley,
linux-security-module, dhowells, torvalds, casey, sds,
nicolas.dichtel, raven, christian, jlayton, kzak, mszeredi,
linux-api, linux-fsdevel, linux-security-module, linux-kernel
In-Reply-To: <159559628247.2141315.2107013106060144287.stgit@warthog.procyon.org.uk>
Add a security hook that will allow an LSM to rule on whether or not a
watch may be set on a mount.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: James Morris <jamorris@linux.microsoft.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: linux-security-module@vger.kernel.org
---
include/linux/lsm_hook_defs.h | 3 +++
include/linux/lsm_hooks.h | 6 ++++++
include/linux/security.h | 8 ++++++++
security/security.c | 7 +++++++
4 files changed, 24 insertions(+)
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index af998f93d256..f6eaf8bd617b 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -264,6 +264,9 @@ LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
#if defined(CONFIG_SECURITY) && defined(CONFIG_KEY_NOTIFICATIONS)
LSM_HOOK(int, 0, watch_key, struct key *key)
#endif /* CONFIG_SECURITY && CONFIG_KEY_NOTIFICATIONS */
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+LSM_HOOK(int, 0, watch_mount, struct watch *watch, struct path *path)
+#endif
#ifdef CONFIG_SECURITY_NETWORK
LSM_HOOK(int, 0, unix_stream_connect, struct sock *sock, struct sock *other,
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 95b7c1d32062..56275145b91d 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1468,6 +1468,12 @@
* from a key or keyring.
* @key: The key to watch.
*
+ * @watch_mount:
+ * Check to see if a process is allowed to watch for mount topology change
+ * notifications on a mount subtree.
+ * @watch: The watch object
+ * @path: The root of the subtree to watch.
+ *
* Security hooks for using the eBPF maps and programs functionalities through
* eBPF syscalls.
*
diff --git a/include/linux/security.h b/include/linux/security.h
index 0a0a03b36a3b..318fdfe7f4d6 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1314,6 +1314,14 @@ static inline int security_watch_key(struct key *key)
return 0;
}
#endif
+#if defined(CONFIG_SECURITY) && defined(CONFIG_MOUNT_NOTIFICATIONS)
+int security_watch_mount(struct watch *watch, struct path *path);
+#else
+static inline int security_watch_mount(struct watch *watch, struct path *path)
+{
+ return 0;
+}
+#endif
#ifdef CONFIG_SECURITY_NETWORK
diff --git a/security/security.c b/security/security.c
index 70a7ad357bc6..3cdf5039f727 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2067,6 +2067,13 @@ int security_watch_key(struct key *key)
}
#endif
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+int security_watch_mount(struct watch *watch, struct path *path)
+{
+ return call_int_hook(watch_mount, 0, watch, path);
+}
+#endif
+
#ifdef CONFIG_SECURITY_NETWORK
int security_unix_stream_connect(struct sock *sock, struct sock *other, struct sock *newsk)
* [PATCH 1/4] watch_queue: Make watch_sizeof() check record size
From: David Howells @ 2020-07-24 13:11 UTC (permalink / raw)
To: viro
Cc: Miklos Szeredi, dhowells, torvalds, casey, sds, nicolas.dichtel,
raven, christian, jlayton, kzak, mszeredi, linux-api,
linux-fsdevel, linux-security-module, linux-kernel
In-Reply-To: <159559628247.2141315.2107013106060144287.stgit@warthog.procyon.org.uk>
Make watch_sizeof() give a build error if the size of the struct won't fit
into the size field in the header.
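As a rough illustration of the idea, here is a minimal userspace sketch, not the kernel macro itself. It assumes the length field is the low seven bits of the info word; the DEMO_* constants and struct demo_record are made up for the example, and C11 _Static_assert stands in for BUILD_BUG_ON():

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the record length lives in the low 7 bits of the info word. */
#define DEMO_INFO_LENGTH	0x0000007fu
#define DEMO_INFO_LENGTH_SHIFT	0

/* Hypothetical record: an 8-byte header followed by 24 bytes of payload. */
struct demo_record {
	uint32_t type_and_subtype;
	uint32_t info;
	uint32_t payload[6];
};

/* Refuse to compile if STRUCT cannot be described by the length field. */
#define DEMO_CHECK_FITS(STRUCT)						\
	_Static_assert(sizeof(STRUCT) <=				\
		       (DEMO_INFO_LENGTH >> DEMO_INFO_LENGTH_SHIFT),	\
		       "record too large for the length field")

DEMO_CHECK_FITS(struct demo_record);

/* Length bits to OR into the info word for this record type. */
static inline uint32_t demo_record_length_bits(void)
{
	return (uint32_t)(sizeof(struct demo_record) << DEMO_INFO_LENGTH_SHIFT);
}
```

Growing payload[] past the point where the struct exceeds 127 bytes would make the build fail rather than silently truncating the length.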
Reported-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---
include/linux/watch_queue.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/watch_queue.h b/include/linux/watch_queue.h
index 5e08db2adc31..38e04c7a7951 100644
--- a/include/linux/watch_queue.h
+++ b/include/linux/watch_queue.h
@@ -120,7 +120,12 @@ static inline void remove_watch_list(struct watch_list *wlist, u64 id)
* watch_sizeof - Calculate the information part of the size of a watch record,
* given the structure size.
*/
-#define watch_sizeof(STRUCT) (sizeof(STRUCT) << WATCH_INFO_LENGTH__SHIFT)
+#define watch_sizeof(STRUCT) \
+ ({ \
+ size_t max = WATCH_INFO_LENGTH >> WATCH_INFO_LENGTH__SHIFT; \
+ BUILD_BUG_ON(sizeof(STRUCT) > max); \
+ sizeof(STRUCT) << WATCH_INFO_LENGTH__SHIFT; \
+ })
#endif
* [PATCH 0/4] Mount notifications
From: David Howells @ 2020-07-24 13:11 UTC (permalink / raw)
To: viro
Cc: linux-security-module, Casey Schaufler, James Morris,
Stephen Smalley, Miklos Szeredi, dhowells, torvalds, casey, sds,
nicolas.dichtel, raven, christian, jlayton, kzak, mszeredi,
linux-api, linux-fsdevel, linux-security-module, linux-kernel
Here's a set of patches to add notifications for mount topology events,
such as mounting, unmounting, mount expiry, mount reconfiguration.
An LSM hook is included to allow an LSM to rule on whether or not a mount
watch may be set on a particular path.
Why do we want mount notifications? Whilst /proc/mounts can be polled, it
only tells you that something changed in your namespace. To find out what, you
have to trawl /proc/mounts or similar to work out what changed in the mount
object attributes and mount topology. I'm told that the proc file holding
the namespace_sem is a point of contention, especially as the process of
generating the text descriptions of the mounts/superblocks can be quite
involved.
The notification generated here directly indicates the mounts involved in
any particular event and gives an idea of what the change was.
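To give a feel for how a consumer might pick a record apart, here is a minimal userspace sketch. The bit positions are assumptions about the watch_notification info word (length in the low 7 bits, watch ID in bits 8-15, type-specific flags from bit 16 up), and every DEMO_* and demo_* name is invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the 32-bit info word in the record header. */
#define DEMO_INFO_LENGTH	0x0000007fu	/* record length in bytes */
#define DEMO_INFO_ID		0x0000ff00u	/* watch ID chosen by the watcher */
#define DEMO_INFO_ID_SHIFT	8
#define DEMO_FLAG_IN_SUBTREE	0x00010000u	/* assumed value of WATCH_INFO_FLAG_0 */
#define DEMO_FLAG_IS_NOW_RO	0x00040000u	/* assumed value of WATCH_INFO_FLAG_2 */

/* Length of the whole record, header included. */
static inline uint32_t demo_info_length(uint32_t info)
{
	return info & DEMO_INFO_LENGTH;
}

/* The watch_id that was passed when the watch was set. */
static inline uint32_t demo_info_watch_id(uint32_t info)
{
	return (info & DEMO_INFO_ID) >> DEMO_INFO_ID_SHIFT;
}

/* True if the event happened below, not at, the watched dentry. */
static inline bool demo_info_in_subtree(uint32_t info)
{
	return (info & DEMO_FLAG_IN_SUBTREE) != 0;
}

/* True if the mount ended up read-only. */
static inline bool demo_info_is_now_ro(uint32_t info)
{
	return (info & DEMO_FLAG_IS_NOW_RO) != 0;
}

/* Build an info word for exercising the decoders above. */
static inline uint32_t demo_make_info(uint32_t length, uint32_t watch_id,
				      uint32_t flags)
{
	return (length & DEMO_INFO_LENGTH) |
	       ((watch_id << DEMO_INFO_ID_SHIFT) & DEMO_INFO_ID) | flags;
}
```

A reader would typically check the type and subtype first, then use the length to step to the next record in the buffer.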
This is combined with a new fsinfo() system call that allows, amongst other
things, the ability to retrieve in one go an { id, change_counter } tuple
from all the children of a specified mount, allowing buffer overruns to be
dealt with quickly.
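One way a watcher could recover after an overrun is to compare a saved snapshot of the tuples against a fresh one. The sketch below only shows that comparison; struct demo_child is a hypothetical shape for the tuple, and how the snapshots are obtained (e.g. through fsinfo()) is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shape of one { id, change_counter } tuple. */
struct demo_child {
	uint64_t id;
	uint32_t change_counter;
};

/*
 * Collect the IDs in `now` that are new or whose counter differs from
 * `before`. Returns how many were found; at most out_cap IDs are stored.
 * Mounts that vanished between the snapshots are not reported here.
 */
static size_t demo_changed_children(const struct demo_child *before,
				    size_t n_before,
				    const struct demo_child *now,
				    size_t n_now,
				    uint64_t *out, size_t out_cap)
{
	size_t found = 0;

	for (size_t i = 0; i < n_now; i++) {
		int unchanged = 0;

		for (size_t j = 0; j < n_before; j++) {
			if (before[j].id == now[i].id &&
			    before[j].change_counter == now[i].change_counter) {
				unchanged = 1;
				break;
			}
		}
		if (!unchanged) {
			if (found < out_cap)
				out[found] = now[i].id;
			found++;
		}
	}
	return found;
}

/* Example: mount 11 changed, mount 12 is new, mount 10 is untouched. */
static size_t demo_changed_example(void)
{
	static const struct demo_child before[] = { {10, 3}, {11, 7} };
	static const struct demo_child now[] = { {10, 3}, {11, 8}, {12, 1} };
	uint64_t ids[4];

	return demo_changed_children(before, 2, now, 3, ids, 4);
}
```

The quadratic scan keeps the sketch short; with many children a sorted array or hash table keyed on the ID would be the more sensible choice.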
This is of use to systemd to improve efficiency:
https://lore.kernel.org/linux-fsdevel/20200227151421.3u74ijhqt6ekbiss@ws.net.home/
And it's not just Red Hat that's potentially interested in this:
https://lore.kernel.org/linux-fsdevel/293c9bd3-f530-d75e-c353-ddeabac27cf6@6wind.com/
The kernel patches can also be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-pipe-core
David
---
David Howells (4):
watch_queue: Make watch_sizeof() check record size
watch_queue: Add security hooks to rule on setting mount watches
watch_queue: Implement mount topology and attribute change notifications
watch_queue: sample: Display mount tree change notifications
Documentation/watch_queue.rst | 12 +-
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/Kconfig | 9 +
fs/Makefile | 1 +
fs/mount.h | 21 ++
fs/mount_notify.c | 228 ++++++++++++++++++++
fs/namespace.c | 22 ++
include/linux/dcache.h | 1 +
include/linux/lsm_hook_defs.h | 3 +
include/linux/lsm_hooks.h | 6 +
include/linux/security.h | 8 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/watch_queue.h | 36 +++-
kernel/sys_ni.c | 3 +
samples/watch_queue/watch_test.c | 44 +++-
security/security.c | 7 +
33 files changed, 421 insertions(+), 4 deletions(-)
create mode 100644 fs/mount_notify.c
* Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]
From: David Howells @ 2020-07-24 11:36 UTC (permalink / raw)
To: Ian Kent
Cc: dhowells, Miklos Szeredi, Linus Torvalds, Al Viro,
Casey Schaufler, Stephen Smalley, nicolas.dichtel,
Christian Brauner, andres, Jeff Layton, dray, Karel Zak, keyrings,
Linux API, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <865566fb800a014868a9a7e36a00a14430efb11e.camel@themaw.net>
Ian Kent <raven@themaw.net> wrote:
> I was wondering about id re-use.
>
> Assuming that ids that are returned to the idr db are re-used
> what would the chance be that a recently used id would end up
> being used?
>
> Would that chance increase as ids are consumed and freed over
> time?
I've added something to deal with that in the fsinfo branch. I've given each
mount object and superblock a supplementary 64-bit unique ID that's not likely
to repeat before we're no longer around to have to worry about it.
fsinfo() then allows you to retrieve them by path or by mount ID.
So, yes, mnt_id and s_dev are not unique and may be reused very quickly, but
I'm also providing uniquifiers that you can check.
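As a sketch of how userspace might use such a uniquifier, a cached reference can carry both values and only be trusted when both still match. The function name and parameters here are invented for illustration, and fetching the current pair from the kernel is not shown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a cached mount reference still names the same mount.
 * A matching mnt_id alone is not enough, because the ID may have been
 * reused; the 64-bit unique ID has to match as well.
 */
static inline bool demo_same_mount(uint32_t cached_mnt_id,
				   uint64_t cached_unique_id,
				   uint32_t current_mnt_id,
				   uint64_t current_unique_id)
{
	return cached_mnt_id == current_mnt_id &&
	       cached_unique_id == current_unique_id;
}
```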
David
* Re: [PATCH v7 0/7] Add support for O_MAYEXEC
From: Thibaut Sautereau @ 2020-07-24 11:20 UTC (permalink / raw)
To: Mickaël Salaün
Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Vincent Strubel, kernel-hardening, linux-api, linux-integrity,
linux-security-module, linux-fsdevel
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
On Thu, Jul 23, 2020 at 07:12:20PM +0200, Mickaël Salaün wrote:
> This patch series can be applied on top of v5.8-rc5 .
v5.8-rc6, actually.
> Previous version:
> https://lore.kernel.org/lkml/20200505153156.925111-1-mic@digikod.net/
This is v5.
v6 is at https://lore.kernel.org/lkml/20200714181638.45751-1-mic@digikod.net/
--
Thibaut Sautereau
CLIP OS developer
* Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]
From: Ian Kent @ 2020-07-24 10:44 UTC (permalink / raw)
To: David Howells, Miklos Szeredi
Cc: Linus Torvalds, Al Viro, Casey Schaufler, Stephen Smalley,
nicolas.dichtel, Christian Brauner, andres, Jeff Layton, dray,
Karel Zak, keyrings, Linux API, linux-fsdevel, LSM, linux-kernel
In-Reply-To: <2003787.1595585999@warthog.procyon.org.uk>
On Fri, 2020-07-24 at 11:19 +0100, David Howells wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > > What guarantees that mount_id is going to remain a 32bit entity?
> >
> > You think it likely we'd have >4 billion concurrent mounts on a
> > system? That
> > would require >1.2TiB of RAM just for the struct mount allocations.
> >
> > But I can expand it to __u64.
>
> That said, sys_name_to_handle_at() assumes it's a 32-bit signed
> integer, so
> we're currently limited to ~2 billion concurrent mounts:-/
I was wondering about id re-use.
Assuming that ids that are returned to the idr db are re-used
what would the chance be that a recently used id would end up
being used?
Would that chance increase as ids are consumed and freed over
time?
Yeah, it's one of those questions ... ;)
Ian
* Re: [PATCH 13/17] watch_queue: Implement mount topology and attribute change notifications [ver #5]
From: David Howells @ 2020-07-24 10:19 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, Linus Torvalds, Al Viro, Casey Schaufler,
Stephen Smalley, nicolas.dichtel, Ian Kent, Christian Brauner,
andres, Jeff Layton, dray, Karel Zak, keyrings, Linux API,
linux-fsdevel, LSM, linux-kernel
In-Reply-To: <1293241.1595501326@warthog.procyon.org.uk>
David Howells <dhowells@redhat.com> wrote:
> > What guarantees that mount_id is going to remain a 32bit entity?
>
> You think it likely we'd have >4 billion concurrent mounts on a system? That
> would require >1.2TiB of RAM just for the struct mount allocations.
>
> But I can expand it to __u64.
That said, sys_name_to_handle_at() assumes it's a 32-bit signed integer, so
we're currently limited to ~2 billion concurrent mounts:-/
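For reference, the arithmetic behind these figures can be sketched as below. The per-mount size is left as a parameter because the real sizeof(struct mount) depends on the kernel configuration; the demo_* names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Mount IDs that fit in a non-negative 32-bit signed integer: 0..INT32_MAX. */
#define DEMO_SIGNED_MNT_ID_SPACE	((uint64_t)INT32_MAX + 1)

/* Memory needed for n mounts at bytes_per_mount each (no overflow check). */
static inline uint64_t demo_mount_bytes(uint64_t n, uint64_t bytes_per_mount)
{
	return n * bytes_per_mount;
}

/* Whole TiB contained in a byte count, rounded down. */
static inline uint64_t demo_whole_tib(uint64_t bytes)
{
	return bytes >> 40;
}
```

With a hypothetical 320 bytes per struct mount, 2^32 mounts come to 1.25 TiB, which is in line with the ">1.2TiB" estimate, and the signed 32-bit ID space holds 2^31 (about 2.1 billion) values.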
David
* Re: [PATCH] keys: asymmetric: fix error return code in software_key_query()
From: Jarkko Sakkinen @ 2020-07-24 7:16 UTC (permalink / raw)
To: David Howells
Cc: torvalds, Wei Yongjun, keyrings, linux-security-module,
linux-kernel
In-Reply-To: <1269137.1595490145@warthog.procyon.org.uk>
On Thu, Jul 23, 2020 at 08:42:25AM +0100, David Howells wrote:
> Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> wrote:
>
> > Why f1774cb8956a lacked any possible testing? It extends ABI anyway.
> >
> > I think it is a kind of change that would require more screening before
> > getting applied.
>
> Yeah. It went in via a round-about route. I left off development of it when
> the tpm stuff I wrote broke because the tpm2 stuff went in upstream. I then
> handed the patches off to Denis who did the tpm support, but I never got my
> stuff finished enough to work out how to do the testsuite (since it would
> involve using a tpm). However, since I did the PKCS#8 testing module as well,
> I guess I don't need that to at least test the API. I'll look at using that
> to add some tests. Any suggestions as to how to do testing via the tpm?
>
> David
The unfortunate thing is that I was not involved with asym_tpm.c review
process in any possible way, which means that at the moment I lack both:
1. Knowledge of crypto/asymmetric_keys.
2. How asym_tpm.c is implemented.
I only became aware of asym_tpm.c's existence last Sep [*].
I'll put it on my backlog to try TPM asymmetric keys (at the earliest when
I'm back from vacation, 08/10).
[*] https://lore.kernel.org/linux-integrity/20190926171601.30404-1-jarkko.sakkinen@linux.intel.com/
/Jarkko
* Re: [PATCH bpf-next v6 1/7] bpf: Renames to prepare for generalizing sk_storage.
From: Martin KaFai Lau @ 2020-07-24 5:31 UTC (permalink / raw)
To: KP Singh
Cc: linux-kernel, bpf, linux-security-module, Alexei Starovoitov,
Daniel Borkmann, Paul Turner, Jann Horn, Florent Revest
In-Reply-To: <20200723115032.460770-2-kpsingh@chromium.org>
On Thu, Jul 23, 2020 at 01:50:26PM +0200, KP Singh wrote:
> From: KP Singh <kpsingh@google.com>
>
> A purely mechanical change to split the renaming from the actual
> generalization.
>
> Flags/consts:
>
> SK_STORAGE_CREATE_FLAG_MASK BPF_LOCAL_STORAGE_CREATE_FLAG_MASK
> BPF_SK_STORAGE_CACHE_SIZE BPF_LOCAL_STORAGE_CACHE_SIZE
> MAX_VALUE_SIZE BPF_LOCAL_STORAGE_MAX_VALUE_SIZE
>
> Structs:
>
> bucket bpf_local_storage_map_bucket
> bpf_sk_storage_map bpf_local_storage_map
> bpf_sk_storage_data bpf_local_storage_data
> bpf_sk_storage_elem bpf_local_storage_elem
> bpf_sk_storage bpf_local_storage
> selem_linked_to_sk selem_linked_to_storage
> selem_alloc bpf_selem_alloc
>
> The "sk" member in bpf_local_storage is also updated to "owner"
> in preparation for changing the type to void * in a subsequent patch.
>
> Functions:
>
> __selem_unlink_sk bpf_selem_unlink_storage
> __selem_link_sk bpf_selem_link_storage
> selem_unlink_sk __bpf_selem_unlink_storage
> sk_storage_update bpf_local_storage_update
> __sk_storage_lookup bpf_local_storage_lookup
> bpf_sk_storage_map_free bpf_local_storage_map_free
> bpf_sk_storage_map_alloc bpf_local_storage_map_alloc
> bpf_sk_storage_map_alloc_check bpf_local_storage_map_alloc_check
> bpf_sk_storage_map_check_btf bpf_local_storage_map_check_btf
Thanks for separating this mechanical name change in a separate patch.
It is much easier to follow. This patch looks good.
A minor thought is, when I look at unlink_map() and unlink_storage(),
it keeps me looking back for the lock situation. I think
the main reason is that bpf_selem_unlink_map() is locked but
bpf_selem_unlink_storage() is unlocked now. Maybe:
bpf_selem_unlink_map() => bpf_selem_unlink_map_locked()
bpf_selem_link_map() => bpf_selem_link_map_locked()
__bpf_selem_unlink_storage() => bpf_selem_unlink_storage_locked()
bpf_link_storage() means unlocked
bpf_unlink_storage() means unlocked.
I think it could be one follow-up patch later instead of interrupting
multiple patches in this set for this minor thing. For now, let's
continue with this and remember default is nolock for storage.
I will continue tomorrow.
* Re: [PATCH] Manual pages: use "root user ID" rather than "rootid"
From: Andrew G. Morgan @ 2020-07-24 3:31 UTC (permalink / raw)
To: Michael Kerrisk (man-pages); +Cc: LSM List
In-Reply-To: <20200723091818.494712-1-mtk.manpages@gmail.com>
Applied both this and the cap_from_text man page change.
I've also updated the latter page to show that what used to be
summarized by cap_to_text() as: "= cap_foo+..." will (in libcap-2.41)
be the equivalent, but shorter, text: "cap_foo=..." which is also more
intuitive.
Cheers
Andrew
On Thu, Jul 23, 2020 at 2:18 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> The capabilities(7) page has for quite some time used the term "root user ID",
> which is, I think, a little more precise and expressive than "rootid".
> I think it would be good if libcap used the same terminology.
>
> Signed-off-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
> ---
> doc/cap_get_file.3 | 6 +++---
> doc/getcap.8 | 3 ++-
> doc/setcap.8 | 8 ++++----
> 3 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/doc/cap_get_file.3 b/doc/cap_get_file.3
> index ceacbaf..3f73734 100644
> --- a/doc/cap_get_file.3
> +++ b/doc/cap_get_file.3
> @@ -18,7 +18,7 @@ manipulation on files
> .sp
> .BI "uid_t cap_get_nsowner(cap_t " caps );
> .sp
> -.BI "int cap_set_nsowner(cap_t " caps ", uid_t " rootid );
> +.BI "int cap_set_nsowner(cap_t " caps ", uid_t " rootuid );
> .sp
> Link with \fI\-lcap\fP.
> .SH DESCRIPTION
> @@ -66,13 +66,13 @@ capability in its effective capability set. The effects of writing the
> capability state to any file type other than a regular file are
> undefined.
> .PP
> -A capability set held in memory can be associated with the rootid in
> +A capability set held in memory can be associated with the root user ID in
> use in a specific user namespace. It is possible to get and set this value
> (in the memory copy) with
> .BR cap_get_nsowner ()
> and
> .BR cap_set_nsowner ()
> -respectively. The rootid is ignored by the libcap library in all cases
> +respectively. The root user ID is ignored by the libcap library in all cases
> other than when the capability is written to a file. Only if the value
> is non-zero will the library attempt to include it in the written file
> capability set.
> diff --git a/doc/getcap.8 b/doc/getcap.8
> index 2ad8092..04b601c 100644
> --- a/doc/getcap.8
> +++ b/doc/getcap.8
> @@ -13,7 +13,8 @@ displays the name and capabilities of each specified file.
> prints quick usage.
> .TP 4
> .B \-n
> -prints any non-zero user namespace rootid value found to be associated with
> +prints any non-zero user namespace root user ID value
> +found to be associated with
> a file's capabilities.
> .TP 4
> .B \-r
> diff --git a/doc/setcap.8 b/doc/setcap.8
> index 582c781..463752d 100644
> --- a/doc/setcap.8
> +++ b/doc/setcap.8
> @@ -2,7 +2,7 @@
> .SH NAME
> setcap \- set file capabilities
> .SH SYNOPSIS
> -\fBsetcap\fP [\-q] [\-n <rootid>] [\-v] {\fIcapabilities|\-|\-r} filename\fP [ ... \fIcapabilitiesN\fP \fIfileN\fP ]
> +\fBsetcap\fP [\-q] [\-n <rootuid>] [\-v] {\fIcapabilities|\-|\-r} filename\fP [ ... \fIcapabilitiesN\fP \fIfileN\fP ]
> .SH DESCRIPTION
> In the absence of the
> .B \-v
> @@ -13,13 +13,13 @@ sets the capabilities of each specified
> to the
> .I capabilities
> specified. The optional
> -.B \-n <rootid>
> +.B \-n <rootuid>
> argument can be used to set the file capability for use only in a
> -user namespace with this rootid owner. The
> +user namespace with this root user ID owner. The
> .B \-v
> option is used to verify that the specified capabilities are currently
> associated with the file. If \-v and \-n are supplied, the
> -.B \-n <rootid>
> +.B \-n <rootuid>
> argument is also verified.
> .PP
> The
> --
> 2.26.2
>
* Re: [PATCH v18 22/23] LSM: Add /proc attr entry for full LSM context
From: Casey Schaufler @ 2020-07-24 1:08 UTC (permalink / raw)
To: Jann Horn
Cc: Casey Schaufler, James Morris, linux-security-module,
SElinux list, Kees Cook, John Johansen, Tetsuo Handa, Paul Moore,
Stephen Smalley, Linux API, Casey Schaufler
In-Reply-To: <CAG48ez36_+0k4ubaHRq=9gVDQspUh6yXkAeMRV=cEy-oyOr-sg@mail.gmail.com>
On 7/8/2020 6:30 PM, Jann Horn wrote:
> On Thu, Jul 9, 2020 at 2:42 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> Add an entry /proc/.../attr/context which displays the full
>> process security "context" in compound format:
>> lsm1\0value\0lsm2\0value\0...
>> This entry is not writable.
>>
>> Reviewed-by: Kees Cook <keescook@chromium.org>
>> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
>> Cc: linux-api@vger.kernel.org
> [...]
>> diff --git a/security/security.c b/security/security.c
> [...]
>> +/**
>> + * append_ctx - append a lsm/context pair to a compound context
>> + * @ctx: the existing compound context
>> + * @ctxlen: size of the old context, including terminating nul byte
>> + * @lsm: new lsm name, nul terminated
>> + * @new: new context, possibly nul terminated
>> + * @newlen: maximum size of @new
>> + *
>> + * replace @ctx with a new compound context, appending @lsm and @new
>> + * to @ctx. On exit the new data replaces the old, which is freed.
>> + * @ctxlen is set to the new size, which includes a trailing nul byte.
>> + *
>> + * Returns 0 on success, -ENOMEM if no memory is available.
>> + */
>> +static int append_ctx(char **ctx, int *ctxlen, const char *lsm, char *new,
>> + int newlen)
>> +{
>> + char *final;
>> + int llen;
> Please use size_t to represent object sizes
OK.
> , instead of implicitly
> truncating them and assuming that that doesn't wrap. Using "int" here
> not only makes it harder to statically reason about this code, it
> actually can also make the generated code worse:
>
>
> $ cat numtrunc.c
> #include <stddef.h>
>
> size_t my_strlen(char *p);
> void *my_alloc(size_t len);
>
> void *blah_trunc(char *p) {
> int len = my_strlen(p) + 1;
> return my_alloc(len);
> }
>
> void *blah_notrunc(char *p) {
> size_t len = my_strlen(p) + 1;
> return my_alloc(len);
> }
> $ gcc -O2 -c -o numtrunc.o numtrunc.c
> $ objdump -d numtrunc.o
> [...]
> 0000000000000000 <blah_trunc>:
> 0: 48 83 ec 08 sub $0x8,%rsp
> 4: e8 00 00 00 00 callq 9 <blah_trunc+0x9>
> 9: 48 83 c4 08 add $0x8,%rsp
> d: 8d 78 01 lea 0x1(%rax),%edi
> 10: 48 63 ff movslq %edi,%rdi <<<<<<<<unnecessary instruction
> 13: e9 00 00 00 00 jmpq 18 <blah_trunc+0x18>
> [...]
> 0000000000000020 <blah_notrunc>:
> 20: 48 83 ec 08 sub $0x8,%rsp
> 24: e8 00 00 00 00 callq 29 <blah_notrunc+0x9>
> 29: 48 83 c4 08 add $0x8,%rsp
> 2d: 48 8d 78 01 lea 0x1(%rax),%rdi
> 31: e9 00 00 00 00 jmpq 36 <blah_notrunc+0x16>
> $
>
> This is because GCC documents
> (https://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html) that
> for integer conversions where the value does not fit into the signed
> target type, "the value is reduced modulo 2^N to be within range of
> the type"; so the compiler has to assume that you are actually
> intentionally trying to truncate the more significant bits from the
> length, and therefore may have to insert extra code to ensure that
> this truncation happens.
>
>
>> + llen = strlen(lsm) + 1;
>> + newlen = strnlen(new, newlen) + 1;
> This strnlen() call seems dodgy. If an LSM can return a string that
> already contains null bytes, shouldn't that be considered a bug, given
> that it can't be displayed properly? Would it be more appropriate to
> have a WARN_ON(memchr(new, '\0', newlen)) check here and bail out if
> that happens?
Whether or not a security module should include a trailing nul has
been a matter of some discussion. Alas, the discussion has not reached
consensus. The strnlen() is here to allow modules their own convention.
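To make the two conventions concrete, here is a minimal userspace sketch. It is not code from the patch, and copy_ctx_value() is a hypothetical helper name. It uses strnlen() to accept a value whether or not the module counted a trailing nul in its reported length, and it writes the terminator explicitly so that it never reads past the reported length.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Copies a module's context value into dst and always stores exactly one
 * terminating nul, whether or not reported_len already counted one.
 * dst must have room for reported_len + 1 bytes.
 * Returns the number of bytes stored, including the terminator.
 */
static size_t copy_ctx_value(char *dst, const char *value, size_t reported_len)
{
	/* strnlen() stops at the first nul or at reported_len, whichever is first. */
	size_t n = strnlen(value, reported_len);

	memcpy(dst, value, n);
	dst[n] = '\0';
	return n + 1;
}
```

With either convention the stored result is the same: passing "abc" with a reported length of 3 (no nul counted) or 4 (nul counted) stores the four bytes "abc\0".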
>
>> + final = kzalloc(*ctxlen + llen + newlen, GFP_KERNEL);
>> + if (final == NULL)
>> + return -ENOMEM;
>> + if (*ctxlen)
>> + memcpy(final, *ctx, *ctxlen);
>> + memcpy(final + *ctxlen, lsm, llen);
>> + memcpy(final + *ctxlen + llen, new, newlen);
>> + kfree(*ctx);
>> + *ctx = final;
>> + *ctxlen = *ctxlen + llen + newlen;
>> + return 0;
>> +}
>> +
>> /*
>> * The default value of the LSM hook is defined in linux/lsm_hook_defs.h and
>> * can be accessed with:
>> @@ -2109,6 +2145,10 @@ int security_getprocattr(struct task_struct *p, const char *lsm, char *name,
>> char **value)
>> {
>> struct security_hook_list *hp;
>> + char *final = NULL;
>> + char *cp;
>> + int rc = 0;
>> + int finallen = 0;
>> int display = lsm_task_display(current);
>> int slot = 0;
>>
> [...]
>> return -ENOMEM;
>> }
>>
>> + if (!strcmp(name, "context")) {
>> + hlist_for_each_entry(hp, &security_hook_heads.getprocattr,
>> + list) {
>> + rc = hp->hook.getprocattr(p, "context", &cp);
>> + if (rc == -EINVAL)
>> + continue;
>> + if (rc < 0) {
>> + kfree(final);
>> + return rc;
>> + }
> This means that if SELinux refuses to give the caller PROCESS__GETATTR
> access to the target process, the entire "context" file will refuse to
> show anything, even if e.g. an AppArmor label would be visible through
> the LSM-specific attribute directory, right?
That is correct.
> That seems awkward.
Sure is.
> Can
> you maybe omit context elements for which the access check failed
> instead, or embed an extra flag byte to signal for each element
> whether the lookup failed, or something along those lines?
The SELinux team seems convinced that if their check fails, the
whole thing must fail.
> If this is an intentional design limitation, it should probably be
> documented in the commit message or so.
Point.
>
>> + rc = append_ctx(&final, &finallen, hp->lsmid->lsm,
>> + cp, rc);
>> + if (rc < 0) {
>> + kfree(final);
>> + return rc;
>> + }
> Isn't there a memory leak here?
Why yes, there is.
> `cp` points to memory that was
> allocated by hp->hook.getprocattr(), and you're not freeing it after
> append_ctx(). (And append_ctx() also doesn't free it.)
>
>> + }
>> + if (final == NULL)
>> + return -EINVAL;
>> + *value = final;
>> + return finallen;
>> + }
>> +
>> hlist_for_each_entry(hp, &security_hook_heads.getprocattr, list) {
>> if (lsm != NULL && strcmp(lsm, hp->lsmid->lsm))
>> continue;
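As a rough illustration of two of the review points above (size_t for lengths, and freeing the per-module value once it has been copied), here is a userspace sketch. It is a model under stated assumptions, not the code that was posted: calloc()/free() stand in for kzalloc()/kfree(), and append_ctx_model() and add_module_ctx() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdlib.h>
#include <string.h>

/*
 * Userspace model of append_ctx() with size_t lengths.
 * Appends "lsm\0value\0" to the compound context in *ctx.
 * Returns 0 on success, -1 if allocation fails (*ctx is left untouched).
 */
static int append_ctx_model(char **ctx, size_t *ctxlen, const char *lsm,
			    const char *val, size_t vallen)
{
	size_t llen = strlen(lsm) + 1;		/* name plus its nul */
	size_t vlen = strnlen(val, vallen);	/* value bytes, nul excluded */
	char *final = calloc(1, *ctxlen + llen + vlen + 1);

	if (!final)
		return -1;
	if (*ctxlen)
		memcpy(final, *ctx, *ctxlen);
	memcpy(final + *ctxlen, lsm, llen);
	/* The zeroed allocation supplies the value's terminating nul. */
	memcpy(final + *ctxlen + llen, val, vlen);
	free(*ctx);
	*ctx = final;
	*ctxlen += llen + vlen + 1;
	return 0;
}

/*
 * Model of one loop iteration: val was allocated by the module's hook, so
 * it is released here after the copy, on success and on failure alike.
 */
static int add_module_ctx(char **ctx, size_t *ctxlen, const char *lsm,
			  char *val, size_t vallen)
{
	int rc = append_ctx_model(ctx, ctxlen, lsm, val, vallen);

	free(val);
	return rc;
}
```

Calling add_module_ctx() twice, for example with ("selinux", "a:b") and ("apparmor", "unconfined"), yields the compound layout described in the commit message: each name and each value followed by a single nul byte.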
* [PATCH v7 3/7] exec: Move path_noexec() check earlier
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
From: Kees Cook <keescook@chromium.org>
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s. Check it earlier as well
in may_open() and collect the redundant fs/exec.c path_noexec() test
under the same robustness comment as the S_ISREG() check.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs path_noexec() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
security_file_open(f)
open()
/* old location of path_noexec() test */
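For readers who want to see the noexec property from userspace, the sketch below reads the mount flags with statvfs(3). This is only an illustration of the same mount option that path_noexec() is concerned with, not the kernel's implementation; mount_is_noexec() is a hypothetical helper, and the fallback value for ST_NOEXEC is an assumption that applies to Linux only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc exposes ST_NOEXEC only with this defined */
#endif
#include <sys/statvfs.h>

#ifndef ST_NOEXEC
#define ST_NOEXEC 8	/* assumed Linux value, used only if the header hides it */
#endif

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the filesystem holding path is mounted noexec, 0 if it is
 * not, and -1 if statvfs() fails (errno is set by statvfs()).
 */
static int mount_is_noexec(const char *path)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return -1;
	return (sv.f_flag & ST_NOEXEC) ? 1 : 0;
}
```

A program could call mount_is_noexec("/tmp") to predict whether an execve(2) of a file under that directory would be refused because of the mount option.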
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
---
fs/exec.c | 12 ++++--------
fs/namei.c | 4 ++++
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index bdc6a6eb5dce..4eea20c27b01 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -147,10 +147,8 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
* and check again at the very end too.
*/
error = -EACCES;
- if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
- goto exit;
-
- if (path_noexec(&file->f_path))
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
+ path_noexec(&file->f_path)))
goto exit;
fsnotify_open(file);
@@ -897,10 +895,8 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
* and check again at the very end too.
*/
err = -EACCES;
- if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
- goto exit;
-
- if (path_noexec(&file->f_path))
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
+ path_noexec(&file->f_path)))
goto exit;
err = deny_write_access(file);
diff --git a/fs/namei.c b/fs/namei.c
index a559ad943970..ddc9b25540fe 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2863,6 +2863,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
return -EACCES;
flag &= ~O_TRUNC;
break;
+ case S_IFREG:
+ if ((acc_mode & MAY_EXEC) && path_noexec(path))
+ return -EACCES;
+ break;
}
error = inode_permission(inode, MAY_OPEN | acc_mode);
--
2.27.0
* [PATCH v7 1/7] exec: Change uselib(2) IS_SREG() failure to EACCES
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
From: Kees Cook <keescook@chromium.org>
Change uselib(2)'s S_ISREG() error return to EACCES instead of EINVAL so
the behavior matches execve(2), and the seemingly documented value.
The "not a regular file" failure mode of execve(2) is explicitly
documented[1], but it is not mentioned in uselib(2)[2] which does,
however, say that open(2) and mmap(2) errors may apply. The documentation
for open(2) does not include a "not a regular file" error[3], but mmap(2)
does[4], and it is EACCES.
[1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
[2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
[3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
[4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS
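The execve(2) behavior that this change aligns uselib(2) with can be observed from userspace. The following is a small sketch, with execve_errno() as a hypothetical helper name: it attempts to execute a path and reports the errno that execve() set. It should only be pointed at paths that are expected to fail, since a successful execve() replaces the calling process.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Tries to execute path with an empty environment. execve() returns only
 * on failure, so the value returned here is the errno it reported.
 */
static int execve_errno(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };

	execve(path, argv, envp);
	return errno;
}
```

On Linux, execve_errno("/") reports EACCES, the "not a regular file" case documented in [1].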
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
---
fs/exec.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..d7c937044d10 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -141,11 +141,10 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
if (IS_ERR(file))
goto out;
- error = -EINVAL;
+ error = -EACCES;
if (!S_ISREG(file_inode(file)->i_mode))
goto exit;
- error = -EACCES;
if (path_noexec(&file->f_path))
goto exit;
--
2.27.0
* [PATCH v7 6/7] selftest/openat2: Add tests for O_MAYEXEC enforcing
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel,
Thibaut Sautereau
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
Test propagation of noexec mount points or file executability through
files opened with or without O_MAYEXEC, thanks to the
fs.open_mayexec_enforce sysctl.
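The expectations encoded in the fixture variants below can be summarized by a small model. This is a sketch under an assumption inferred from the sysctl_1, sysctl_2 and sysctl_3 tests: bit 0 of the sysctl enforces the mount check and bit 1 enforces the file permission check. The macro and function names are hypothetical and do not appear in the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumed bit meanings, inferred from the tests; not taken from kernel code. */
#define MODEL_ENFORCE_MOUNT 1	/* refuse files on noexec mounts */
#define MODEL_ENFORCE_FILE  2	/* refuse files without execute permission */

/*
 * Expected result of opening a regular file with O_MAYEXEC, given the
 * sysctl value and the properties of the mount and of the file.
 */
static int expected_mayexec_err(int sysctl, bool mount_exec, bool file_exec)
{
	if ((sysctl & MODEL_ENFORCE_MOUNT) && !mount_exec)
		return -EACCES;
	if ((sysctl & MODEL_ENFORCE_FILE) && !file_exec)
		return -EACCES;
	return 0;
}
```

For the mount_noexec_file_exec variant, this model gives -EACCES, 0 and -EACCES for sysctl values 1, 2 and 3, which matches the sysctl_err_code table in the test.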
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Shuah Khan <shuah@kernel.org>
---
Changes since v6:
* Add full combination tests for all file types, including block
devices, character devices, fifos, sockets and symlinks.
* Properly save and restore initial sysctl value for all tests.
Changes since v5:
* Refactor with FIXTURE_VARIANT, which makes the tests much easier to
read and maintain.
* Save and restore initial sysctl value (suggested by Kees Cook).
* Test with a sysctl value of 0.
* Check errno in sysctl_access_write test.
* Update tests for the CAP_SYS_ADMIN switch.
* Update tests to check -EISDIR (replacing -EACCES).
* Replace FIXTURE_DATA() with FIXTURE() (spotted by Kees Cook).
* Use global const strings.
Changes since v3:
* Replace RESOLVE_MAYEXEC with O_MAYEXEC.
* Add tests to check that O_MAYEXEC is ignored by open(2) and openat(2).
Changes since v2:
* Move tests from exec/ to openat2/ .
* Replace O_MAYEXEC with RESOLVE_MAYEXEC from openat2(2).
* Cleanup tests.
Changes since v1:
* Move tests from yama/ to exec/ .
* Fix _GNU_SOURCE in kselftest_harness.h .
* Add a new test sysctl_access_write to check if CAP_MAC_ADMIN is taken
into account.
* Test directory execution which is always forbidden since commit
73601ea5b7b1 ("fs/open.c: allow opening only regular files during
execve()"), and also check that even the root user can not bypass file
execution checks.
* Make sure delete_workspace() always has enough rights to succeed.
* Cosmetic cleanup.
---
tools/testing/selftests/kselftest_harness.h | 3 +
tools/testing/selftests/openat2/Makefile | 3 +-
tools/testing/selftests/openat2/config | 1 +
tools/testing/selftests/openat2/helpers.h | 1 +
.../testing/selftests/openat2/omayexec_test.c | 325 ++++++++++++++++++
5 files changed, 332 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/openat2/config
create mode 100644 tools/testing/selftests/openat2/omayexec_test.c
diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index c9f03ef93338..68a0acd9ea1e 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -50,7 +50,10 @@
#ifndef __KSELFTEST_HARNESS_H
#define __KSELFTEST_HARNESS_H
+#ifndef _GNU_SOURCE
#define _GNU_SOURCE
+#endif
+
#include <asm/types.h>
#include <errno.h>
#include <stdbool.h>
diff --git a/tools/testing/selftests/openat2/Makefile b/tools/testing/selftests/openat2/Makefile
index 4b93b1417b86..cb98bdb4d5b1 100644
--- a/tools/testing/selftests/openat2/Makefile
+++ b/tools/testing/selftests/openat2/Makefile
@@ -1,7 +1,8 @@
# SPDX-License-Identifier: GPL-2.0-or-later
CFLAGS += -Wall -O2 -g -fsanitize=address -fsanitize=undefined
-TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test
+LDLIBS += -lcap
+TEST_GEN_PROGS := openat2_test resolve_test rename_attack_test omayexec_test
include ../lib.mk
diff --git a/tools/testing/selftests/openat2/config b/tools/testing/selftests/openat2/config
new file mode 100644
index 000000000000..dd53c266bf52
--- /dev/null
+++ b/tools/testing/selftests/openat2/config
@@ -0,0 +1 @@
+CONFIG_SYSCTL=y
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
index a6ea27344db2..1dcd3e1e2f38 100644
--- a/tools/testing/selftests/openat2/helpers.h
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -9,6 +9,7 @@
#define _GNU_SOURCE
#include <stdint.h>
+#include <stdbool.h>
#include <errno.h>
#include <linux/types.h>
#include "../kselftest.h"
diff --git a/tools/testing/selftests/openat2/omayexec_test.c b/tools/testing/selftests/openat2/omayexec_test.c
new file mode 100644
index 000000000000..34b91f9d78d0
--- /dev/null
+++ b/tools/testing/selftests/openat2/omayexec_test.c
@@ -0,0 +1,325 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test O_MAYEXEC
+ *
+ * Copyright © 2018-2020 ANSSI
+ *
+ * Author: Mickaël Salaün <mic@digikod.net>
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/capability.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+#include <unistd.h>
+
+#include "helpers.h"
+#include "../kselftest_harness.h"
+
+#ifndef O_MAYEXEC
+#define O_MAYEXEC 040000000
+#endif
+
+static const char sysctl_path[] = "/proc/sys/fs/open_mayexec_enforce";
+
+static const char workdir_path[] = "./test-mount";
+static const char reg_file_path[] = "./test-mount/regular_file";
+static const char dir_path[] = "./test-mount/directory";
+static const char symlink_path[] = "./test-mount/symlink";
+static const char block_dev_path[] = "./test-mount/block_device";
+static const char char_dev_path[] = "./test-mount/character_device";
+static const char fifo_path[] = "./test-mount/fifo";
+static const char sock_path[] = "./test-mount/socket";
+
+static void ignore_dac(struct __test_metadata *_metadata, int override)
+{
+ cap_t caps;
+ const cap_value_t cap_val[2] = {
+ CAP_DAC_OVERRIDE,
+ CAP_DAC_READ_SEARCH,
+ };
+
+ caps = cap_get_proc();
+ ASSERT_NE(NULL, caps);
+ ASSERT_EQ(0, cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_val,
+ override ? CAP_SET : CAP_CLEAR));
+ ASSERT_EQ(0, cap_set_proc(caps));
+ EXPECT_EQ(0, cap_free(caps));
+}
+
+static void ignore_sys_admin(struct __test_metadata *_metadata, int override)
+{
+ cap_t caps;
+ const cap_value_t cap_val[1] = {
+ CAP_SYS_ADMIN,
+ };
+
+ caps = cap_get_proc();
+ ASSERT_NE(NULL, caps);
+ ASSERT_EQ(0, cap_set_flag(caps, CAP_EFFECTIVE, 1, cap_val,
+ override ? CAP_SET : CAP_CLEAR));
+ ASSERT_EQ(0, cap_set_proc(caps));
+ EXPECT_EQ(0, cap_free(caps));
+}
+
+static void test_omx(struct __test_metadata *_metadata,
+ const char *const path, const int no_mayexec_err_code,
+ const int mayexec_err_code)
+{
+ struct open_how how = {
+ .flags = O_RDONLY | O_NOFOLLOW | O_CLOEXEC,
+ };
+ int fd;
+
+ /* Do not block on pipes. */
+ if (path == fifo_path)
+ how.flags |= O_NONBLOCK;
+
+ /* Opens without O_MAYEXEC. */
+ fd = sys_openat2(AT_FDCWD, path, &how);
+ if (!no_mayexec_err_code) {
+ ASSERT_LE(0, fd) {
+ TH_LOG("Failed to openat2 %s: %d", path, -fd);
+ }
+ EXPECT_EQ(0, close(fd));
+ } else {
+ ASSERT_EQ(no_mayexec_err_code, fd) {
+ TH_LOG("Wrong error for openat2 %s: %d", path, -fd);
+ }
+ }
+
+ how.flags |= O_MAYEXEC;
+
+ /* Checks that O_MAYEXEC is ignored with open(2). */
+ fd = open(path, how.flags);
+ if (!no_mayexec_err_code) {
+ ASSERT_LE(0, fd) {
+ TH_LOG("Failed to open %s: %d", path, errno);
+ }
+ EXPECT_EQ(0, close(fd));
+ } else {
+ ASSERT_EQ(no_mayexec_err_code, -errno);
+ }
+
+ /* Checks that O_MAYEXEC is ignored with openat(2). */
+ fd = openat(AT_FDCWD, path, how.flags);
+ if (!no_mayexec_err_code) {
+ ASSERT_LE(0, fd) {
+ TH_LOG("Failed to openat %s: %d", path, errno);
+ }
+ EXPECT_EQ(0, close(fd));
+ } else {
+ ASSERT_EQ(no_mayexec_err_code, -errno);
+ }
+
+ /* Opens with O_MAYEXEC. */
+ fd = sys_openat2(AT_FDCWD, path, &how);
+ if (!mayexec_err_code) {
+ ASSERT_LE(0, fd) {
+ TH_LOG("Failed to openat2 %s: %d", path, -fd);
+ }
+ EXPECT_EQ(0, close(fd));
+ } else {
+ ASSERT_EQ(mayexec_err_code, fd) {
+ TH_LOG("Wrong error for openat2 %s: %d", path, -fd);
+ }
+ }
+}
+
+static void test_file_types(struct __test_metadata *_metadata, const int err_code,
+ const bool has_policy)
+{
+ test_omx(_metadata, reg_file_path, 0, err_code);
+ test_omx(_metadata, dir_path, 0, -EISDIR);
+ test_omx(_metadata, symlink_path, -ELOOP, -ELOOP);
+ test_omx(_metadata, block_dev_path, 0, has_policy ? -EACCES : 0);
+ test_omx(_metadata, char_dev_path, 0, has_policy ? -EACCES : 0);
+ test_omx(_metadata, fifo_path, 0, has_policy ? -EACCES : 0);
+ test_omx(_metadata, sock_path, -ENXIO, has_policy ? -EACCES : -ENXIO);
+}
+
+static void test_files(struct __test_metadata *_metadata, const int err_code,
+ const bool has_policy)
+{
+ /* Tests as root. */
+ ignore_dac(_metadata, 1);
+ test_file_types(_metadata, err_code, has_policy);
+
+ /* Tests without bypass. */
+ ignore_dac(_metadata, 0);
+ test_file_types(_metadata, err_code, has_policy);
+}
+
+static void sysctl_write_char(struct __test_metadata *_metadata, const char value)
+{
+ int fd;
+
+ fd = open(sysctl_path, O_WRONLY | O_CLOEXEC);
+ ASSERT_LE(0, fd);
+ ASSERT_EQ(1, write(fd, &value, 1));
+ EXPECT_EQ(0, close(fd));
+}
+
+static char sysctl_read_char(struct __test_metadata *_metadata)
+{
+ int fd;
+ char sysctl_value;
+
+ fd = open(sysctl_path, O_RDONLY | O_CLOEXEC);
+ ASSERT_LE(0, fd);
+ ASSERT_EQ(1, read(fd, &sysctl_value, 1));
+ EXPECT_EQ(0, close(fd));
+ return sysctl_value;
+}
+
+FIXTURE(omayexec) {
+ char initial_sysctl_value;
+};
+
+FIXTURE_VARIANT(omayexec) {
+ const bool mount_exec;
+ const bool file_exec;
+ const int sysctl_err_code[3];
+};
+
+FIXTURE_VARIANT_ADD(omayexec, mount_exec_file_exec) {
+ .mount_exec = true,
+ .file_exec = true,
+ .sysctl_err_code = {0, 0, 0},
+};
+
+FIXTURE_VARIANT_ADD(omayexec, mount_exec_file_noexec)
+{
+ .mount_exec = true,
+ .file_exec = false,
+ .sysctl_err_code = {0, -EACCES, -EACCES},
+};
+
+FIXTURE_VARIANT_ADD(omayexec, mount_noexec_file_exec)
+{
+ .mount_exec = false,
+ .file_exec = true,
+ .sysctl_err_code = {-EACCES, 0, -EACCES},
+};
+
+FIXTURE_VARIANT_ADD(omayexec, mount_noexec_file_noexec)
+{
+ .mount_exec = false,
+ .file_exec = false,
+ .sysctl_err_code = {-EACCES, -EACCES, -EACCES},
+};
+
+FIXTURE_SETUP(omayexec)
+{
+ /*
+ * Cleans previous workspace if any error previously happened (don't
+ * check errors).
+ */
+ umount(workdir_path);
+ rmdir(workdir_path);
+
+ /* Creates a clean mount point. */
+ ASSERT_EQ(0, mkdir(workdir_path, 00700));
+ ASSERT_EQ(0, mount("test", workdir_path, "tmpfs", MS_MGC_VAL |
+ (variant->mount_exec ? 0 : MS_NOEXEC),
+ "mode=0700,size=4k"));
+
+ /* Creates a regular file. */
+ ASSERT_EQ(0, mknod(reg_file_path, S_IFREG | (variant->file_exec ? 0500 : 0400), 0));
+ /* Creates a directory. */
+ ASSERT_EQ(0, mkdir(dir_path, variant->file_exec ? 0500 : 0400));
+ /* Creates a symlink pointing to the regular file. */
+ ASSERT_EQ(0, symlink("regular_file", symlink_path));
+ /* Creates a character device: /dev/null. */
+ ASSERT_EQ(0, mknod(char_dev_path, S_IFCHR | 0400, makedev(1, 3)));
+ /* Creates a block device: /dev/loop0 */
+ ASSERT_EQ(0, mknod(block_dev_path, S_IFBLK | 0400, makedev(7, 0)));
+ /* Creates a fifo. */
+ ASSERT_EQ(0, mknod(fifo_path, S_IFIFO | 0400, 0));
+ /* Creates a socket. */
+ ASSERT_EQ(0, mknod(sock_path, S_IFSOCK | 0400, 0));
+
+ /* Saves initial sysctl value. */
+ self->initial_sysctl_value = sysctl_read_char(_metadata);
+
+ /* Prepares for sysctl writes. */
+ ignore_sys_admin(_metadata, 1);
+}
+
+FIXTURE_TEARDOWN(omayexec)
+{
+ /* Restores initial sysctl value. */
+ sysctl_write_char(_metadata, self->initial_sysctl_value);
+
+ /* There is no need to unlink the test files. */
+ ASSERT_EQ(0, umount(workdir_path));
+ ASSERT_EQ(0, rmdir(workdir_path));
+}
+
+TEST_F(omayexec, sysctl_0)
+{
+ /* Do not enforce anything. */
+ sysctl_write_char(_metadata, '0');
+ test_files(_metadata, 0, false);
+}
+
+TEST_F(omayexec, sysctl_1)
+{
+ /* Enforces mount exec check. */
+ sysctl_write_char(_metadata, '1');
+ test_files(_metadata, variant->sysctl_err_code[0], true);
+}
+
+TEST_F(omayexec, sysctl_2)
+{
+ /* Enforces file exec check. */
+ sysctl_write_char(_metadata, '2');
+ test_files(_metadata, variant->sysctl_err_code[1], true);
+}
+
+TEST_F(omayexec, sysctl_3)
+{
+ /* Enforces mount and file exec check. */
+ sysctl_write_char(_metadata, '3');
+ test_files(_metadata, variant->sysctl_err_code[2], true);
+}
+
+FIXTURE(cleanup) {
+ char initial_sysctl_value;
+};
+
+FIXTURE_SETUP(cleanup)
+{
+ /* Saves initial sysctl value. */
+ self->initial_sysctl_value = sysctl_read_char(_metadata);
+}
+
+FIXTURE_TEARDOWN(cleanup)
+{
+ /* Restores initial sysctl value. */
+ ignore_sys_admin(_metadata, 1);
+ sysctl_write_char(_metadata, self->initial_sysctl_value);
+}
+
+TEST_F(cleanup, sysctl_access_write)
+{
+ int fd;
+ ssize_t ret;
+
+ ignore_sys_admin(_metadata, 1);
+ sysctl_write_char(_metadata, '0');
+
+ ignore_sys_admin(_metadata, 0);
+ fd = open(sysctl_path, O_WRONLY | O_CLOEXEC);
+ ASSERT_LE(0, fd);
+ ret = write(fd, "0", 1);
+ ASSERT_EQ(-1, ret);
+ ASSERT_EQ(EPERM, errno);
+ EXPECT_EQ(0, close(fd));
+}
+
+TEST_HARNESS_MAIN
--
2.27.0
* [PATCH v7 2/7] exec: Move S_ISREG() check earlier
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
From: Kees Cook <keescook@chromium.org>
The execve(2)/uselib(2) syscalls have always rejected non-regular
files. Recently, it was noticed that a deadlock was introduced when trying
to execute pipes, as the S_ISREG() test was happening too late. This was
fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.
Move the test into the other inode type checks (which already look
for other pathological conditions[1]). Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the
test to MAY_EXEC.
Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.
My notes on the call path, and related arguments, checks, etc:
do_open_execat()
struct open_flags open_exec_flags = {
.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
.acc_mode = MAY_EXEC,
...
do_filp_open(dfd, filename, open_flags)
path_openat(nameidata, open_flags, flags)
file = alloc_empty_file(open_flags, current_cred());
do_open(nameidata, file, open_flags)
may_open(path, acc_mode, open_flag)
/* new location of MAY_EXEC vs S_ISREG() test */
inode_permission(inode, MAY_OPEN | acc_mode)
security_inode_permission(inode, acc_mode)
vfs_open(path, file)
do_dentry_open(file, path->dentry->d_inode, open)
/* old location of FMODE_EXEC vs S_ISREG() test */
security_file_open(f)
open()
[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/
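The effect of the new inode type checks can be modelled in userspace as follows. This is a simplified sketch of the switch in may_open() as it reads after this patch, not the kernel code itself: the MODEL_MAY_* constants are local stand-ins with assumed values, may_open_dev() is not modelled, and the O_TRUNC handling is left out.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/stat.h>

/* Local stand-ins for the kernel's MAY_* access bits (assumed values). */
#define MODEL_MAY_EXEC  0x1
#define MODEL_MAY_WRITE 0x2

/*
 * Returns the error the file type check would produce for a file of the
 * given mode opened with the given access mode, or 0 if the check passes.
 */
static int may_open_type_check(mode_t mode, int acc_mode)
{
	switch (mode & S_IFMT) {
	case S_IFLNK:
		return -ELOOP;
	case S_IFDIR:
		/* Directories can be neither written nor executed. */
		if (acc_mode & (MODEL_MAY_WRITE | MODEL_MAY_EXEC))
			return -EISDIR;
		break;
	case S_IFBLK:
	case S_IFCHR:
	case S_IFIFO:
	case S_IFSOCK:
		/* Special files are never valid execve()/uselib() targets. */
		if (acc_mode & MODEL_MAY_EXEC)
			return -EACCES;
		break;
	}
	return 0;
}
```

Because this check sits before inode_permission(), an exec-mode open of a FIFO or a directory is refused before any LSM hook in the call path above gets to see it.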
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
---
fs/exec.c | 14 ++++++++++++--
fs/namei.c | 6 ++++--
fs/open.c | 6 ------
3 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index d7c937044d10..bdc6a6eb5dce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -141,8 +141,13 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
if (IS_ERR(file))
goto out;
+ /*
+ * may_open() has already checked for this, so it should be
+ * impossible to trip now. But we need to be extra cautious
+ * and check again at the very end too.
+ */
error = -EACCES;
- if (!S_ISREG(file_inode(file)->i_mode))
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
goto exit;
if (path_noexec(&file->f_path))
@@ -886,8 +891,13 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
if (IS_ERR(file))
goto out;
+ /*
+ * may_open() has already checked for this, so it should be
+ * impossible to trip now. But we need to be extra cautious
+ * and check again at the very end too.
+ */
err = -EACCES;
- if (!S_ISREG(file_inode(file)->i_mode))
+ if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
goto exit;
if (path_noexec(&file->f_path))
diff --git a/fs/namei.c b/fs/namei.c
index 72d4219c93ac..a559ad943970 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2849,16 +2849,18 @@ static int may_open(const struct path *path, int acc_mode, int flag)
case S_IFLNK:
return -ELOOP;
case S_IFDIR:
- if (acc_mode & MAY_WRITE)
+ if (acc_mode & (MAY_WRITE | MAY_EXEC))
return -EISDIR;
break;
case S_IFBLK:
case S_IFCHR:
if (!may_open_dev(path))
return -EACCES;
- /*FALLTHRU*/
+ fallthrough;
case S_IFIFO:
case S_IFSOCK:
+ if (acc_mode & MAY_EXEC)
+ return -EACCES;
flag &= ~O_TRUNC;
break;
}
diff --git a/fs/open.c b/fs/open.c
index 6cd48a61cda3..623b7506a6db 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -784,12 +784,6 @@ static int do_dentry_open(struct file *f,
return 0;
}
- /* Any file opened for execve()/uselib() has to be a regular file. */
- if (unlikely(f->f_flags & FMODE_EXEC && !S_ISREG(inode->i_mode))) {
- error = -EACCES;
- goto cleanup_file;
- }
-
if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
error = get_write_access(inode);
if (unlikely(error))
--
2.27.0
* [PATCH v7 7/7] ima: add policy support for the new file open MAY_OPENEXEC flag
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
From: Mimi Zohar <zohar@linux.ibm.com>
The kernel has no way of differentiating between a file containing data
or code being opened by an interpreter. The proposed O_MAYEXEC
openat2(2) flag bridges this gap by defining and enabling the
MAY_OPENEXEC flag.
This patch adds IMA policy support for the new MAY_OPENEXEC flag.
Example:
measure func=FILE_CHECK mask=^MAY_OPENEXEC
appraise func=FILE_CHECK appraise_type=imasig mask=^MAY_OPENEXEC
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Link: https://lore.kernel.org/r/1588167523-7866-3-git-send-email-zohar@linux.ibm.com
---
Documentation/ABI/testing/ima_policy | 2 +-
security/integrity/ima/ima_main.c | 3 ++-
security/integrity/ima/ima_policy.c | 15 +++++++++++----
3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index cd572912c593..caca46125fe0 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -31,7 +31,7 @@ Description:
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
[KEXEC_CMDLINE] [KEY_CHECK]
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
- [[^]MAY_EXEC]
+ [[^]MAY_EXEC] [[^]MAY_OPENEXEC]
fsmagic:= hex value
fsuuid:= file system UUID (e.g 8bcbe394-4f13-4144-be8e-5aa9ea2ce2f6)
uid:= decimal value
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index c1583d98c5e5..59fd1658a203 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -490,7 +490,8 @@ int ima_file_check(struct file *file, int mask)
security_task_getsecid(current, &secid);
return process_measurement(file, current_cred(), secid, NULL, 0,
- mask & (MAY_READ | MAY_WRITE | MAY_EXEC |
+ mask & (MAY_READ | MAY_WRITE |
+ MAY_EXEC | MAY_OPENEXEC |
MAY_APPEND), FILE_CHECK);
}
EXPORT_SYMBOL_GPL(ima_file_check);
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index e493063a3c34..6487f0b2afdd 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -406,7 +406,8 @@ static bool ima_match_keyring(struct ima_rule_entry *rule,
* @cred: a pointer to a credentials structure for user validation
* @secid: the secid of the task to be validated
* @func: LIM hook identifier
- * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC)
+ * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC |
+ * MAY_OPENEXEC)
* @keyring: keyring name to check in policy for KEY_CHECK func
*
* Returns true on rule match, false on failure.
@@ -527,7 +528,8 @@ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
* being made
* @secid: LSM secid of the task to be validated
* @func: IMA hook identifier
- * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC)
+ * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC |
+ * MAY_OPENEXEC)
* @pcr: set the pcr to extend
* @template_desc: the template that should be used for this rule
* @keyring: the keyring name, if given, to be used to check in the policy.
@@ -1091,6 +1093,8 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
entry->mask = MAY_READ;
else if (strcmp(from, "MAY_APPEND") == 0)
entry->mask = MAY_APPEND;
+ else if (strcmp(from, "MAY_OPENEXEC") == 0)
+ entry->mask = MAY_OPENEXEC;
else
result = -EINVAL;
if (!result)
@@ -1422,14 +1426,15 @@ const char *const func_tokens[] = {
#ifdef CONFIG_IMA_READ_POLICY
enum {
- mask_exec = 0, mask_write, mask_read, mask_append
+ mask_exec = 0, mask_write, mask_read, mask_append, mask_openexec
};
static const char *const mask_tokens[] = {
"^MAY_EXEC",
"^MAY_WRITE",
"^MAY_READ",
- "^MAY_APPEND"
+ "^MAY_APPEND",
+ "^MAY_OPENEXEC"
};
void *ima_policy_start(struct seq_file *m, loff_t *pos)
@@ -1518,6 +1523,8 @@ int ima_policy_show(struct seq_file *m, void *v)
seq_printf(m, pt(Opt_mask), mt(mask_read) + offset);
if (entry->mask & MAY_APPEND)
seq_printf(m, pt(Opt_mask), mt(mask_append) + offset);
+ if (entry->mask & MAY_OPENEXEC)
+ seq_printf(m, pt(Opt_mask), mt(mask_openexec) + offset);
seq_puts(m, " ");
}
--
2.27.0
^ permalink raw reply related
* [PATCH v7 4/7] fs: Introduce O_MAYEXEC flag for openat2(2)
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel,
Thibaut Sautereau
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
When the O_MAYEXEC flag is passed, openat2(2) may be subject to
additional restrictions depending on a security policy managed by the
kernel through a sysctl or implemented by an LSM thanks to the
inode_permission hook. This new flag is ignored by open(2) and
openat(2) because of their unspecified flags handling. When used with
openat2(2), the default behavior is only to forbid opening a directory.
The underlying idea is to be able to restrict script interpretation
according to a policy defined by the system administrator. For this to
be possible, script interpreters must use the O_MAYEXEC flag
appropriately. To be fully effective, these interpreters also need to
handle the other ways to execute code: command line parameters (e.g.,
option -e for Perl), module loading (e.g., option -m for Python), stdin,
file sourcing, environment variables, configuration files, etc.
Depending on the threat model, it may be acceptable to allow some script
interpreters (e.g. Bash) to interpret commands from stdin, whether it is a
TTY or a pipe, because this may not be enough to (directly) perform
syscalls. Further documentation can be found in a following patch.
Even without an enforced security policy, userland interpreters can set
this flag to enforce the system policy at their level, knowing that it will
not break anything on running systems which do not care about this feature.
However, on systems which want this feature enforced, there will be
knowledgeable people (i.e. sysadmins who enforced O_MAYEXEC
deliberately) to manage it. A simple security policy implementation,
configured through a dedicated sysctl, is available in a following
patch.
O_MAYEXEC should not be confused with the O_EXEC flag which is intended
for execute-only, which obviously doesn't work for scripts. However, a
similar behavior could be implemented in userland with O_PATH:
https://lore.kernel.org/lkml/1e2f6913-42f2-3578-28ed-567f6a4bdda1@digikod.net/
The implementation of O_MAYEXEC almost duplicates what execve(2) and
uselib(2) are already doing: setting MAY_OPENEXEC in acc_mode (which can
then be checked as MAY_EXEC, if enforced).
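
The flag handling described above can be modelled in a few lines of
userspace C. This is only a sketch of the behavior in this patch, not kernel
code: mayexec_acc_mode() and its via_openat2 parameter are hypothetical, and
the two constants are copied from the diff below with a _SKETCH suffix to
make clear they are not taken from installed headers.

```c
#include <stdbool.h>

/* Values copied from this patch; the _SKETCH names are illustrative only. */
#define O_MAYEXEC_SKETCH    040000000  /* O_MAYEXEC, asm-generic/fcntl.h */
#define MAY_OPENEXEC_SKETCH 0x00000100 /* MAY_OPENEXEC, linux/fs.h */

/*
 * Return the acc_mode bits derived from O_MAYEXEC.
 *
 * via_openat2 is false for open(2)/openat(2): those syscalls go through
 * build_open_how(), which clears O_MAYEXEC, so the flag is ignored.
 * For openat2(2) the flag survives and build_open_flags() turns it into
 * MAY_OPENEXEC.
 */
int mayexec_acc_mode(unsigned long flags, bool via_openat2)
{
	if (!via_openat2)
		flags &= ~(unsigned long)O_MAYEXEC_SKETCH;

	return (flags & O_MAYEXEC_SKETCH) ? MAY_OPENEXEC_SKETCH : 0;
}
```

The point of the split is that only openat2(2), which rejects unknown flags,
ever sees O_MAYEXEC; legacy callers passing stray bits are unaffected.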
This is an updated subset of the patch initially written by Vincent
Strubel for CLIP OS 4:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than 12 years with customized script
interpreters. Some examples (with the original O_MAYEXEC) can be found
here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
Co-developed-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
Co-developed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deven Bowers <deven.desai@linux.microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
---
Changes since v6:
* Do not set __FMODE_EXEC for now because of inconsistent behavior:
https://lore.kernel.org/lkml/202007160822.CCDB5478@keescook/
* Returns EISDIR when opening a directory with O_MAYEXEC.
* Removed Deven Bowers and Kees Cook Reviewed-by tags because of the
current update.
Changes since v5:
* Update commit message.
Changes since v3:
* Switch back to O_MAYEXEC, but only handle it with openat2(2) which
checks unknown flags (suggested by Aleksa Sarai). Cf.
https://lore.kernel.org/lkml/20200430015429.wuob7m5ofdewubui@yavin.dot.cyphar.com/
Changes since v2:
* Replace O_MAYEXEC with RESOLVE_MAYEXEC from openat2(2). This change
avoids breaking existing applications using bogus O_* flags that
may be ignored by current kernels by using a new dedicated flag, only
usable through openat2(2) (suggested by Jeff Layton). Using this flag
will result in an error if the running kernel does not support it.
User space needs to manage this case, as with other RESOLVE_* flags.
The best effort approach to security (for most common distros) will
simply consist of ignoring such an error and retrying without
RESOLVE_MAYEXEC. However, a fully controlled system may wish to
error out if such an inconsistency is detected.
Changes since v1:
* Set __FMODE_EXEC when using O_MAYEXEC to make this information
available through the new fanotify/FAN_OPEN_EXEC event (suggested by
Jan Kara and Matthew Bobrowski):
https://lore.kernel.org/lkml/20181213094658.GA996@lithium.mbobrowski.org/
---
fs/fcntl.c | 2 +-
fs/namei.c | 4 ++--
fs/open.c | 6 ++++++
include/linux/fcntl.h | 2 +-
include/linux/fs.h | 2 ++
include/uapi/asm-generic/fcntl.h | 7 +++++++
6 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..0357ad667563 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/namei.c b/fs/namei.c
index ddc9b25540fe..3f074ec77390 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -428,7 +428,7 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
/**
* inode_permission - Check for access rights to a given inode
* @inode: Inode to check permission on
- * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC, %MAY_OPENEXEC)
*
* Check for read/write/execute permissions on an inode. We use fs[ug]id for
* this, letting us set arbitrary permissions for filesystem access without
@@ -2849,7 +2849,7 @@ static int may_open(const struct path *path, int acc_mode, int flag)
case S_IFLNK:
return -ELOOP;
case S_IFDIR:
- if (acc_mode & (MAY_WRITE | MAY_EXEC))
+ if (acc_mode & (MAY_WRITE | MAY_EXEC | MAY_OPENEXEC))
return -EISDIR;
break;
case S_IFBLK:
diff --git a/fs/open.c b/fs/open.c
index 623b7506a6db..21c2c1020574 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -987,6 +987,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
.mode = mode & S_IALLUGO,
};
+ /* O_MAYEXEC is ignored by syscalls relying on build_open_how(). */
+ how.flags &= ~O_MAYEXEC;
/* O_PATH beats everything else. */
if (how.flags & O_PATH)
how.flags &= O_PATH_FLAGS;
@@ -1054,6 +1056,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
if (flags & __O_SYNC)
flags |= O_DSYNC;
+ /* Checks execution permissions on open. */
+ if (flags & O_MAYEXEC)
+ acc_mode |= MAY_OPENEXEC;
+
op->open_flag = flags;
/* O_TRUNC implies we need access checks for write permissions */
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..e188a360fa5f 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_MAYEXEC)
/* List of all valid flags for the how->upgrade_mask argument: */
#define VALID_UPGRADE_FLAGS \
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5abba86107d..56f835c9a87a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -101,6 +101,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
#define MAY_CHDIR 0x00000040
/* called from RCU mode, don't block */
#define MAY_NOT_BLOCK 0x00000080
+/* the inode is opened with O_MAYEXEC */
+#define MAY_OPENEXEC 0x00000100
/*
* flags in file.f_mode. Note that FMODE_READ and FMODE_WRITE must correspond
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..bca90620119f 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -97,6 +97,13 @@
#define O_NDELAY O_NONBLOCK
#endif
+/*
+ * Code execution from file is intended, checks such permission. A simple
+ * policy can be enforced system-wide as explained in
+ * Documentation/admin-guide/sysctl/fs.rst .
+ */
+#define O_MAYEXEC 040000000
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--
2.27.0
^ permalink raw reply related
* [PATCH v7 5/7] fs,doc: Enable to enforce noexec mounts or file exec through O_MAYEXEC
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel,
Thibaut Sautereau, Randy Dunlap
In-Reply-To: <20200723171227.446711-1-mic@digikod.net>
Allow for the enforcement of the O_MAYEXEC openat2(2) flag. Thanks to
the noexec option from the underlying VFS mount, or to the file execute
permission, userspace can enforce these execution policies. This may
allow script interpreters to check execution permission before reading
commands from a file, or dynamic linkers to allow shared object loading.
Add a new sysctl fs.open_mayexec_enforce to enable system administrators
to enforce two complementary security policies according to the
installed system: enforce the noexec mount option, and enforce
executable file permission. Indeed, because of compatibility with
installed systems, only system administrators are able to check that
this new enforcement is in line with the system mount points and file
permissions. A following patch adds documentation.
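
As a rough illustration of the sysctl encoding, the following userspace
sketch decodes a fs.open_mayexec_enforce value into the two policies. The
bit definitions and the accepted range (0 to 3) mirror the diff below;
struct mayexec_policy and decode_open_mayexec_enforce() are hypothetical
names used only for this example.

```c
#include <stdbool.h>

/* Same bit assignments as BIT(0) and BIT(1) in fs/namei.c from this patch. */
#define OPEN_MAYEXEC_ENFORCE_MOUNT (1 << 0)
#define OPEN_MAYEXEC_ENFORCE_FILE  (1 << 1)

/* Hypothetical decoded view of the sysctl value. */
struct mayexec_policy {
	bool valid;         /* value is within the accepted range */
	bool enforce_mount; /* honor the noexec mount option */
	bool enforce_file;  /* require the file execute permission */
};

/* Decode a sysctl value; anything outside 0..3 is reported as invalid. */
struct mayexec_policy decode_open_mayexec_enforce(int value)
{
	struct mayexec_policy policy = { false, false, false };

	if (value < 0 || value > 3)
		return policy;

	policy.valid = true;
	policy.enforce_mount = (value & OPEN_MAYEXEC_ENFORCE_MOUNT) != 0;
	policy.enforce_file = (value & OPEN_MAYEXEC_ENFORCE_FILE) != 0;
	return policy;
}
```

For example, a value of 3 enables both policies, while 0 (the default)
enforces nothing.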
Being able to restrict execution also helps protect the kernel by
restricting arbitrary syscalls that an attacker could perform with a
crafted binary or certain script languages. It also improves multilevel
isolation by reducing the ability of an attacker to use side channels
with specific code. These restrictions can natively be enforced for ELF
binaries (with the noexec mount option) but require this kernel
extension to properly handle scripts (e.g., Python, Perl). To get a
consistent execution policy, additional memory restrictions should also
be enforced (e.g. thanks to SELinux).
Because the O_MAYEXEC flag is meant to enforce a system-wide security
policy (but not application-centric policies), it does not make sense
for userland to check the sysctl value. Indeed, this new flag only
extends the system's ability to enforce a policy thanks to (some
trusted) userland collaboration. Moreover, additional security policies
could be managed by LSMs. This is a best-effort approach from the
application developer point of view:
https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@digikod.net/
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
---
Changes since v6:
* Allow opening pipes, block devices and character devices with
O_MAYEXEC when there is no enforced policy, but forbid any non-regular
file opened with O_MAYEXEC otherwise (i.e. for any enforced policy).
* Add a paragraph about the non-regular files policy.
* Move path_noexec() calls out of the fast-path (suggested by Kees
Cook).
Changes since v5:
* Remove the static enforcement configuration through Kconfig because it
makes the code simpler, and because the current sysctl
configuration can only be set with CAP_SYS_ADMIN, the same way mount
options (i.e. noexec) can be set. If a hardened distro wants to
enforce a configuration, it should restrict capabilities or sysctl
configuration. Furthermore, an LSM can easily leverage O_MAYEXEC to
fit its needs.
* Move checks from inode_permission() to may_open() and make the error
codes more consistent according to file types (in line with a previous
commit): opening a directory with O_MAYEXEC returns EISDIR and other
non-regular file types may return EACCES.
* In may_open(), when OMAYEXEC_ENFORCE_FILE is set, replace explicit
call to generic_permission() with an artificial MAY_EXEC to avoid
double calls. This makes sense especially when an LSM policy forbids
execution of a file.
* Replace the custom proc_omayexec() with
proc_dointvec_minmax_sysadmin(), and then replace the CAP_MAC_ADMIN
check with a CAP_SYS_ADMIN one (suggested by Kees Cook and Stephen
Smalley).
* Use BIT() (suggested by Kees Cook).
* Rename variables (suggested by Kees Cook).
* Reword the kconfig help.
* Import the documentation patch (suggested by Kees Cook):
https://lore.kernel.org/lkml/20200505153156.925111-6-mic@digikod.net/
* Update documentation and add LWN.net article.
Changes since v4:
* Add kernel configuration options to enforce O_MAYEXEC at build time,
and disable the sysctl in such case (requested by James Morris).
* Reword commit message.
Changes since v3:
* Update comment with O_MAYEXEC.
Changes since v2:
* Cosmetic changes.
Changes since v1:
* Move code from Yama to the FS subsystem (suggested by Kees Cook).
* Make omayexec_inode_permission() static (suggested by Jann Horn).
* Use mode 0600 for the sysctl.
* Only match regular files (not directories nor other types), which
follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
opening only regular files during execve()").
---
Documentation/admin-guide/sysctl/fs.rst | 49 +++++++++++++++++++++++++
fs/namei.c | 24 ++++++++++++
include/linux/fs.h | 1 +
kernel/sysctl.c | 12 +++++-
4 files changed, 84 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a45119e3331..ce6e2081d3a9 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs:
- inode-nr
- inode-state
- nr_open
+- open_mayexec_enforce
- overflowuid
- overflowgid
- pipe-user-pages-hard
@@ -165,6 +166,54 @@ system needs to prune the inode list instead of allocating
more.
+open_mayexec_enforce
+--------------------
+
+While being ignored by :manpage:`open(2)` and :manpage:`openat(2)`, the
+``O_MAYEXEC`` flag can be passed to :manpage:`openat2(2)` to only open regular
+files that are expected to be executable. If the file is not identified as
+executable, then the syscall returns -EACCES. This may allow a script
+interpreter to check executable permission before reading commands from a file,
+or a dynamic linker to only load executable shared objects. One interesting
+use case is to enforce a "write xor execute" policy through interpreters.
+
+The ability to restrict code execution must be thought as a system-wide policy,
+which first starts by restricting mount points with the ``noexec`` option.
+This option is also automatically applied to special filesystems such as /proc .
+This prevents files on such mount points to be directly executed by the kernel
+or mapped as executable memory (e.g. libraries). With script interpreters
+using the ``O_MAYEXEC`` flag, the executable permission can then be checked
+before reading commands from files. This makes it possible to enforce the
+``noexec`` at the interpreter level, and thus propagates this security policy
+to scripts. To be fully effective, these interpreters also need to handle the
+other ways to execute code: command line parameters (e.g., option ``-e`` for
+Perl), module loading (e.g., option ``-m`` for Python), stdin, file sourcing,
+environment variables, configuration files, etc. According to the threat
+model, it may be acceptable to allow some script interpreters (e.g. Bash) to
+interpret commands from stdin, may it be a TTY or a pipe, because it may not be
+enough to (directly) perform syscalls.
+
+There are two complementary security policies: enforce the ``noexec`` mount
+option, and enforce executable file permission. These policies are handled by
+the ``fs.open_mayexec_enforce`` sysctl (writable only with ``CAP_SYS_ADMIN``)
+as a bitmask:
+
+1 - Mount restriction: checks that the mount options for the underlying VFS
+ mount do not prevent execution.
+
+2 - File permission restriction: checks that the to-be-opened file is marked as
+ executable for the current process (e.g., POSIX permissions).
+
+Note that as long as a policy is enforced, opening any non-regular file with
+``O_MAYEXEC`` is denied (e.g. TTYs, pipe), even when such a file is marked as
+executable or is on an executable mount point.
+
+Code samples can be found in tools/testing/selftests/openat2/omayexec_test.c
+and interpreter patches (for the original O_MAYEXEC version) may be found at
+https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC .
+See also an overview article: https://lwn.net/Articles/820000/ .
+
+
overflowgid & overflowuid
-------------------------
diff --git a/fs/namei.c b/fs/namei.c
index 3f074ec77390..8ec13c7fd403 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
#include <linux/bitops.h>
#include <linux/init_task.h>
#include <linux/uaccess.h>
+#include <linux/sysctl.h>
#include "internal.h"
#include "mount.h"
@@ -425,6 +426,11 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
return 0;
}
+#define OPEN_MAYEXEC_ENFORCE_MOUNT BIT(0)
+#define OPEN_MAYEXEC_ENFORCE_FILE BIT(1)
+
+int sysctl_open_mayexec_enforce __read_mostly;
+
/**
* inode_permission - Check for access rights to a given inode
* @inode: Inode to check permission on
@@ -2861,11 +2867,29 @@ static int may_open(const struct path *path, int acc_mode, int flag)
case S_IFSOCK:
if (acc_mode & MAY_EXEC)
return -EACCES;
+ /*
+ * Opening devices (e.g. TTYs) or pipes with O_MAYEXEC may be
+ * legitimate when there is no enforced policy.
+ */
+ if ((acc_mode & MAY_OPENEXEC) && sysctl_open_mayexec_enforce)
+ return -EACCES;
flag &= ~O_TRUNC;
break;
case S_IFREG:
if ((acc_mode & MAY_EXEC) && path_noexec(path))
return -EACCES;
+ if (acc_mode & MAY_OPENEXEC) {
+ if ((sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_MOUNT)
+ && path_noexec(path))
+ return -EACCES;
+ if (sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_FILE)
+ /*
+ * Because acc_mode may change here, the next and only
+ * use of acc_mode should then be by the following call
+ * to inode_permission().
+ */
+ acc_mode |= MAY_EXEC;
+ }
break;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 56f835c9a87a..071f37707ccc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -83,6 +83,7 @@ extern int sysctl_protected_symlinks;
extern int sysctl_protected_hardlinks;
extern int sysctl_protected_fifos;
extern int sysctl_protected_regular;
+extern int sysctl_open_mayexec_enforce;
typedef __kernel_rwf_t rwf_t;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index db1ce7af2563..5008a2566e79 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -113,6 +113,7 @@ static int sixty = 60;
static int __maybe_unused neg_one = -1;
static int __maybe_unused two = 2;
+static int __maybe_unused three = 3;
static int __maybe_unused four = 4;
static unsigned long zero_ul;
static unsigned long one_ul = 1;
@@ -888,7 +889,6 @@ static int proc_taint(struct ctl_table *table, int write,
return err;
}
-#ifdef CONFIG_PRINTK
static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
@@ -897,7 +897,6 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
}
-#endif
/**
* struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
@@ -3264,6 +3263,15 @@ static struct ctl_table fs_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = &two,
},
+ {
+ .procname = "open_mayexec_enforce",
+ .data = &sysctl_open_mayexec_enforce,
+ .maxlen = sizeof(int),
+ .mode = 0600,
+ .proc_handler = proc_dointvec_minmax_sysadmin,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &three,
+ },
#if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
{
.procname = "binfmt_misc",
--
2.27.0
^ permalink raw reply related
* [PATCH v7 0/7] Add support for O_MAYEXEC
From: Mickaël Salaün @ 2020-07-23 17:12 UTC (permalink / raw)
To: linux-kernel
Cc: Mickaël Salaün, Aleksa Sarai, Alexei Starovoitov,
Al Viro, Andrew Morton, Andy Lutomirski, Christian Brauner,
Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
Philippe Trébuchet, Scott Shell, Sean Christopherson,
Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
linux-integrity, linux-security-module, linux-fsdevel
Hi,
This seventh patch series does not set __FMODE_EXEC for the sake of
simplicity. A notification feature could be added later if needed. The
handling of all file types is now well defined and tested: by default,
when opening a path, access to a directory is denied (with EISDIR),
access to a regular file depends on the sysctl policy, and access to
other file types (i.e. fifo, device, socket) is denied if there is any
enforced policy. There are new tests covering all these cases (cf.
test_file_types() ).
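
The file type handling summarized above can be written down as a small
decision function. This is a simplified userspace model, not the kernel
implementation: enum file_kind and mayexec_open_result() are hypothetical,
symlinks and the ordinary read/write permission checks are left out, and
the enforce argument stands for the sysctl bitmask (bit 0: mount
restriction, bit 1: file permission restriction).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of the opened path. */
enum file_kind {
	KIND_DIR,   /* directory */
	KIND_REG,   /* regular file */
	KIND_OTHER  /* fifo, device, socket */
};

/*
 * Model the outcome of an openat2(2) call with O_MAYEXEC.
 * Returns 0 when the O_MAYEXEC checks pass, or a negative errno.
 */
int mayexec_open_result(enum file_kind kind, int enforce,
			bool mount_noexec, bool file_executable)
{
	switch (kind) {
	case KIND_DIR:
		/* Always denied, whatever the policy. */
		return -EISDIR;
	case KIND_OTHER:
		/* Denied as soon as any policy is enforced. */
		return enforce ? -EACCES : 0;
	case KIND_REG:
		if ((enforce & 1) && mount_noexec)
			return -EACCES;
		if ((enforce & 2) && !file_executable)
			return -EACCES;
		return 0;
	}
	return 0;
}
```

With no policy enforced, only directories are refused; with both bits set,
a regular file must be executable and live on a mount without noexec.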
As requested by Mimi Zohar, I completed the series with one of her
patches for IMA. I also picked Kees Cook's patches to consolidate exec
permission checking into do_filp_open()'s flow.
# Goal of O_MAYEXEC
The goal of this patch series is to enable control of script execution
with the help of interpreters. A new O_MAYEXEC flag, usable through
openat2(2), is added to enable userspace script interpreters to delegate
to the kernel (and thus the system security policy) the permission to
interpret/execute scripts or other files containing what can be seen as
commands.
A simple system-wide security policy can be enforced by the system
administrator through a sysctl configuration consistent with the mount
points or the file access rights. The documentation patch explains the
prerequisites.
Furthermore, the security policy can also be delegated to an LSM, either
a MAC system or an integrity system. For instance, the new kernel
MAY_OPENEXEC flag closes a major IMA measurement/appraisal interpreter
integrity gap by bringing the ability to check the use of scripts [1].
Other uses are expected, such as for magic-links [2], SGX integration
[3], bpffs [4] or IPE [5].
# Prerequisite of its use
Userspace needs to adapt to take advantage of this new feature. For
example, the PEP 578 [6] (Runtime Audit Hooks) enables Python 3.8 to be
extended with policy enforcement points related to code interpretation,
which can be used to align with the PowerShell audit features.
Additional Python security improvements (e.g. a limited interpreter
without -c, stdin piping of code) are on their way [7].
# Examples
The initial idea comes from CLIP OS 4 and the original implementation
has been used for more than 12 years:
https://github.com/clipos-archive/clipos4_doc
Chrome OS has a similar approach:
https://chromium.googlesource.com/chromiumos/docs/+/master/security/noexec_shell_scripts.md
Userland patches can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
Actually, there is more than the O_MAYEXEC changes (which match this search),
e.g., to prevent Python interactive execution. There are patches for
Bash, Wine, Java (Icedtea), Busybox's ash, Perl and Python. There are
also some related patches which do not directly rely on O_MAYEXEC but
which restrict the use of browser plugins and extensions, which may be
seen as scripts too:
https://github.com/clipos-archive/clipos4_portage-overlay/tree/master/www-client
An introduction to O_MAYEXEC was given at the Linux Security Summit
Europe 2018 - Linux Kernel Security Contributions by ANSSI:
https://www.youtube.com/watch?v=chNjCRtPKQY&t=17m15s
The "write xor execute" principle was explained at Kernel Recipes 2018 -
CLIP OS: a defense-in-depth OS:
https://www.youtube.com/watch?v=PjRE0uBtkHU&t=11m14s
See also an overview article: https://lwn.net/Articles/820000/
This patch series can be applied on top of v5.8-rc5 . This can be tested
with CONFIG_SYSCTL. I would really appreciate constructive comments on
this patch series.
Previous version:
https://lore.kernel.org/lkml/20200505153156.925111-1-mic@digikod.net/
[1] https://lore.kernel.org/lkml/1544647356.4028.105.camel@linux.ibm.com/
[2] https://lore.kernel.org/lkml/20190904201933.10736-6-cyphar@cyphar.com/
[3] https://lore.kernel.org/lkml/CALCETrVovr8XNZSroey7pHF46O=kj_c5D9K8h=z2T_cNrpvMig@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CALCETrVeZ0eufFXwfhtaG_j+AdvbzEWE0M3wjXMWVEO7pj+xkw@mail.gmail.com/
[5] https://lore.kernel.org/lkml/20200406221439.1469862-12-deven.desai@linux.microsoft.com/
[6] https://www.python.org/dev/peps/pep-0578/
[7] https://lore.kernel.org/lkml/0c70debd-e79e-d514-06c6-4cd1e021fa8b@python.org/
Regards,
Kees Cook (3):
exec: Change uselib(2) IS_SREG() failure to EACCES
exec: Move S_ISREG() check earlier
exec: Move path_noexec() check earlier
Mickaël Salaün (3):
fs: Introduce O_MAYEXEC flag for openat2(2)
fs,doc: Enable to enforce noexec mounts or file exec through O_MAYEXEC
selftest/openat2: Add tests for O_MAYEXEC enforcing
Mimi Zohar (1):
ima: add policy support for the new file open MAY_OPENEXEC flag
Documentation/ABI/testing/ima_policy | 2 +-
Documentation/admin-guide/sysctl/fs.rst | 49 +++
fs/exec.c | 23 +-
fs/fcntl.c | 2 +-
fs/namei.c | 36 +-
fs/open.c | 12 +-
include/linux/fcntl.h | 2 +-
include/linux/fs.h | 3 +
include/uapi/asm-generic/fcntl.h | 7 +
kernel/sysctl.c | 12 +-
security/integrity/ima/ima_main.c | 3 +-
security/integrity/ima/ima_policy.c | 15 +-
tools/testing/selftests/kselftest_harness.h | 3 +
tools/testing/selftests/openat2/Makefile | 3 +-
tools/testing/selftests/openat2/config | 1 +
tools/testing/selftests/openat2/helpers.h | 1 +
.../testing/selftests/openat2/omayexec_test.c | 325 ++++++++++++++++++
17 files changed, 470 insertions(+), 29 deletions(-)
create mode 100644 tools/testing/selftests/openat2/config
create mode 100644 tools/testing/selftests/openat2/omayexec_test.c
--
2.27.0
^ permalink raw reply
* Re: [PATCH 1/2] Smack: fix another vsscanf out of bounds
From: Casey Schaufler @ 2020-07-23 16:38 UTC (permalink / raw)
To: Dan Carpenter
Cc: James Morris, Serge E. Hallyn, linux-security-module,
linux-kernel, syzkaller-bugs, Casey Schaufler
In-Reply-To: <20200723152219.GA302005@mwanda>
On 7/23/2020 8:22 AM, Dan Carpenter wrote:
> This is similar to commit 84e99e58e8d1 ("Smack: slab-out-of-bounds in
> vsscanf") where we added a bounds check on "rule".
>
> Reported-by: syzbot+a22c6092d003d6fe1122@syzkaller.appspotmail.com
> Fixes: f7112e6c9abf ("Smack: allow for significantly longer Smack labels v4")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Thanks. I'll be testing these and take them assuming they pass.
> ---
> This check is very straight forward and should fix the bug. But if you
> look at the fixes tag we used to rely on the check:
>
> if (count != (SMK_CIPSOMIN + catlen * SMK_DIGITLEN))
>
> and now that has been changed to:
>
> if (format == SMK_FIXED24_FMT &&
> count != (SMK_CIPSOMIN + catlen * SMK_DIGITLEN))
> goto out;
>
> so it doesn't apply for every format.
>
> security/smack/smackfs.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
> index c21b656b3263..81c6ceeaa4f9 100644
> --- a/security/smack/smackfs.c
> +++ b/security/smack/smackfs.c
> @@ -905,6 +905,10 @@ static ssize_t smk_set_cipso(struct file *file, const char __user *buf,
>
> for (i = 0; i < catlen; i++) {
> rule += SMK_DIGITLEN;
> + if (rule > data + count) {
> + rc = -EOVERFLOW;
> + goto out;
> + }
> ret = sscanf(rule, "%u", &cat);
> if (ret != 1 || cat > SMACK_CIPSO_MAXCATNUM)
> goto out;
^ permalink raw reply