linux-api.vger.kernel.org archive mirror
* [PATCH 00/38] VFS: Introduce filesystem context [ver #10]
@ 2018-07-27 17:31 David Howells
  2018-07-27 17:31 ` [PATCH 01/38] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:31 UTC (permalink / raw)
  To: viro
  Cc: John Johansen, Tejun Heo, Eric W. Biederman, selinux, Paul Moore,
	Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
	Greg Kroah-Hartman, Eric Biggers, linux-security-module,
	Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
	cgroups, torvalds, dhowells, linux-fsdevel, linux-kernel


Hi Al,

[!] NOTE: This is a preview of the patches; Apparmor is currently broken and
    needs fixing.

Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Miklós Szeredi's idea of doing:

	fd = fsopen("nfs");
	fsconfig(fd, fsconfig_set_string, "option", "val", 0);
	fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
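
A rough sketch of that sequence as one C function follows.  It rests on
assumptions: the constant values and syscall numbers are those of the API as
it was eventually merged in mainline (Linux 5.2 onward, on the common
architectures), not the lowercase fsconfig_* names used in this series; the
SK_* macros and sk_mount() are made up for the illustration; and the last
fsmount() argument is taken to be an attribute mask.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback numbers, assumed: only used if the installed headers lack them. */
#ifndef SYS_move_mount
#define SYS_move_mount	429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen	430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig	431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount	432
#endif

/* Hypothetical local names for the merged uapi constants. */
#define SK_FSCONFIG_SET_STRING		1
#define SK_FSCONFIG_CMD_CREATE		6
#define SK_MOUNT_ATTR_NODEV		0x00000004
#define SK_MOVE_MOUNT_F_EMPTY_PATH	0x00000004

/* Returns 0 on success or a negative errno value from the first failing step. */
int sk_mount(const char *fstype, const char *key, const char *val,
	     const char *target)
{
	int fd, mfd, ret = 0;

	/* Step 1: create a context for the named filesystem type. */
	fd = (int)syscall(SYS_fsopen, fstype, 0);
	if (fd < 0)
		return -errno;

	/* Steps 2 and 3: set one string parameter, then create the superblock. */
	if (syscall(SYS_fsconfig, fd, SK_FSCONFIG_SET_STRING, key, val, 0) < 0 ||
	    syscall(SYS_fsconfig, fd, SK_FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		ret = -errno;
		goto out_fd;
	}

	/* Step 4: get a detached mount for the new superblock. */
	mfd = (int)syscall(SYS_fsmount, fd, 0, SK_MOUNT_ATTR_NODEV);
	if (mfd < 0) {
		ret = -errno;
		goto out_fd;
	}

	/* Step 5: attach the detached mount at the target path. */
	if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
		    SK_MOVE_MOUNT_F_EMPTY_PATH) < 0)
		ret = -errno;
	close(mfd);
out_fd:
	close(fd);
	return ret;
}
```

Each step reports its own error, so a caller can tell a bad parameter from a
failed superblock creation, something a single mount(2) call cannot convey.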

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and weird
     combinations of flags make it do different things beyond the original
     specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have to
     be mapped back to a small error code.  Yes, there's dmesg - if you have
     it configured - but you can't necessarily see that if you're doing a
     mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name and
     options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  If that were possible, I
     could, for example, build a container from the outside without having to
     be in that container's namespace.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.
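
To illustrate point (1), here is a simplified userspace model of how the flag
word alone selects which operation mount(2) performs.  The precedence is meant
to mirror the if/else chain in do_mount(), with the MS_REMOUNT|MS_BIND case
split out as its own operation; the SK_* names, the enum and
sk_classify_mount() are invented for this sketch and it is not kernel code.

```c
/* Same values as the MS_* flags; renamed to avoid clashing with <sys/mount.h>. */
#define SK_MS_RDONLY		1UL
#define SK_MS_REMOUNT		32UL
#define SK_MS_BIND		4096UL
#define SK_MS_MOVE		8192UL
#define SK_MS_REC		16384UL
#define SK_MS_UNBINDABLE	(1UL << 17)
#define SK_MS_PRIVATE		(1UL << 18)
#define SK_MS_SLAVE		(1UL << 19)
#define SK_MS_SHARED		(1UL << 20)

enum sk_mount_op {
	SK_OP_NEW_MOUNT,		/* create/find a superblock and mount it */
	SK_OP_REMOUNT_SB,		/* reconfigure the superblock */
	SK_OP_REMOUNT_BIND,		/* change per-mount flags only */
	SK_OP_BIND,			/* bind (or rbind) an existing tree */
	SK_OP_CHANGE_PROPAGATION,	/* shared/private/slave/unbindable */
	SK_OP_MOVE,			/* move an existing mount */
};

/* Decide which of the rolled-together interfaces a flag word selects. */
enum sk_mount_op sk_classify_mount(unsigned long flags)
{
	if ((flags & (SK_MS_REMOUNT | SK_MS_BIND)) ==
	    (SK_MS_REMOUNT | SK_MS_BIND))
		return SK_OP_REMOUNT_BIND;
	if (flags & SK_MS_REMOUNT)
		return SK_OP_REMOUNT_SB;
	if (flags & SK_MS_BIND)
		return SK_OP_BIND;
	if (flags & (SK_MS_SHARED | SK_MS_PRIVATE | SK_MS_SLAVE |
		     SK_MS_UNBINDABLE))
		return SK_OP_CHANGE_PROPAGATION;
	if (flags & SK_MS_MOVE)
		return SK_OP_MOVE;
	return SK_OP_NEW_MOUNT;
}
```

Note that earlier tests win: a flag word with both a propagation flag and
MS_MOVE set is treated as a propagation change and the move is never looked at.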

and some problems in the internal kernel api:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types with
     different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you have
     to generate or otherwise provide text config data which then gets parsed,
     when some of the time you could bypass the parsing stage entirely.

 (3) The amount of data in the data buffer is not known, but the data buffer
     might be on a kernel stack somewhere, leading to the possibility of
     tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times so
     that it can have multiple parsers applied to it or because it has to be
     parsed multiple times - for instance, once to get the preliminary info
     required to access the on-disk superblock and then again to update the
     superblock record in the kernel.
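
The partial-alteration problem in (1) can be avoided by parsing into a staging
copy and committing only if every option parsed.  The sketch below shows that
pattern in plain userspace C; struct sk_opts, the "ro"/"rw"/"timeout=" options
and both function names are hypothetical and do not correspond to any
particular filesystem.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct sk_opts {
	bool read_only;
	int timeout;
};

/* Parse one comma-delimited token of length len into *o; -1 on error. */
static int sk_parse_one(struct sk_opts *o, const char *tok, size_t len)
{
	if (len == 2 && memcmp(tok, "ro", 2) == 0) {
		o->read_only = true;
		return 0;
	}
	if (len == 2 && memcmp(tok, "rw", 2) == 0) {
		o->read_only = false;
		return 0;
	}
	if (len > 8 && memcmp(tok, "timeout=", 8) == 0) {
		char *end;
		long v = strtol(tok + 8, &end, 10);

		/* The number must span the whole token and be in range. */
		if (end == tok + len && v >= 0 && v <= 3600) {
			o->timeout = (int)v;
			return 0;
		}
	}
	return -1;
}

/* Apply optstr to *live atomically: all options take effect or none do. */
int sk_reconfigure(struct sk_opts *live, const char *optstr)
{
	struct sk_opts staged = *live;	/* work on a copy */
	const char *p = optstr;

	while (*p) {
		size_t len = strcspn(p, ",");

		if (len > 0 && sk_parse_one(&staged, p, len) < 0)
			return -1;	/* *live is untouched */
		p += len;
		if (*p == ',')
			p++;
	}
	*live = staged;			/* commit */
	return 0;
}
```

With this shape, "rw,timeout=bogus" leaves the live options exactly as they
were, rather than flipping to read-write before the bad timeout is noticed.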

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to translate
     source identifiers (which may be foreign numeric UIDs/GIDs, text names,
     GUIDs) into system identifiers.  This needs to be done before the
     superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

     The translation information also needs to propagate over an automount in
     some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it is created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	fsconfig(fd, fsconfig_set_namespace, "user", 0, userns_fd);
	fsconfig(fd, fsconfig_set_namespace, "net", 0, netns_fd);

 (3) Namespace propagation.  I want to have a properly defined mechanism for
     propagating namespace configuration over automounts within the kernel.
     This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created superblock
     to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query configuration
     that's stored in userspace when an automount is made.  For instance, to
     look up network parameters for NFS or to find a cache selector for
     fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events and
     EIO.

 (7) Large and binary parameters.  There might be at some point a need to pass
     large/binary objects like Microsoft PACs around.  If I understand PACs
     correctly, you can obtain these from the Kerberos server and then pass
     them to the file server when you connect.

     Being able to pass large or binary objects in individual fsconfig calls
     makes parsing these trivial.  OTOH, some or all of this can
     potentially be handled with the use of the keyrings interface - as the afs
     filesystem does for passing kerberos tokens around; it's just that that
     seems overkill for a parameter you may only need once.
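
As a concrete reading of the translation range element suggested in (1), the
sketch below parses a "<srcuid> <nsuid> <count>" string and maps a source UID
through a table of such ranges.  This is a guess at the semantics (a
contiguous block of count IDs starting at srcuid maps onto a block starting at
nsuid); the struct and function names are invented and nothing like this is in
the series.

```c
#include <stddef.h>
#include <stdio.h>

struct sk_uid_range {
	unsigned int src;	/* first source (foreign) UID */
	unsigned int ns;	/* first UID it maps to locally */
	unsigned int count;	/* number of consecutive UIDs covered */
};

/* Parse "<srcuid> <nsuid> <count>"; returns 0 on success, -1 on bad input. */
int sk_parse_uid_range(const char *s, struct sk_uid_range *r)
{
	unsigned int src, ns, count;
	char trailing;

	/* Exactly three numbers; a fourth conversion means trailing junk. */
	if (sscanf(s, "%u %u %u %c", &src, &ns, &count, &trailing) != 3)
		return -1;
	if (count == 0)
		return -1;
	r->src = src;
	r->ns = ns;
	r->count = count;
	return 0;
}

/* Translate src through the table; returns 0 and sets *out, or -1 if unmapped. */
int sk_translate_uid(const struct sk_uid_range *tab, size_t n,
		     unsigned int src, unsigned int *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (src >= tab[i].src && src - tab[i].src < tab[i].count) {
			*out = tab[i].ns + (src - tab[i].src);
			return 0;
		}
	}
	return -1;
}
```

Because each range arrives as its own small string, the table can be built up
one fsconfig()-style call at a time before the superblock is published.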


===================
SIGNIFICANT CHANGES
===================

 ver #10:

 (*) Renamed "option" to "parameter" in a number of places.

 (*) Replaced the use of write() to drive the configuration with an fsconfig()
     syscall.  This also allows at-style paths and fds to be presented as typed
     objects.

 (*) Routed the key=value parameter concept all the way through from the
     fsconfig() system call to the LSM and filesystem.

 (*) Added a parameter-description concept and helper functions to help
     interpret a parameter and possibly convert the value.

 (*) Made it possible to query the parameter description using the fsinfo()
     syscall.  Added a test-fs-query sample to dump the parameters used by a
     filesystem.

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be picked and
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Especial thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co.

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from a
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is much easier to
     deal with than NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot be used for
     another attempt at superblock creation, and got rid of the flag that says
     it's already mounted.

Possible further developments:

 (*) Implement sb reconfiguration (for now it returns ENOANO).

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

	mount-api

David
---
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (36):
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      vfs: Pass key and value into LSM and FS and provide a helper parser
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: syscall: Add fsinfo() to query filesystem information
      afs: Add fsinfo support
      vfs: Add a sample program for the new mount API
      vfs: Allow fsinfo() to query what's in an fs_context
      vfs: Allow fsinfo() to be used to query an fs parameter description


 Documentation/filesystems/mount_api.txt  |  706 ++++++++++++++++++++++++
 arch/arc/kernel/setup.c                  |    1 
 arch/arm/kernel/atags_parse.c            |    1 
 arch/sh/kernel/setup.c                   |    1 
 arch/sparc/kernel/setup_32.c             |    1 
 arch/sparc/kernel/setup_64.c             |    1 
 arch/x86/entry/syscalls/syscall_32.tbl   |    7 
 arch/x86/entry/syscalls/syscall_64.tbl   |    7 
 arch/x86/kernel/cpu/intel_rdt.h          |   15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  184 ++++--
 arch/x86/kernel/setup.c                  |    1 
 drivers/base/devtmpfs.c                  |    1 
 fs/Kconfig                               |    7 
 fs/Makefile                              |    5 
 fs/afs/internal.h                        |    9 
 fs/afs/mntpt.c                           |  148 +++--
 fs/afs/super.c                           |  605 ++++++++++++++-------
 fs/afs/volume.c                          |    4 
 fs/f2fs/super.c                          |    2 
 fs/file_table.c                          |    9 
 fs/filesystems.c                         |    4 
 fs/fs_context.c                          |  778 +++++++++++++++++++++++++++
 fs/fs_parser.c                           |  476 ++++++++++++++++
 fs/fsopen.c                              |  490 +++++++++++++++++
 fs/hugetlbfs/inode.c                     |  392 ++++++++------
 fs/internal.h                            |   13 
 fs/kernfs/mount.c                        |   89 +--
 fs/libfs.c                               |   19 +
 fs/namei.c                               |    4 
 fs/namespace.c                           |  866 +++++++++++++++++++++++-------
 fs/pnode.c                               |    1 
 fs/proc/inode.c                          |   51 --
 fs/proc/internal.h                       |    6 
 fs/proc/root.c                           |  245 ++++++--
 fs/statfs.c                              |  574 ++++++++++++++++++++
 fs/super.c                               |  368 ++++++++++---
 fs/sysfs/mount.c                         |   67 ++
 include/linux/cgroup.h                   |    3 
 include/linux/fs.h                       |   25 +
 include/linux/fs_context.h               |  207 +++++++
 include/linux/fs_parser.h                |  116 ++++
 include/linux/fsinfo.h                   |   40 +
 include/linux/kernfs.h                   |   39 +
 include/linux/lsm_hooks.h                |   70 ++
 include/linux/module.h                   |    6 
 include/linux/mount.h                    |    5 
 include/linux/security.h                 |   61 ++
 include/linux/syscalls.h                 |   13 
 include/uapi/linux/fcntl.h               |    2 
 include/uapi/linux/fs.h                  |   82 +--
 include/uapi/linux/fsinfo.h              |  301 ++++++++++
 include/uapi/linux/mount.h               |   75 +++
 init/Kconfig                             |   10 
 init/do_mounts.c                         |    1 
 init/do_mounts_initrd.c                  |    1 
 ipc/mqueue.c                             |  121 +++-
 kernel/cgroup/cgroup-internal.h          |   50 +-
 kernel/cgroup/cgroup-v1.c                |  347 +++++++-----
 kernel/cgroup/cgroup.c                   |  256 ++++++---
 kernel/cgroup/cpuset.c                   |   68 ++
 samples/Kconfig                          |    7 
 samples/Makefile                         |    2 
 samples/mount_api/Makefile               |    7 
 samples/mount_api/test-fsmount.c         |  118 ++++
 samples/statx/Makefile                   |    7 
 samples/statx/test-fs-query.c            |  137 +++++
 samples/statx/test-fsinfo.c              |  539 +++++++++++++++++++
 security/apparmor/include/mount.h        |   11 
 security/apparmor/lsm.c                  |  106 ++++
 security/apparmor/mount.c                |   47 ++
 security/security.c                      |   51 ++
 security/selinux/hooks.c                 |  311 ++++++++++-
 security/smack/smack.h                   |   11 
 security/smack/smack_lsm.c               |  370 ++++++++++++-
 security/tomoyo/common.h                 |    3 
 security/tomoyo/mount.c                  |   46 ++
 security/tomoyo/tomoyo.c                 |   15 +
 77 files changed, 8375 insertions(+), 1470 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 include/uapi/linux/mount.h
 create mode 100644 samples/mount_api/Makefile
 create mode 100644 samples/mount_api/test-fsmount.c
 create mode 100644 samples/statx/test-fs-query.c
 create mode 100644 samples/statx/test-fsinfo.c


* [PATCH 01/38] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
@ 2018-07-27 17:31 ` David Howells
  2018-07-27 17:31 ` [PATCH 02/38] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:31 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

From: Al Viro <viro@zeniv.linux.org.uk>

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)).  flags should be an OR of
some of the following:
	* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
	* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
	* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within the mount containing the
location in question.  In other words, the same as mount --rbind
or mount --bind would've taken.  The detached tree will be
dissolved on the final close of obtained file.  Creation of such
detached trees requires the same capabilities as doing mount --bind.
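
The flag rules above can be restated as a small userspace model.  It follows
the checks at the top of the open_tree() implementation in this patch (unknown
bits are rejected, and AT_RECURSIVE is only valid together with
OPEN_TREE_CLONE).  The SK_* names, struct sk_open_tree_req and
sk_check_open_tree_flags() are hypothetical, and the SK_O_CLOEXEC value is the
x86 one, assumed here for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#define SK_AT_SYMLINK_NOFOLLOW	0x100
#define SK_AT_NO_AUTOMOUNT	0x800
#define SK_AT_EMPTY_PATH	0x1000
#define SK_AT_RECURSIVE		0x8000		/* value added by this patch */
#define SK_O_CLOEXEC		02000000	/* x86 value, assumed */
#define SK_OPEN_TREE_CLONE	1
#define SK_OPEN_TREE_CLOEXEC	SK_O_CLOEXEC

struct sk_open_tree_req {
	bool detached;		/* create a detached clone */
	bool recursive;		/* clone the whole subtree (like --rbind) */
	bool cloexec;		/* returned fd is close-on-exec */
};

/* Returns 0 and fills *req for a valid flag word, or -EINVAL otherwise. */
int sk_check_open_tree_flags(unsigned int flags, struct sk_open_tree_req *req)
{
	const unsigned int known = SK_AT_EMPTY_PATH | SK_AT_NO_AUTOMOUNT |
				   SK_AT_RECURSIVE | SK_AT_SYMLINK_NOFOLLOW |
				   SK_OPEN_TREE_CLONE | SK_OPEN_TREE_CLOEXEC;

	if (flags & ~known)
		return -EINVAL;

	/* AT_RECURSIVE on its own is meaningless: nothing is being cloned. */
	if ((flags & (SK_AT_RECURSIVE | SK_OPEN_TREE_CLONE)) == SK_AT_RECURSIVE)
		return -EINVAL;

	req->detached = flags & SK_OPEN_TREE_CLONE;
	req->recursive = flags & SK_AT_RECURSIVE;
	req->cloexec = flags & SK_OPEN_TREE_CLOEXEC;
	return 0;
}
```

In the real syscall a detached request additionally needs the capability check
described above, which this model leaves out.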

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/file_table.c                        |    9 +-
 fs/internal.h                          |    1 
 fs/namespace.c                         |  132 +++++++++++++++++++++++++++-----
 include/linux/fs.h                     |    3 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fcntl.h             |    2 
 include/uapi/linux/mount.h             |   10 ++
 9 files changed, 135 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/mount.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..ea1b413afd47 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..0545bed581dc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	open_tree		__x64_sys_open_tree
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..7480271a0d21 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -189,6 +189,7 @@ static void __fput(struct file *file)
 	struct dentry *dentry = file->f_path.dentry;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct inode *inode = file->f_inode;
+	fmode_t mode = file->f_mode;
 
 	might_sleep();
 
@@ -209,14 +210,14 @@ static void __fput(struct file *file)
 		file->f_op->release(inode, file);
 	security_file_free(file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
-		     !(file->f_mode & FMODE_PATH))) {
+		     !(mode & FMODE_PATH))) {
 		cdev_put(inode->i_cdev);
 	}
 	fops_put(file->f_op);
 	put_pid(file->f_owner.pid);
-	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_dec(inode);
-	if (file->f_mode & FMODE_WRITER) {
+	if (mode & FMODE_WRITER) {
 		put_write_access(inode);
 		__mnt_drop_write(mnt);
 	}
@@ -224,6 +225,8 @@ static void __fput(struct file *file)
 	file->f_path.mnt = NULL;
 	file->f_inode = NULL;
 	file_free(file);
+	if (unlikely(mode & FMODE_NEED_UNMOUNT))
+		dissolve_on_fput(mnt);
 	dput(dentry);
 	mntput(mnt);
 }
diff --git a/fs/internal.h b/fs/internal.h
index 56533b08532e..383ee4724f77 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern void __mnt_drop_write(struct vfsmount *);
 extern void __mnt_drop_write_file(struct file *);
 extern void mnt_drop_write_file_path(struct file *);
 
+extern void dissolve_on_fput(struct vfsmount *);
 /*
  * fs_struct.c
  */
diff --git a/fs/namespace.c b/fs/namespace.c
index 03cc3b5bcf00..a4a01ecbcacd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
 #include <linux/init.h>		/* init_rootfs */
 #include <linux/fs_struct.h>	/* get_fs_root et.al. */
 #include <linux/fsnotify.h>	/* fsnotify_vfsmount_delete */
+#include <linux/file.h>
 #include <linux/uaccess.h>
 #include <linux/proc_ns.h>
 #include <linux/magic.h>
 #include <linux/bootmem.h>
 #include <linux/task_work.h>
 #include <linux/sched/task.h>
+#include <uapi/linux/mount.h>
 
 #include "pnode.h"
 #include "internal.h"
@@ -1840,6 +1842,16 @@ struct vfsmount *collect_mounts(const struct path *path)
 	return &tree->mnt;
 }
 
+void dissolve_on_fput(struct vfsmount *mnt)
+{
+	namespace_lock();
+	lock_mount_hash();
+	mntget(mnt);
+	umount_tree(real_mount(mnt), UMOUNT_SYNC);
+	unlock_mount_hash();
+	namespace_unlock();
+}
+
 void drop_collected_mounts(struct vfsmount *mnt)
 {
 	namespace_lock();
@@ -2199,6 +2211,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
 	return false;
 }
 
+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+	struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+	if (IS_MNT_UNBINDABLE(old))
+		return mnt;
+
+	if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+		return mnt;
+
+	if (!recurse && has_locked_children(old, old_path->dentry))
+		return mnt;
+
+	if (recurse)
+		mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+	else
+		mnt = clone_mnt(old, old_path->dentry, 0);
+
+	if (!IS_ERR(mnt))
+		mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+	return mnt;
+}
+
 /*
  * do loopback mount.
  */
@@ -2206,7 +2242,7 @@ static int do_loopback(struct path *path, const char *old_name,
 				int recurse)
 {
 	struct path old_path;
-	struct mount *mnt = NULL, *old, *parent;
+	struct mount *mnt = NULL, *parent;
 	struct mountpoint *mp;
 	int err;
 	if (!old_name || !*old_name)
@@ -2220,38 +2256,21 @@ static int do_loopback(struct path *path, const char *old_name,
 		goto out;
 
 	mp = lock_mount(path);
-	err = PTR_ERR(mp);
-	if (IS_ERR(mp))
+	if (IS_ERR(mp)) {
+		err = PTR_ERR(mp);
 		goto out;
+	}
 
-	old = real_mount(old_path.mnt);
 	parent = real_mount(path->mnt);
-
-	err = -EINVAL;
-	if (IS_MNT_UNBINDABLE(old))
-		goto out2;
-
 	if (!check_mnt(parent))
 		goto out2;
 
-	if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
-		goto out2;
-
-	if (!recurse && has_locked_children(old, old_path.dentry))
-		goto out2;
-
-	if (recurse)
-		mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
-	else
-		mnt = clone_mnt(old, old_path.dentry, 0);
-
+	mnt = __do_loopback(&old_path, recurse);
 	if (IS_ERR(mnt)) {
 		err = PTR_ERR(mnt);
 		goto out2;
 	}
 
-	mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
 	err = graft_tree(mnt, parent, mp);
 	if (err) {
 		lock_mount_hash();
@@ -2265,6 +2284,75 @@ static int do_loopback(struct path *path, const char *old_name,
 	return err;
 }
 
+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+	struct file *file;
+	struct path path;
+	int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+	bool detached = flags & OPEN_TREE_CLONE;
+	int error;
+	int fd;
+
+	BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+	if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+		      AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+		      OPEN_TREE_CLOEXEC))
+		return -EINVAL;
+
+	if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+		return -EINVAL;
+
+	if (flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+	if (detached && !may_mount())
+		return -EPERM;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	error = user_path_at(dfd, filename, lookup_flags, &path);
+	if (error)
+		goto out;
+
+	if (detached) {
+		struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+		if (IS_ERR(mnt)) {
+			error = PTR_ERR(mnt);
+			goto out2;
+		}
+		mntput(path.mnt);
+		path.mnt = &mnt->mnt;
+	}
+
+	file = dentry_open(&path, O_PATH, current_cred());
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out3;
+	}
+
+	if (detached)
+		file->f_mode |= FMODE_NEED_UNMOUNT;
+	path_put(&path);
+	fd_install(fd, file);
+	return fd;
+
+out3:
+	if (detached)
+		dissolve_on_fput(path.mnt);
+out2:
+	path_put(&path);
+out:
+	put_unused_fd(fd);
+	return error;
+}
+
 static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
 {
 	int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3a18cddb74e..067f0e31aec7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -154,6 +154,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is capable of returning -EAGAIN if I/O will block */
 #define FMODE_NOWAIT	((__force fmode_t)0x8000000)
 
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT     ((__force fmode_t)0x10000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 73810808cdf2..3cc6b8f8bd2f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -900,6 +900,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
 #define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
 #define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
 
+#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
+
 
 #endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..e8db2911adca
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * open_tree() flags.
+ */
+#define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
+
+#endif /* _UAPI_LINUX_MOUNT_H */

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 02/38] vfs: syscall: Add move_mount(2) to move mounts around [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
  2018-07-27 17:31 ` [PATCH 01/38] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-07-27 17:31 ` David Howells
  2018-07-27 17:34 ` [PATCH 26/38] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:31 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow an unattached mount tree to be
attached.

The new system call looks like the following:

	int move_mount(int from_dfd, const char *from_path,
		       int to_dfd, const char *to_path,
		       unsigned int flags);
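
The flags argument is rejected with EINVAL if it has any bit set outside
MOVE_MOUNT__MASK.  As a minimal userspace sketch of that check, assuming the
flag values from the include/uapi/linux/mount.h hunk in this patch (the
helper name move_mount_flags_valid() is hypothetical and not part of the
patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the include/uapi/linux/mount.h hunk in this patch. */
#define MOVE_MOUNT_F_SYMLINKS	0x00000001 /* Follow symlinks on from path */
#define MOVE_MOUNT_F_AUTOMOUNTS	0x00000002 /* Follow automounts on from path */
#define MOVE_MOUNT_F_EMPTY_PATH	0x00000004 /* Empty from path permitted */
#define MOVE_MOUNT_T_SYMLINKS	0x00000010 /* Follow symlinks on to path */
#define MOVE_MOUNT_T_AUTOMOUNTS	0x00000020 /* Follow automounts on to path */
#define MOVE_MOUNT_T_EMPTY_PATH	0x00000040 /* Empty to path permitted */
#define MOVE_MOUNT__MASK	0x00000077

/* The mask is expected to be exactly the OR of the six flags. */
static_assert((MOVE_MOUNT_F_SYMLINKS | MOVE_MOUNT_F_AUTOMOUNTS |
	       MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_SYMLINKS |
	       MOVE_MOUNT_T_AUTOMOUNTS | MOVE_MOUNT_T_EMPTY_PATH) ==
	      MOVE_MOUNT__MASK, "mask must cover every flag and nothing else");

/*
 * Hypothetical helper: mirrors the "flags & ~MOVE_MOUNT__MASK" test that
 * the syscall performs before doing any path lookup.
 */
static bool move_mount_flags_valid(unsigned int flags)
{
	return (flags & ~MOVE_MOUNT__MASK) == 0;
}
```

A caller could use such a helper to catch a bad flag combination before
issuing the syscall; the kernel performs the same test regardless.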

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  102 ++++++++++++++++++++++++++------
 include/linux/lsm_hooks.h              |    6 ++
 include/linux/security.h               |    7 ++
 include/linux/syscalls.h               |    3 +
 include/uapi/linux/mount.h             |   11 +++
 security/security.c                    |    5 ++
 8 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ea1b413afd47..76d092b7d1b0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
+388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0545bed581dc..37ba4e65eee6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
+336	common	move_mount		__x64_sys_move_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index a4a01ecbcacd..e2934a4f342b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2447,43 +2447,37 @@ static inline int tree_contains_unbindable(struct mount *mnt)
 	return 0;
 }
 
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
 {
-	struct path old_path, parent_path;
+	struct path parent_path = {.mnt = NULL, .dentry = NULL};
 	struct mount *p;
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
-	if (!old_name || !*old_name)
-		return -EINVAL;
-	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
-	if (err)
-		return err;
 
-	mp = lock_mount(path);
+	mp = lock_mount(new_path);
 	err = PTR_ERR(mp);
 	if (IS_ERR(mp))
 		goto out;
 
-	old = real_mount(old_path.mnt);
-	p = real_mount(path->mnt);
+	old = real_mount(old_path->mnt);
+	p = real_mount(new_path->mnt);
 
 	err = -EINVAL;
 	if (!check_mnt(p) || !check_mnt(old))
 		goto out1;
 
-	if (old->mnt.mnt_flags & MNT_LOCKED)
+	if (!mnt_has_parent(old))
 		goto out1;
 
-	err = -EINVAL;
-	if (old_path.dentry != old_path.mnt->mnt_root)
+	if (old->mnt.mnt_flags & MNT_LOCKED)
 		goto out1;
 
-	if (!mnt_has_parent(old))
+	if (old_path->dentry != old_path->mnt->mnt_root)
 		goto out1;
 
-	if (d_is_dir(path->dentry) !=
-	      d_is_dir(old_path.dentry))
+	if (d_is_dir(new_path->dentry) !=
+	    d_is_dir(old_path->dentry))
 		goto out1;
 	/*
 	 * Don't move a mount residing in a shared parent.
@@ -2501,7 +2495,8 @@ static int do_move_mount(struct path *path, const char *old_name)
 		if (p == old)
 			goto out1;
 
-	err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+	err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+				   &parent_path);
 	if (err)
 		goto out1;
 
@@ -2513,6 +2508,22 @@ static int do_move_mount(struct path *path, const char *old_name)
 out:
 	if (!err)
 		path_put(&parent_path);
+	return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+	struct path old_path;
+	int err;
+
+	if (!old_name || !*old_name)
+		return -EINVAL;
+
+	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+	if (err)
+		return err;
+
+	err = do_move_mount(&old_path, path);
 	path_put(&old_path);
 	return err;
 }
@@ -2934,7 +2945,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 	else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
 		retval = do_change_type(&path, flags);
 	else if (flags & MS_MOVE)
-		retval = do_move_mount(&path, dev_name);
+		retval = do_move_mount_old(&path, dev_name);
 	else
 		retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
 				      dev_name, data_page, data_size);
@@ -3169,6 +3180,61 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
 	return ksys_mount(dev_name, dir_name, type, flags, data);
 }
 
+/*
+ * Move a mount from one place to another.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+		int, from_dfd, const char *, from_pathname,
+		int, to_dfd, const char *, to_pathname,
+		unsigned int, flags)
+{
+	struct path from_path, to_path;
+	unsigned int lflags;
+	int ret = 0;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if (flags & ~MOVE_MOUNT__MASK)
+		return -EINVAL;
+
+	/* If someone gives a pathname, they aren't permitted to move
+	 * from an fd that requires unmount as we can't get at the flag
+	 * to clear it afterwards.
+	 */
+	lflags = 0;
+	if (flags & MOVE_MOUNT_F_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_F_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_F_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(from_dfd, from_pathname, lflags, &from_path);
+	if (ret < 0)
+		return ret;
+
+	lflags = 0;
+	if (flags & MOVE_MOUNT_T_SYMLINKS)	lflags |= LOOKUP_FOLLOW;
+	if (flags & MOVE_MOUNT_T_AUTOMOUNTS)	lflags |= LOOKUP_AUTOMOUNT;
+	if (flags & MOVE_MOUNT_T_EMPTY_PATH)	lflags |= LOOKUP_EMPTY;
+
+	ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+	if (ret < 0)
+		goto out_from;
+
+	ret = security_move_mount(&from_path, &to_path);
+	if (ret < 0)
+		goto out_to;
+
+	ret = do_move_mount(&from_path, &to_path);
+
+out_to:
+	path_put(&to_path);
+out_from:
+	path_put(&from_path);
+	return ret;
+}
+
 /*
  * Return true if path is reachable from root
  *
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index b43bbc893074..924424e7be8f 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -147,6 +147,10 @@
  *	Parse a string of security data filling in the opts structure
  *	@options string containing all mount options known by the LSM
  *	@opts binary data structure usable by the LSM
+ * @move_mount:
+ *	Check permission before a mount is moved.
+ *	@from_path indicates the mount that is going to be moved.
+ *	@to_path indicates the mountpoint that will be mounted upon.
  * @dentry_init_security:
  *	Compute a context for a dentry as the inode is not yet available
  *	since NFSv4 has no label backed by an EA anyway.
@@ -1480,6 +1484,7 @@ union security_list_options {
 					unsigned long kern_flags,
 					unsigned long *set_kern_flags);
 	int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+	int (*move_mount)(const struct path *from_path, const struct path *to_path);
 	int (*dentry_init_security)(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -1811,6 +1816,7 @@ struct security_hook_heads {
 	struct hlist_head sb_set_mnt_opts;
 	struct hlist_head sb_clone_mnt_opts;
 	struct hlist_head sb_parse_opts_str;
+	struct hlist_head move_mount;
 	struct hlist_head dentry_init_security;
 	struct hlist_head dentry_create_files_as;
 #ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 1498b9e0539b..9bb5bc6d596c 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -245,6 +245,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
 				unsigned long kern_flags,
 				unsigned long *set_kern_flags);
 int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
 int security_dentry_init_security(struct dentry *dentry, int mode,
 					const struct qstr *name, void **ctx,
 					u32 *ctxlen);
@@ -599,6 +600,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
 	return 0;
 }
 
+static inline int security_move_mount(const struct path *from_path,
+				      const struct path *to_path)
+{
+	return 0;
+}
+
 static inline int security_inode_alloc(struct inode *inode)
 {
 	return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3cc6b8f8bd2f..3c0855d9b105 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
 asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
+asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
+			       int to_dfd, const char __user *to_path,
+			       unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index e8db2911adca..89adf0d731ab 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -7,4 +7,15 @@
 #define OPEN_TREE_CLONE		1		/* Clone the target tree and attach the clone */
 #define OPEN_TREE_CLOEXEC	O_CLOEXEC	/* Close the file on execve() */
 
+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS		0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS		0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH		0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK		0x00000077
+
 #endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/security/security.c b/security/security.c
index 7cafc1c90d16..5149c2cbe8a7 100644
--- a/security/security.c
+++ b/security/security.c
@@ -439,6 +439,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
 }
 EXPORT_SYMBOL(security_sb_parse_opts_str);
 
+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+	return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
 int security_inode_alloc(struct inode *inode)
 {
 	inode->i_security = NULL;

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 26/38] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
  2018-07-27 17:31 ` [PATCH 01/38] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
  2018-07-27 17:31 ` [PATCH 02/38] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
@ 2018-07-27 17:34 ` David Howells
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:34 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle.  fsopen() is given the name of the filesystem that will be used:

	int mfd = fsopen(const char *fsname, unsigned int flags);

where flags can be 0 or FSOPEN_CLOEXEC.

For example:

	sfd = fsopen("ext4", FSOPEN_CLOEXEC);
	fsconfig(sfd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
	fsconfig(sfd, fsconfig_set_flag, "noatime", NULL, 0);
	fsconfig(sfd, fsconfig_set_flag, "acl", NULL, 0);
	fsconfig(sfd, fsconfig_set_flag, "user_xattr", NULL, 0);
	fsconfig(sfd, fsconfig_set_string, "sb", "1", 0);
	fsconfig(sfd, fsconfig_cmd_create, NULL, NULL, 0);
	fsinfo(sfd, NULL, ...); // query new superblock attributes
	mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

	sfd = fsopen("afs", 0);
	fsconfig(sfd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
	fsconfig(sfd, fsconfig_cmd_create, NULL, NULL, 0);
	mfd = fsmount(sfd, 0, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:

	"e <subsys>:<problem>"
	"e SELinux:Mount on mountpoint not permitted"

Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails.  read() is still possible to retrieve error
information.
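
As a rough illustration of consuming such a message in userspace, the
"e <subsys>:<problem>" text could be split as sketched below.  The struct
fs_errmsg type, its buffer sizes and the parse_fs_errmsg() name are all
hypothetical; only the message shape comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; names and buffer sizes are illustrative only. */
struct fs_errmsg {
	char subsys[32];
	char problem[128];
};

/*
 * Split "e <subsys>:<problem>" into its two parts.
 *
 * Returns 0 on success, or -1 if the text does not have that shape or a
 * part does not fit in its destination buffer.
 */
static int parse_fs_errmsg(const char *msg, struct fs_errmsg *out)
{
	const char *colon;
	size_t slen, plen;

	assert(msg != NULL && out != NULL);

	/* Error messages carry an "e " prefix. */
	if (strncmp(msg, "e ", 2) != 0)
		return -1;
	msg += 2;

	/* The first colon separates the subsystem from the problem text. */
	colon = strchr(msg, ':');
	if (!colon)
		return -1;

	slen = (size_t)(colon - msg);
	plen = strlen(colon + 1);
	if (slen == 0 || slen >= sizeof(out->subsys) ||
	    plen >= sizeof(out->problem))
		return -1;

	memcpy(out->subsys, msg, slen);
	out->subsys[slen] = '\0';
	memcpy(out->problem, colon + 1, plen + 1);	/* includes the NUL */
	return 0;
}
```

The caller would read() the message from the context fd into a
NUL-terminated buffer first and treat ENODATA as "nothing to report".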

The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.

Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.

Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/Makefile                            |    2 -
 fs/fs_context.c                        |    4 +
 fs/fsopen.c                            |   87 ++++++++++++++++++++++++++++++++
 include/linux/fs_context.h             |    4 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    5 ++
 8 files changed, 104 insertions(+), 1 deletion(-)
 create mode 100644 fs/fsopen.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76d092b7d1b0..1647fefd2969 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
+389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37ba4e65eee6..235d33dbccb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
+337	common	fsopen			__x64_sys_fsopen
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index ae681523b4b1..e3ea8093b178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		seq_file.o xattr.o libfs.o fs-writeback.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
-		fs_context.o fs_parser.o
+		fs_context.o fs_parser.o fsopen.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
index c298cbfb62a2..7259caf42c24 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -263,6 +263,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
 	fc->fs_type	= get_filesystem(fs_type);
 	fc->cred	= get_current_cred();
 
+	mutex_init(&fc->uapi_mutex);
+
 	switch (purpose) {
 	case FS_CONTEXT_FOR_KERNEL_MOUNT:
 		fc->sb_flags |= SB_KERNMOUNT;
@@ -347,6 +349,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
 	if (!fc)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&fc->uapi_mutex);
+
 	fc->fs_private	= NULL;
 	fc->s_fs_info	= NULL;
 	fc->source	= NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..f30080e1ebc4
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,87 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/anon_inodes.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include "mount.h"
+
+static int fscontext_release(struct inode *inode, struct file *file)
+{
+	struct fs_context *fc = file->private_data;
+
+	if (fc) {
+		file->private_data = NULL;
+		put_fs_context(fc);
+	}
+	return 0;
+}
+
+const struct file_operations fscontext_fops = {
+	.release	= fscontext_release,
+	.llseek		= no_llseek,
+};
+
+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
+{
+	int fd;
+
+	fd = anon_inode_getfd("fscontext", &fscontext_fops, fc,
+			      O_RDWR | o_flags);
+	if (fd < 0)
+		put_fs_context(fc);
+	return fd;
+}
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
+{
+	struct file_system_type *fs_type;
+	struct fs_context *fc;
+	const char *fs_name;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (flags & ~FSOPEN_CLOEXEC)
+		return -EINVAL;
+
+	fs_name = strndup_user(_fs_name, PAGE_SIZE);
+	if (IS_ERR(fs_name))
+		return PTR_ERR(fs_name);
+
+	fs_type = get_fs_type(fs_name);
+	kfree(fs_name);
+	if (!fs_type)
+		return -ENODEV;
+
+	fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+	put_filesystem(fs_type);
+	if (IS_ERR(fc))
+		return PTR_ERR(fc);
+
+	fc->phase = FS_CONTEXT_CREATE_PARAMS;
+	return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 8d812fcc5f54..488d30de1f4f 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -14,6 +14,7 @@
 
 #include <linux/kernel.h>
 #include <linux/errno.h>
+#include <linux/mutex.h>
 
 struct cred;
 struct dentry;
@@ -87,6 +88,7 @@ struct fs_parameter {
  */
 struct fs_context {
 	const struct fs_context_operations *ops;
+	struct mutex		uapi_mutex;	/* Userspace access mutex */
 	struct file_system_type	*fs_type;
 	void			*fs_private;	/* The filesystem's context */
 	struct dentry		*root;		/* The root and superblock */
@@ -142,6 +144,8 @@ extern int vfs_get_super(struct fs_context *fc,
 			 int (*fill_super)(struct super_block *sb,
 					   struct fs_context *fc));
 
+extern const struct file_operations fscontext_fops;
+
 #define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)
 
 /**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3c0855d9b105..ad6c7ff33c01 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -904,6 +904,7 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
 asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 1c982eb44ff4..f8818e6cddd6 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -344,4 +344,9 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC		0x00000001
+
 #endif /* _UAPI_LINUX_FS_H */

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
                   ` (2 preceding siblings ...)
  2018-07-27 17:34 ` [PATCH 26/38] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
@ 2018-07-27 17:34 ` David Howells
  2018-07-27 19:42   ` Andy Lutomirski
                     ` (3 more replies)
  2018-07-27 17:34 ` [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
                   ` (2 subsequent siblings)
  6 siblings, 4 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:34 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.

    long fsconfig(int fs_fd, unsigned int cmd, const char *key,
		  const void *value, int aux);

Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.

The following command IDs are proposed:

 (*) fsconfig_set_flag: No value is specified.  The parameter must be
     boolean in nature.  The key may be prefixed with "no" to invert the
     setting. value must be NULL and aux must be 0.

 (*) fsconfig_set_string: A string value is specified.  The parameter can
     be expecting boolean, integer, string or take a path.  A conversion to
     an appropriate type will be attempted (which may include looking up as
     a path).  value points to a NUL-terminated string and aux must be 0.

 (*) fsconfig_set_binary: A binary blob is specified.  value points to
     the blob and aux indicates its size.  The parameter must be expecting
     a blob.

 (*) fsconfig_set_path: A non-empty path is specified.  The parameter must
     be expecting a path object.  value points to a NUL-terminated string
     that is the path and aux is a file descriptor at which to start a
     relative lookup or AT_FDCWD.

 (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
     implied.

 (*) fsconfig_set_fd: An open file descriptor is specified.  value must
     be NULL and aux indicates the file descriptor.

 (*) fsconfig_cmd_create: Trigger superblock creation.

 (*) fsconfig_cmd_reconfigure: Trigger superblock reconfiguration.
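
The per-command constraints on key, value and aux listed above can be
summarised as a small userspace sketch that mirrors the argument checks at
the top of the syscall.  The enum values below are assumed for illustration
(only the command names come from this patch), and fsconfig_check_args() is
a hypothetical helper, not something the patch provides.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* AT_FDCWD */
#include <stddef.h>

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* fallback if the C library does not expose it */
#endif

/* The path check below relies on AT_FDCWD being a negative sentinel. */
static_assert(AT_FDCWD < 0, "AT_FDCWD is expected to be negative");

/* Command names come from the list above; the numeric values are assumed. */
enum fsconfig_command {
	fsconfig_set_flag,
	fsconfig_set_string,
	fsconfig_set_binary,
	fsconfig_set_path,
	fsconfig_set_path_empty,
	fsconfig_set_fd,
	fsconfig_cmd_create,
	fsconfig_cmd_reconfigure,
};

/* Returns 0 if the argument combination is acceptable for cmd. */
static int fsconfig_check_args(enum fsconfig_command cmd, const char *key,
			       const void *value, int aux)
{
	switch (cmd) {
	case fsconfig_set_flag:		/* key only */
		return (!key || value || aux) ? -EINVAL : 0;
	case fsconfig_set_string:	/* key + string, aux unused */
		return (!key || !value || aux) ? -EINVAL : 0;
	case fsconfig_set_binary:	/* key + blob, aux is size (max 1MiB) */
		return (!key || !value || aux <= 0 || aux > 1024 * 1024) ?
			-EINVAL : 0;
	case fsconfig_set_path:
	case fsconfig_set_path_empty:	/* key + path, aux is dirfd */
		return (!key || !value || (aux != AT_FDCWD && aux < 0)) ?
			-EINVAL : 0;
	case fsconfig_set_fd:		/* key only, aux is the fd */
		return (!key || value || aux < 0) ? -EINVAL : 0;
	case fsconfig_cmd_create:
	case fsconfig_cmd_reconfigure:	/* no arguments at all */
		return (key || value || aux) ? -EINVAL : 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

Note that fsconfig_set_fd requires value to be NULL, so passing an empty
string there is rejected just like any other non-NULL pointer.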

For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take.  The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.

Source specification is also done the same way, using special keys
"source", "source1", "source2", etc.

[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem.  Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.

Example usage:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
    fsconfig(fd, fsconfig_set_path_empty, "journal_path", "", journal_fd);
    fsconfig(fd, fsconfig_set_fd, "journal_fd", NULL, journal_fd);
    fsconfig(fd, fsconfig_set_flag, "user_xattr", NULL, 0);
    fsconfig(fd, fsconfig_set_flag, "noacl", NULL, 0);
    fsconfig(fd, fsconfig_set_string, "sb", "1", 0);
    fsconfig(fd, fsconfig_set_string, "errors", "continue", 0);
    fsconfig(fd, fsconfig_set_string, "data", "journal", 0);
    fsconfig(fd, fsconfig_set_string, "context", "unconfined_u:...", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "/dev/sda1", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("afs", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "#grand.central.org:root.cell", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

or:

    fd = fsopen("jffs2", FSOPEN_CLOEXEC);
    fsconfig(fd, fsconfig_set_string, "source", "mtd0", 0);
    fsconfig(fd, fsconfig_cmd_create, NULL, NULL, 0);
    mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/fsopen.c                            |  278 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    2 
 include/uapi/linux/fs.h                |   14 ++
 5 files changed, 296 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1647fefd2969..f9970310c126 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
 387	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
+390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 235d33dbccb2..4185d36e03bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
 335	common	open_tree		__x64_sys_open_tree
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
+338	common	fsconfig		__x64_sys_fsconfig
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 7a25b4c3bc18..d2d23c02839a 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/fs_context.h>
+#include <linux/fs_parser.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
@@ -17,6 +18,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/namei.h>
 #include <linux/file.h>
+#include "internal.h"
 #include "mount.h"
 
 /*
@@ -152,3 +154,279 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	put_fs_context(fc);
 	return ret;
 }
+
+/*
+ * Check the state and apply the configuration.  Note that this function is
+ * allowed to 'steal' the value by setting param->xxx to NULL before returning.
+ */
+static int vfs_fsconfig(struct fs_context *fc, struct fs_parameter *param)
+{
+	int ret;
+
+	/* We need to reinitialise the context if we have reconfiguration
+	 * pending after creation or a previous reconfiguration.
+	 */
+	if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+		if (fc->fs_type->init_fs_context) {
+			ret = fc->fs_type->init_fs_context(fc, fc->root);
+			if (ret < 0) {
+				fc->phase = FS_CONTEXT_FAILED;
+				return ret;
+			}
+		} else {
+			/* Leave legacy context ops in place */
+		}
+
+		/* Do the security check last because ->init_fs_context may
+		 * change the namespace subscriptions.
+		 */
+		ret = security_fs_context_alloc(fc, fc->root);
+		if (ret < 0) {
+			fc->phase = FS_CONTEXT_FAILED;
+			return ret;
+		}
+
+		fc->phase = FS_CONTEXT_RECONF_PARAMS;
+	}
+
+	if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+	    fc->phase != FS_CONTEXT_RECONF_PARAMS)
+		return -EBUSY;
+
+	return vfs_parse_fs_param(fc, param);
+}
+
+/*
+ * Perform an action on a context.
+ */
+static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
+{
+	int ret = -EINVAL;
+
+	switch (cmd) {
+	case fsconfig_cmd_create:
+		if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+			return -EBUSY;
+		fc->phase = FS_CONTEXT_CREATING;
+		ret = vfs_get_tree(fc);
+		if (ret == 0)
+			fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+		else
+			fc->phase = FS_CONTEXT_FAILED;
+		return ret;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * sys_fsconfig - Set parameters and trigger actions on a context
+ * @fd: The filesystem context to act upon
+ * @cmd: The action to take
+ * @_key: Where appropriate, the parameter key to set
+ * @_value: Where appropriate, the parameter value to set
+ * @aux: Additional information for the value
+ *
+ * This system call is used to set parameters on a context, including
+ * superblock settings, data source and security labelling.
+ *
+ * Actions include triggering the creation of a superblock and the
+ * reconfiguration of the superblock attached to the specified context.
+ *
+ * When setting a parameter, @cmd indicates the type of value being proposed
+ * and @_key indicates the parameter to be altered.
+ *
+ * @_value and @aux are used to specify the value, should a value be required:
+ *
+ * (*) fsconfig_set_flag: No value is specified.  The parameter must be boolean
+ *     in nature.  The key may be prefixed with "no" to invert the
+ *     setting. @_value must be NULL and @aux must be 0.
+ *
+ * (*) fsconfig_set_string: A string value is specified.  The parameter can be
+ *     expecting boolean, integer, string or take a path.  A conversion to an
+ *     appropriate type will be attempted (which may include looking up as a
+ *     path).  @_value points to a NUL-terminated string and @aux must be 0.
+ *
+ * (*) fsconfig_set_binary: A binary blob is specified.  @_value points to the
+ *     blob and @aux indicates its size.  The parameter must be expecting a
+ *     blob.
+ *
+ * (*) fsconfig_set_path: A non-empty path is specified.  The parameter must be
+ *     expecting a path object.  @_value points to a NUL-terminated string that
+ *     is the path and @aux is a file descriptor at which to start a relative
+ *     lookup or AT_FDCWD.
+ *
+ * (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
+ *     implied.
+ *
+ * (*) fsconfig_set_fd: An open file descriptor is specified.  @_value must be
+ *     NULL and @aux indicates the file descriptor.
+ */
+SYSCALL_DEFINE5(fsconfig,
+		int, fd,
+		unsigned int, cmd,
+		const char __user *, _key,
+		const void __user *, _value,
+		int, aux)
+{
+	struct fs_context *fc;
+	struct fd f;
+	int ret;
+
+	struct fs_parameter param = {
+		.type	= fs_value_is_undefined,
+	};
+
+	if (fd < 0)
+		return -EINVAL;
+
+	switch (cmd) {
+	case fsconfig_set_flag:
+		if (!_key || _value || aux)
+			return -EINVAL;
+		break;
+	case fsconfig_set_string:
+		if (!_key || !_value || aux)
+			return -EINVAL;
+		break;
+	case fsconfig_set_binary:
+		if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
+			return -EINVAL;
+		break;
+	case fsconfig_set_path:
+	case fsconfig_set_path_empty:
+		if (!_key || !_value || (aux != AT_FDCWD && aux < 0))
+			return -EINVAL;
+		break;
+	case fsconfig_set_fd:
+		if (!_key || _value || aux < 0)
+			return -EINVAL;
+		break;
+	case fsconfig_cmd_create:
+	case fsconfig_cmd_reconfigure:
+		if (_key || _value || aux)
+			return -EINVAL;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fops)
+		goto out_f;
+
+	fc = f.file->private_data;
+	if (fc->ops == &legacy_fs_context_ops) {
+		switch (cmd) {
+		case fsconfig_set_binary:
+		case fsconfig_set_path:
+		case fsconfig_set_path_empty:
+		case fsconfig_set_fd:
+			ret = -EOPNOTSUPP;
+			goto out_f;
+		}
+	}
+
+	if (_key) {
+		param.key = strndup_user(_key, 256);
+		if (IS_ERR(param.key)) {
+			ret = PTR_ERR(param.key);
+			goto out_f;
+		}
+	}
+
+	switch (cmd) {
+	case fsconfig_set_string:
+		param.type = fs_value_is_string;
+		param.string = strndup_user(_value, 256);
+		if (IS_ERR(param.string)) {
+			ret = PTR_ERR(param.string);
+			goto out_key;
+		}
+		param.size = strlen(param.string);
+		break;
+	case fsconfig_set_binary:
+		param.type = fs_value_is_blob;
+		param.size = aux;
+		param.blob = memdup_user_nul(_value, aux);
+		if (IS_ERR(param.blob)) {
+			ret = PTR_ERR(param.blob);
+			goto out_key;
+		}
+		break;
+	case fsconfig_set_path:
+		param.type = fs_value_is_filename;
+		param.name = getname_flags(_value, 0, NULL);
+		if (IS_ERR(param.name)) {
+			ret = PTR_ERR(param.name);
+			goto out_key;
+		}
+		param.dirfd = aux;
+		param.size = strlen(param.name->name);
+		break;
+	case fsconfig_set_path_empty:
+		param.type = fs_value_is_filename_empty;
+		param.name = getname_flags(_value, LOOKUP_EMPTY, NULL);
+		if (IS_ERR(param.name)) {
+			ret = PTR_ERR(param.name);
+			goto out_key;
+		}
+		param.dirfd = aux;
+		param.size = strlen(param.name->name);
+		break;
+	case fsconfig_set_fd:
+		param.type = fs_value_is_file;
+		ret = -EBADF;
+		param.file = fget(aux);
+		if (!param.file)
+			goto out_key;
+		break;
+	default:
+		break;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret == 0) {
+		switch (cmd) {
+		case fsconfig_cmd_create:
+		case fsconfig_cmd_reconfigure:
+			ret = vfs_fsconfig_action(fc, cmd);
+			break;
+		default:
+			ret = vfs_fsconfig(fc, &param);
+			break;
+		}
+		mutex_unlock(&fc->uapi_mutex);
+	}
+
+	/* Clean up our record of any value that we obtained from
+	 * userspace.  Note that the value may have been stolen by the LSM or
+	 * filesystem, in which case the value pointer will have been cleared.
+	 */
+	switch (cmd) {
+	case fsconfig_set_string:
+	case fsconfig_set_binary:
+		kfree(param.string);
+		break;
+	case fsconfig_set_path:
+	case fsconfig_set_path_empty:
+		if (param.name)
+			putname(param.name);
+		break;
+	case fsconfig_set_fd:
+		if (param.file)
+			fput(param.file);
+		break;
+	default:
+		break;
+	}
+out_key:
+	kfree(param.key);
+out_f:
+	fdput(f);
+	return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ad6c7ff33c01..9628d14a7ede 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -905,6 +905,8 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 			       int to_dfd, const char __user *to_path,
 			       unsigned int ms_flags);
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
+asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
+			     const void __user *value, int aux);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f8818e6cddd6..7c9e165e8689 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,4 +349,18 @@ typedef int __bitwise __kernel_rwf_t;
  */
 #define FSOPEN_CLOEXEC		0x00000001
 
+/*
+ * The type of fsconfig() call made.
+ */
+enum fsconfig_command {
+	fsconfig_set_flag,		/* Set parameter, supplying no value */
+	fsconfig_set_string,		/* Set parameter, supplying a string value */
+	fsconfig_set_binary,		/* Set parameter, supplying a binary blob value */
+	fsconfig_set_path,		/* Set parameter, supplying an object by path */
+	fsconfig_set_path_empty,	/* Set parameter, supplying an object by (empty) path */
+	fsconfig_set_fd,		/* Set parameter, supplying an object by fd */
+	fsconfig_cmd_create,		/* Invoke superblock creation */
+	fsconfig_cmd_reconfigure,	/* Invoke superblock reconfiguration */
+};
+
 #endif /* _UAPI_LINUX_FS_H */

* [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
                   ` (3 preceding siblings ...)
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
@ 2018-07-27 17:34 ` David Howells
  2018-07-27 19:27   ` Andy Lutomirski
  2018-07-27 22:06   ` David Howells
  2018-07-27 17:34 ` [PATCH 31/38] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
  6 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:34 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Provide a system call by which a filesystem opened with fsopen() and
configured by a series of fsconfig() calls can be mounted:

	int ret = fsmount(int fsfd, unsigned int flags,
			  unsigned int ms_flags);

where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
FSMOUNT_CLOEXEC.  ms_flags is a bitwise-OR of the following flags:

	MS_RDONLY
	MS_NOSUID
	MS_NODEV
	MS_NOEXEC
	MS_NOATIME
	MS_NODIRATIME
	MS_RELATIME
	MS_STRICTATIME

	MS_UNBINDABLE
	MS_PRIVATE
	MS_SLAVE
	MS_SHARED

In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd.  If no message is available, ENODATA
will be reported.
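
As a rough illustration of the flag handling, the following userspace sketch
mirrors the checks that the fsmount() code in this patch applies to ms_flags.
It follows the code, which accepts only the first eight flags listed above.
The SK_MNT_* values are placeholders chosen for the example (the kernel's
MNT_* values are internal) and sketch_ms_to_mnt() is a hypothetical name:

```c
#include <errno.h>
#include <sys/mount.h>	/* MS_* flag definitions */

/* Placeholder bits for illustration only; not the kernel's MNT_* values. */
#define SK_MNT_NOSUID		0x01
#define SK_MNT_NODEV		0x02
#define SK_MNT_NOEXEC		0x04
#define SK_MNT_NOATIME		0x08
#define SK_MNT_NODIRATIME	0x10
#define SK_MNT_RELATIME		0x20
#define SK_MNT_READONLY		0x40

/* Translate ms_flags the way the patch does: reject unknown bits, reject
 * MS_STRICTATIME combined with MS_NOATIME, and default to relatime when
 * neither strictatime nor noatime is requested.
 */
int sketch_ms_to_mnt(unsigned long ms_flags, unsigned int *mnt_flags)
{
	unsigned int m = 0;

	if (ms_flags & ~(unsigned long)(MS_RDONLY | MS_NOSUID | MS_NODEV |
					MS_NOEXEC | MS_NOATIME |
					MS_NODIRATIME | MS_RELATIME |
					MS_STRICTATIME))
		return -EINVAL;

	if (ms_flags & MS_RDONLY)
		m |= SK_MNT_READONLY;
	if (ms_flags & MS_NOSUID)
		m |= SK_MNT_NOSUID;
	if (ms_flags & MS_NODEV)
		m |= SK_MNT_NODEV;
	if (ms_flags & MS_NOEXEC)
		m |= SK_MNT_NOEXEC;
	if (ms_flags & MS_NODIRATIME)
		m |= SK_MNT_NODIRATIME;

	if (ms_flags & MS_STRICTATIME) {
		if (ms_flags & MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & MS_NOATIME) {
		m |= SK_MNT_NOATIME;
	} else {
		m |= SK_MNT_RELATIME;
	}

	*mnt_flags = m;		/* only written on success */
	return 0;
}
```

With this sketch, MS_NODEV alone yields the nodev and relatime bits, while
MS_STRICTATIME | MS_NOATIME is refused with -EINVAL.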

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/namespace.c                         |  140 +++++++++++++++++++++++++++++++-
 include/linux/syscalls.h               |    1 
 include/uapi/linux/fs.h                |    2 
 5 files changed, 141 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f9970310c126..c78b68256f8a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
 388	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
+391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4185d36e03bb..d44ead5d4368 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
 336	common	move_mount		__x64_sys_move_mount
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsconfig		__x64_sys_fsconfig
+339	common	fsmount			__x64_sys_fsmount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index ea07066a2731..b1661b90256d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2503,7 +2503,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 
 	attached = mnt_has_parent(old);
 	/*
-	 * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+	 * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
 	 * move_mount(), but mustn't allow "/" to be moved.
 	 */
 	if (old->mnt_ns && !attached)
@@ -3348,9 +3348,141 @@ struct vfsmount *kern_mount(struct file_system_type *type)
 EXPORT_SYMBOL_GPL(kern_mount);
 
 /*
- * Move a mount from one place to another.
- * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
- * used to copy a mount subtree.
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an open_tree-like file descriptor.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+{
+	struct fs_context *fc;
+	struct file *file;
+	struct path newmount;
+	struct fd f;
+	unsigned int mnt_flags = 0;
+	long ret;
+
+	if (!may_mount())
+		return -EPERM;
+
+	if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
+		return -EINVAL;
+
+	if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+			 MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+			 MS_STRICTATIME))
+		return -EINVAL;
+
+	if (ms_flags & MS_RDONLY)
+		mnt_flags |= MNT_READONLY;
+	if (ms_flags & MS_NOSUID)
+		mnt_flags |= MNT_NOSUID;
+	if (ms_flags & MS_NODEV)
+		mnt_flags |= MNT_NODEV;
+	if (ms_flags & MS_NOEXEC)
+		mnt_flags |= MNT_NOEXEC;
+	if (ms_flags & MS_NODIRATIME)
+		mnt_flags |= MNT_NODIRATIME;
+
+	if (ms_flags & MS_STRICTATIME) {
+		if (ms_flags & MS_NOATIME)
+			return -EINVAL;
+	} else if (ms_flags & MS_NOATIME) {
+		mnt_flags |= MNT_NOATIME;
+	} else {
+		mnt_flags |= MNT_RELATIME;
+	}
+
+	f = fdget(fs_fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (f.file->f_op != &fscontext_fops)
+		goto err_fsfd;
+
+	fc = f.file->private_data;
+
+	/* There must be a valid superblock or we can't mount it */
+	ret = -EINVAL;
+	if (!fc->root)
+		goto err_fsfd;
+
+	ret = -EPERM;
+	if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+		pr_warn("VFS: Mount too revealing\n");
+		goto err_fsfd;
+	}
+
+	ret = mutex_lock_interruptible(&fc->uapi_mutex);
+	if (ret < 0)
+		goto err_fsfd;
+
+	ret = -EBUSY;
+	if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+		goto err_unlock;
+
+	ret = -EPERM;
+	if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
+		goto err_unlock;
+
+	newmount.mnt = vfs_create_mount(fc, mnt_flags);
+	if (IS_ERR(newmount.mnt)) {
+		ret = PTR_ERR(newmount.mnt);
+		goto err_unlock;
+	}
+	newmount.dentry = dget(fc->root);
+
+	/* We've done the mount bit - now move the file context into more or
+	 * less the same state as if we'd done an fspick().  We don't want to
+	 * do any memory allocation or anything like that at this point as we
+	 * don't want to have to handle any errors incurred.
+	 */
+	if (fc->ops && fc->ops->free)
+		fc->ops->free(fc);
+	fc->fs_private = NULL;
+	fc->s_fs_info = NULL;
+	fc->sb_flags = 0;
+	fc->sloppy = false;
+	fc->silent = false;
+	security_fs_context_free(fc);
+	fc->security = NULL;
+	kfree(fc->subtype);
+	fc->subtype = NULL;
+	kfree(fc->source);
+	fc->source = NULL;
+
+	fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+	fc->phase = FS_CONTEXT_AWAITING_RECONF;
+
+	/* Attach to an apparent O_PATH fd with a note that we need to unmount
+	 * it, not just simply put it.
+	 */
+	file = dentry_open(&newmount, O_PATH, fc->cred);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto err_path;
+	}
+	file->f_mode |= FMODE_NEED_UNMOUNT;
+
+	ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+	if (ret >= 0)
+		fd_install(ret, file);
+	else
+		fput(file);
+
+err_path:
+	path_put(&newmount);
+err_unlock:
+	mutex_unlock(&fc->uapi_mutex);
+err_fsfd:
+	fdput(f);
+	return ret;
+}
+
+/*
+ * Move a mount from one place to another.  In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
+ * a mount subtree.
  *
  * Note the flags value is a combination of MOVE_MOUNT_* flags.
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 9628d14a7ede..65db661cc2da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
 asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 7c9e165e8689..297362908d01 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,6 +349,8 @@ typedef int __bitwise __kernel_rwf_t;
  */
 #define FSOPEN_CLOEXEC		0x00000001
 
+#define FSMOUNT_CLOEXEC		0x00000001
+
 /*
  * The type of fsconfig() call made.
  */

* [PATCH 31/38] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
                   ` (4 preceding siblings ...)
  2018-07-27 17:34 ` [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
@ 2018-07-27 17:34 ` David Howells
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
  6 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:34 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).

This looks like:

	int fd = fspick(AT_FDCWD, "/mnt",
			FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
	fsconfig(fd, fsconfig_set_flag, "intr", NULL, 0);
	fsconfig(fd, fsconfig_set_flag, "noac", NULL, 0);
	fsconfig(fd, fsconfig_cmd_reconfigure, NULL, NULL, 0);

When fspick() returns, the file descriptor referring to the filesystem
context is in exactly the same state as one that was created by fsopen()
after fsmount() has been successfully called.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 fs/fsopen.c                            |   58 ++++++++++++++++++++++++++++++++
 include/linux/syscalls.h               |    1 +
 include/uapi/linux/fs.h                |    5 +++
 5 files changed, 66 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c78b68256f8a..d1eb6c815790 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -403,3 +403,4 @@
 389	i386	fsopen			sys_fsopen			__ia32_sys_fsopen
 390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
+392	i386	fspick			sys_fspick			__ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d44ead5d4368..d3ab703c02bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -348,6 +348,7 @@
 337	common	fsopen			__x64_sys_fsopen
 338	common	fsconfig		__x64_sys_fsconfig
 339	common	fsmount			__x64_sys_fsmount
+340	common	fspick			__x64_sys_fspick
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index d2d23c02839a..51ce50904988 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -155,6 +155,64 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
 	return ret;
 }
 
+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+	struct fs_context *fc;
+	struct path target;
+	unsigned int lookup_flags;
+	int ret;
+
+	if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if ((flags & ~(FSPICK_CLOEXEC |
+		       FSPICK_SYMLINK_NOFOLLOW |
+		       FSPICK_NO_AUTOMOUNT |
+		       FSPICK_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & FSPICK_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (flags & FSPICK_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+	ret = user_path_at(dfd, path, lookup_flags, &target);
+	if (ret < 0)
+		goto err;
+
+	ret = -EOPNOTSUPP;
+	if (!target.dentry->d_sb->s_op->reconfigure)
+		goto err_path;
+
+	fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+				0, FS_CONTEXT_FOR_RECONFIGURE);
+	if (IS_ERR(fc)) {
+		ret = PTR_ERR(fc);
+		goto err_path;
+	}
+
+	fc->phase = FS_CONTEXT_RECONF_PARAMS;
+
+	ret = fscontext_alloc_log(fc);
+	if (ret < 0)
+		goto err_fc;
+
+	path_put(&target);
+	return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+	put_fs_context(fc);
+err_path:
+	path_put(&target);
+err:
+	return ret;
+}
+
 /*
  * Check the state and apply the configuration.  Note that this function is
  * allowed to 'steal' the value by setting param->xxx to NULL before returning.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 65db661cc2da..701522957a12 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -908,6 +908,7 @@ asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
 asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
 			     const void __user *value, int aux);
 asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
+asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 297362908d01..be70cbac21b4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;
 
 #define FSMOUNT_CLOEXEC		0x00000001
 
+#define FSPICK_CLOEXEC		0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW	0x00000002
+#define FSPICK_NO_AUTOMOUNT	0x00000004
+#define FSPICK_EMPTY_PATH	0x00000008
+
 /*
  * The type of fsconfig() call made.
  */

* [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:31 [PATCH 00/38] VFS: Introduce filesystem context [ver #10] David Howells
                   ` (5 preceding siblings ...)
  2018-07-27 17:34 ` [PATCH 31/38] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-07-27 17:35 ` David Howells
  2018-07-27 19:35   ` Andy Lutomirski
                     ` (10 more replies)
  6 siblings, 11 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 17:35 UTC (permalink / raw)
  To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel

Add a system call to allow filesystem information to be queried.  A request
value can be given to indicate the desired attribute.  Support is provided
for enumerating multi-value attributes.

===============
NEW SYSTEM CALL
===============

The new system call looks like:

	int ret = fsinfo(int dfd,
			 const char *filename,
			 const struct fsinfo_params *params,
			 void *buffer,
			 size_t buf_size);

The params parameter optionally points to a block of parameters:

	struct fsinfo_params {
		__u32	at_flags;
		__u32	request;
		__u32	Nth;
		__u32	Mth;
		__u32	__reserved[6];
	};

If params is NULL, it is assumed params->request should be
fsinfo_attr_statfs, params->Nth should be 0, params->Mth should be 0 and
params->at_flags should be 0.

If params is given, all of params->__reserved[] must be 0.
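
The defaulting rules above could be sketched in userspace as follows.  This
is only an illustration: the struct is a local copy of the layout quoted
above, the value 0 for the statfs request and the -EINVAL result for non-zero
reserved words are assumptions, and sketch_resolve_params() is a hypothetical
helper rather than anything in the patch:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout described above, renamed so that it cannot
 * clash with a real header.
 */
struct sketch_fsinfo_params {
	uint32_t at_flags;
	uint32_t request;
	uint32_t Nth;
	uint32_t Mth;
	uint32_t reserved[6];
};

/* Assumption: the statfs attribute is request number 0. */
#define SKETCH_ATTR_STATFS 0u

/* Work out the effective parameters: a NULL pointer selects the statfs
 * request with everything else zeroed; otherwise every reserved word
 * must be zero.
 */
int sketch_resolve_params(const struct sketch_fsinfo_params *in,
			  struct sketch_fsinfo_params *out)
{
	size_t i;

	if (!in) {
		memset(out, 0, sizeof(*out));
		out->request = SKETCH_ATTR_STATFS;
		return 0;
	}

	for (i = 0; i < 6; i++)
		if (in->reserved[i] != 0)
			return -EINVAL;	/* assumed error code */

	*out = *in;
	return 0;
}
```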

dfd, filename and params->at_flags indicate the file to query.  There is no
equivalent of lstat() as that can be emulated with fsinfo() by setting
AT_SYMLINK_NOFOLLOW in params->at_flags.  There is also no equivalent of
fstat() as that can be emulated by passing a NULL filename to fsinfo() with
the fd of interest in dfd.  AT_NO_AUTOMOUNT can also be used to allow an
automount point to be queried without triggering it.

params->request indicates the attribute/attributes to be queried.  This can
be one of:

	fsinfo_attr_statfs		- statfs-style info
	fsinfo_attr_fsinfo		- Information about fsinfo()
	fsinfo_attr_ids			- Filesystem IDs
	fsinfo_attr_limits		- Filesystem limits
	fsinfo_attr_supports		- What's supported in statx(), IOC flags
	fsinfo_attr_capabilities	- Filesystem capabilities
	fsinfo_attr_timestamp_info	- Inode timestamp info
	fsinfo_attr_volume_id		- Volume ID (string)
	fsinfo_attr_volume_uuid		- Volume UUID
	fsinfo_attr_volume_name		- Volume name (string)
	fsinfo_attr_cell_name		- Cell name (string)
	fsinfo_attr_domain_name		- Domain name (string)
	fsinfo_attr_realm_name		- Realm name (string)
	fsinfo_attr_server_name		- Name of the Nth server (string)
	fsinfo_attr_server_address	- Mth address of the Nth server
	fsinfo_attr_parameter		- Nth mount parameter (string)
	fsinfo_attr_source		- Nth mount source name (string)
	fsinfo_attr_name_encoding	- Filename encoding (string)
	fsinfo_attr_name_codepage	- Filename codepage (string)
	fsinfo_attr_io_size		- I/O size hints

Some attributes (such as the servers backing a network filesystem) can have
multiple values.  These can be enumerated by setting params->Nth and
params->Mth to 0, 1, ... until ENODATA is returned.
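
A caller might walk such an attribute along these lines.  This is a minimal
sketch, not part of the patch: the query is abstracted behind a hypothetical
callback so the loop can be shown without the real system call, and
sketch_fake_servers() is a stand-in that pretends three server names exist:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical query callback: returns the reply size for the Nth value,
 * or a negative errno (-ENODATA once the values run out).
 */
typedef int (*sketch_query_fn)(unsigned int nth, char *buf, size_t size);

/* Count the values of a multi-value attribute by trying Nth = 0, 1, ...
 * until -ENODATA is seen.  Any other error is passed back to the caller.
 */
int sketch_count_values(sketch_query_fn query)
{
	char buf[256];
	unsigned int nth;
	int ret;

	for (nth = 0; ; nth++) {
		ret = query(nth, buf, sizeof(buf));
		if (ret == -ENODATA)
			return (int)nth;
		if (ret < 0)
			return ret;
	}
}

/* Stand-in for a filesystem with three servers. */
int sketch_fake_servers(unsigned int nth, char *buf, size_t size)
{
	static const char *const names[] = { "alpha", "beta", "gamma" };
	size_t len;

	if (nth >= 3)
		return -ENODATA;
	len = strlen(names[nth]);
	if (buf && size)
		memcpy(buf, names[nth], len < size ? len : size);
	return (int)len;
}
```

For an attribute indexed by both Nth and Mth, the same loop would be nested,
with the inner loop advancing Mth for each Nth.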

buffer and buf_size point to the reply buffer.  The buffer is filled up to
the specified size, even if this means truncating the reply.  The full size
of the reply is returned.  In future versions, this will allow extra fields
to be tacked on to the end of the reply, but anyone not expecting them will
only get the subset they're expecting.  If either buffer or buf_size is 0,
no copy will take place and the data size will be returned.
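
The truncate-but-report-full-size behaviour described above can be modelled
in a few lines.  The helper below is a hypothetical userspace sketch of that
rule only; it is not the kernel's copy-out code:

```c
#include <stddef.h>
#include <string.h>

/* Copy as much of the reply as fits and always report the full reply
 * size, so that a caller can detect truncation or size a second call.
 * A NULL buffer or zero size skips the copy entirely.
 */
size_t sketch_copy_reply(void *buffer, size_t buf_size,
			 const void *reply, size_t reply_size)
{
	if (buffer && buf_size) {
		size_t n = buf_size < reply_size ? buf_size : reply_size;

		memcpy(buffer, reply, n);
	}
	return reply_size;
}
```

A caller that gets back a value larger than buf_size knows that the reply
was truncated and can retry with a bigger buffer.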

At the moment, this will only work on x86_64 and i386 as it requires the
system call to be wired up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---

 arch/x86/entry/syscalls/syscall_32.tbl |    1 
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 fs/statfs.c                            |  464 ++++++++++++++++++++++++++++
 include/linux/fs.h                     |    4 
 include/linux/fsinfo.h                 |   40 ++
 include/linux/syscalls.h               |    4 
 include/uapi/linux/fsinfo.h            |  234 ++++++++++++++
 samples/statx/Makefile                 |    5 
 samples/statx/test-fsinfo.c            |  539 ++++++++++++++++++++++++++++++++
 9 files changed, 1291 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/fsinfo.h
 create mode 100644 include/uapi/linux/fsinfo.h
 create mode 100644 samples/statx/test-fsinfo.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d1eb6c815790..806760188a31 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -404,3 +404,4 @@
 390	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 391	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 392	i386	fspick			sys_fspick			__ia32_sys_fspick
+393	i386	fsinfo			sys_fsinfo			__ia32_sys_fsinfo
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d3ab703c02bb..0823eed2b02e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -349,6 +349,7 @@
 338	common	fsconfig		__x64_sys_fsconfig
 339	common	fsmount			__x64_sys_fsmount
 340	common	fspick			__x64_sys_fspick
+341	common	fsinfo			__x64_sys_fsinfo
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/statfs.c b/fs/statfs.c
index 5b2a24f0f263..caf0773957e9 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -9,6 +9,7 @@
 #include <linux/security.h>
 #include <linux/uaccess.h>
 #include <linux/compat.h>
+#include <linux/fsinfo.h>
 #include "internal.h"
 
 static int flags_by_mnt(int mnt_flags)
@@ -384,3 +385,466 @@ COMPAT_SYSCALL_DEFINE2(ustat, unsigned, dev, struct compat_ustat __user *, u)
 	return 0;
 }
 #endif
+
+/*
+ * Get basic filesystem stats from statfs.
+ */
+static int fsinfo_generic_statfs(struct dentry *dentry,
+				 struct fsinfo_statfs *p)
+{
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = statfs_by_dentry(dentry, &buf);
+	if (ret < 0)
+		return ret;
+
+	sb = dentry->d_sb;
+	p->f_blocks	= buf.f_blocks;
+	p->f_bfree	= buf.f_bfree;
+	p->f_bavail	= buf.f_bavail;
+	p->f_files	= buf.f_files;
+	p->f_ffree	= buf.f_ffree;
+	p->f_favail	= buf.f_ffree;
+	p->f_bsize	= buf.f_bsize;
+	p->f_frsize	= buf.f_frsize;
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_ids(struct dentry *dentry,
+			      struct fsinfo_ids *p)
+{
+	struct super_block *sb;
+	struct kstatfs buf;
+	int ret;
+
+	ret = statfs_by_dentry(dentry, &buf);
+	if (ret < 0)
+		return ret;
+
+	sb = dentry->d_sb;
+	p->f_fstype	= sb->s_magic;
+	p->f_dev_major	= MAJOR(sb->s_dev);
+	p->f_dev_minor	= MINOR(sb->s_dev);
+	p->f_flags	= ST_VALID | flags_by_sb(sb->s_flags);
+
+	memcpy(&p->f_fsid, &buf.f_fsid, sizeof(p->f_fsid));
+	strcpy(p->f_fs_name, dentry->d_sb->s_type->name);
+	return sizeof(*p);
+}
+
+static int fsinfo_generic_limits(struct dentry *dentry,
+				 struct fsinfo_limits *lim)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	lim->max_file_size = sb->s_maxbytes;
+	lim->max_hard_links = sb->s_max_links;
+	lim->max_uid = UINT_MAX;
+	lim->max_gid = UINT_MAX;
+	lim->max_projid = UINT_MAX;
+	lim->max_filename_len = NAME_MAX;
+	lim->max_symlink_len = PAGE_SIZE;
+	lim->max_xattr_name_len = XATTR_NAME_MAX;
+	lim->max_xattr_body_len = XATTR_SIZE_MAX;
+	lim->max_dev_major = 0xffffff;
+	lim->max_dev_minor = 0xff;
+	return sizeof(*lim);
+}
+
+static int fsinfo_generic_supports(struct dentry *dentry,
+				   struct fsinfo_supports *c)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	c->stx_mask = STATX_BASIC_STATS;
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		c->stx_attributes |= STATX_ATTR_AUTOMOUNT;
+	return sizeof(*c);
+}
+
+static int fsinfo_generic_capabilities(struct dentry *dentry,
+				       struct fsinfo_capabilities *c)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	if (sb->s_mtd)
+		fsinfo_set_cap(c, fsinfo_cap_is_flash_fs);
+	else if (sb->s_bdev)
+		fsinfo_set_cap(c, fsinfo_cap_is_block_fs);
+
+	if (sb->s_quota_types & QTYPE_MASK_USR)
+		fsinfo_set_cap(c, fsinfo_cap_user_quotas);
+	if (sb->s_quota_types & QTYPE_MASK_GRP)
+		fsinfo_set_cap(c, fsinfo_cap_group_quotas);
+	if (sb->s_quota_types & QTYPE_MASK_PRJ)
+		fsinfo_set_cap(c, fsinfo_cap_project_quotas);
+	if (sb->s_d_op && sb->s_d_op->d_automount)
+		fsinfo_set_cap(c, fsinfo_cap_automounts);
+	if (sb->s_id[0])
+		fsinfo_set_cap(c, fsinfo_cap_volume_id);
+
+	fsinfo_set_cap(c, fsinfo_cap_has_atime);
+	fsinfo_set_cap(c, fsinfo_cap_has_ctime);
+	fsinfo_set_cap(c, fsinfo_cap_has_mtime);
+	return sizeof(*c);
+}
+
+static int fsinfo_generic_timestamp_info(struct dentry *dentry,
+					 struct fsinfo_timestamp_info *ts)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	/* If unset, assume 1s granularity */
+	u16 mantissa = 1;
+	s8 exponent = 0;
+
+	ts->minimum_timestamp = S64_MIN;
+	ts->maximum_timestamp = S64_MAX;
+	if (sb->s_time_gran < 1000000000) {
+		if (sb->s_time_gran < 1000)
+			exponent = -9;
+		else if (sb->s_time_gran < 1000000)
+			exponent = -6;
+		else
+			exponent = -3;
+	}
+#define set_gran(x)				\
+	do {					\
+		ts->x##_mantissa = mantissa;	\
+		ts->x##_exponent = exponent;	\
+	} while (0)
+	set_gran(atime_gran);
+	set_gran(btime_gran);
+	set_gran(ctime_gran);
+	set_gran(mtime_gran);
+	return sizeof(*ts);
+}
+
+static int fsinfo_generic_volume_uuid(struct dentry *dentry,
+				      struct fsinfo_volume_uuid *vu)
+{
+	struct super_block *sb = dentry->d_sb;
+
+	memcpy(vu, &sb->s_uuid, sizeof(*vu));
+	return sizeof(*vu);
+}
+
+static int fsinfo_generic_volume_id(struct dentry *dentry, char *buf)
+{
+	struct super_block *sb = dentry->d_sb;
+	size_t len = strlen(sb->s_id);
+
+	if (buf)
+		memcpy(buf, sb->s_id, len + 1);
+	return len;
+}
+
+static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
+{
+	static const char encoding[] = "utf8";
+
+	if (buf)
+		memcpy(buf, encoding, sizeof(encoding) - 1);
+	return sizeof(encoding) - 1;
+}
+
+static int fsinfo_generic_io_size(struct dentry *dentry,
+				  struct fsinfo_io_size *c)
+{
+	struct super_block *sb = dentry->d_sb;
+	struct kstatfs buf;
+	int ret;
+
+	if (sb->s_op->statfs == simple_statfs) {
+		c->dio_size_gran = 1;
+		c->dio_mem_align = 1;
+	} else {
+		ret = statfs_by_dentry(dentry, &buf);
+		if (ret < 0)
+			return ret;
+		c->dio_size_gran = buf.f_bsize;
+		c->dio_mem_align = buf.f_bsize;
+	}
+	return sizeof(*c);
+}
+
+/*
+ * Implement some queries generically from stuff in the superblock.
+ */
+int generic_fsinfo(struct dentry *dentry, struct fsinfo_kparams *params)
+{
+#define _gen(X) fsinfo_attr_##X: return fsinfo_generic_##X(dentry, params->buffer)
+
+	switch (params->request) {
+	case _gen(statfs);
+	case _gen(ids);
+	case _gen(limits);
+	case _gen(supports);
+	case _gen(capabilities);
+	case _gen(timestamp_info);
+	case _gen(volume_uuid);
+	case _gen(volume_id);
+	case _gen(name_encoding);
+	case _gen(io_size);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+EXPORT_SYMBOL(generic_fsinfo);
+
+/*
+ * Retrieve the filesystem info.  We make some stuff up if the operation is not
+ * supported.
+ */
+int vfs_fsinfo(const struct path *path, struct fsinfo_kparams *params)
+{
+	struct dentry *dentry = path->dentry;
+	int (*get_fsinfo)(struct dentry *, struct fsinfo_kparams *);
+	int ret;
+
+	if (params->request == fsinfo_attr_fsinfo) {
+		struct fsinfo_fsinfo *info = params->buffer;
+
+		info->max_attr	= fsinfo_attr__nr;
+		info->max_cap	= fsinfo_cap__nr;
+		return sizeof(*info);
+	}
+
+	get_fsinfo = dentry->d_sb->s_op->get_fsinfo;
+	if (!get_fsinfo) {
+		if (!dentry->d_sb->s_op->statfs)
+			return -EOPNOTSUPP;
+		get_fsinfo = generic_fsinfo;
+	}
+
+	ret = security_sb_statfs(dentry);
+	if (ret)
+		return ret;
+
+	ret = get_fsinfo(dentry, params);
+	if (ret < 0)
+		return ret;
+
+	if (params->request == fsinfo_attr_ids &&
+	    params->buffer) {
+		struct fsinfo_ids *p = params->buffer;
+
+		p->f_flags |= flags_by_mnt(path->mnt->mnt_flags);
+	}
+	return ret;
+}
+
+static int vfs_fsinfo_path(int dfd, const char __user *filename,
+			   struct fsinfo_kparams *params)
+{
+	struct path path;
+	unsigned lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+	int ret = -EINVAL;
+
+	if ((params->at_flags & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
+				 AT_EMPTY_PATH)) != 0)
+		return -EINVAL;
+
+	if (params->at_flags & AT_SYMLINK_NOFOLLOW)
+		lookup_flags &= ~LOOKUP_FOLLOW;
+	if (params->at_flags & AT_NO_AUTOMOUNT)
+		lookup_flags &= ~LOOKUP_AUTOMOUNT;
+	if (params->at_flags & AT_EMPTY_PATH)
+		lookup_flags |= LOOKUP_EMPTY;
+
+retry:
+	ret = user_path_at(dfd, filename, lookup_flags, &path);
+	if (ret)
+		goto out;
+
+	ret = vfs_fsinfo(&path, params);
+	path_put(&path);
+	if (retry_estale(ret, lookup_flags)) {
+		lookup_flags |= LOOKUP_REVAL;
+		goto retry;
+	}
+out:
+	return ret;
+}
+
+static int vfs_fsinfo_fd(unsigned int fd, struct fsinfo_kparams *params)
+{
+	struct fd f = fdget_raw(fd);
+	int ret = -EBADF;
+
+	if (f.file) {
+		ret = vfs_fsinfo(&f.file->f_path, params);
+		fdput(f);
+	}
+	return ret;
+}
+
+/*
+ * Return buffer information by requestable attribute.
+ *
+ * STRUCT indicates a fixed-size structure with only one instance.
+ * STRUCT_N indicates a fixed-size structure that may have multiple instances.
+ * STRUCT_NM indicates a fixed-size structure that may have multiple instances,
+ * each of which may have multiple subinstances.
+ * STRING indicates a string with only one instance.
+ * STRING_N indicates a string that may have multiple instances.
+ *
+ * If an entry is marked STRUCT, STRUCT_N or STRUCT_NM then if no buffer is
+ * supplied to sys_fsinfo(), sys_fsinfo() will handle returning the buffer size
+ * without calling vfs_fsinfo() and the filesystem.
+ *
+ * No struct may be larger than 16383 bytes (ie. 0x3fff)
+ */
+#define FSINFO_STRING(N)	 [fsinfo_attr_##N] = 0x0000
+#define FSINFO_STRUCT(N)	 [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)
+#define FSINFO_STRING_N(N)	 [fsinfo_attr_##N] = 0x4000
+#define FSINFO_STRUCT_N(N)	 [fsinfo_attr_##N] = 0x4000 | sizeof(struct fsinfo_##N)
+#define FSINFO_STRUCT_NM(N)	 [fsinfo_attr_##N] = 0x8000 | sizeof(struct fsinfo_##N)
+static const u16 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+	FSINFO_STRUCT		(statfs),
+	FSINFO_STRUCT		(fsinfo),
+	FSINFO_STRUCT		(ids),
+	FSINFO_STRUCT		(limits),
+	FSINFO_STRUCT		(capabilities),
+	FSINFO_STRUCT		(supports),
+	FSINFO_STRUCT		(timestamp_info),
+	FSINFO_STRING		(volume_id),
+	FSINFO_STRUCT		(volume_uuid),
+	FSINFO_STRING		(volume_name),
+	FSINFO_STRING		(cell_name),
+	FSINFO_STRING		(domain_name),
+	FSINFO_STRING		(realm_name),
+	FSINFO_STRING_N		(server_name),
+	FSINFO_STRUCT_NM	(server_address),
+	FSINFO_STRING_N		(parameter),
+	FSINFO_STRING_N		(source),
+	FSINFO_STRING		(name_encoding),
+	FSINFO_STRING		(name_codepage),
+	FSINFO_STRUCT		(io_size),
+};
+
+/**
+ * sys_fsinfo - System call to get filesystem information
+ * @dfd: Base directory to pathwalk from or fd referring to filesystem.
+ * @filename: Filesystem to query or NULL.
+ * @_params: Parameters to define request (or NULL for enhanced statfs).
+ * @_buffer: Result buffer.
+ * @buf_size: Size of result buffer.
+ *
+ * Get information on a filesystem.  The filesystem attribute to be queried is
+ * indicated by @_params->request, and some of the attributes can have multiple
+ * values, indexed by @_params->Nth and @_params->Mth.  If @_params is NULL,
+ * then the 0th fsinfo_attr_statfs attribute is queried.  If an attribute does
+ * not exist, EOPNOTSUPP is returned; if the Nth,Mth value does not exist,
+ * ENODATA is returned.
+ *
+ * On success, the size of the attribute's value is returned.  If @buf_size is
+ * 0 or @_buffer is NULL, only the size is returned.  If the size of the value
+ * is larger than @buf_size, it will be truncated by the copy.  If the size of
+ * the value is smaller than @buf_size then the excess buffer space will be
+ * cleared.  The full size of the value will be returned, irrespective of how
+ * much data is actually placed in the buffer.
+ */
+SYSCALL_DEFINE5(fsinfo,
+		int, dfd, const char __user *, filename,
+		struct fsinfo_params __user *, _params,
+		void __user *, _buffer, size_t, buf_size)
+{
+	struct fsinfo_params user_params;
+	struct fsinfo_kparams params;
+	size_t size;
+	int ret;
+
+	if (_params) {
+		if (copy_from_user(&user_params, _params, sizeof(user_params)))
+			return -EFAULT;
+		if (user_params.__reserved[0] ||
+		    user_params.__reserved[1] ||
+		    user_params.__reserved[2] ||
+		    user_params.__reserved[3] ||
+		    user_params.__reserved[4] ||
+		    user_params.__reserved[5])
+			return -EINVAL;
+		if (user_params.request >= fsinfo_attr__nr)
+			return -EOPNOTSUPP;
+		params.at_flags = user_params.at_flags;
+		params.request = user_params.request;
+		params.Nth = user_params.Nth;
+		params.Mth = user_params.Mth;
+	} else {
+		params.at_flags = 0;
+		params.request = fsinfo_attr_statfs;
+		params.Nth = 0;
+		params.Mth = 0;
+	}
+
+	if (!_buffer || !buf_size) {
+		buf_size = 0;
+		_buffer = NULL;
+	}
+
+	/* Allocate an appropriately-sized buffer.  We will truncate the
+	 * contents when we write the contents back to userspace.
+	 */
+	size = fsinfo_buffer_sizes[params.request];
+	switch (size & 0xc000) {
+	case 0x0000:
+		if (params.Nth != 0)
+			return -ENODATA;
+		/* Fall through */
+	case 0x4000:
+		if (params.Mth != 0)
+			return -ENODATA;
+		/* Fall through */
+	case 0x8000:
+		break;
+	case 0xc000:
+		return -ENOBUFS;
+	}
+
+	size &= ~0xc000;
+	if (size == 0) {
+		size = 4096; /* String */
+	} else {
+		if (buf_size == 0)
+			return size; /* We know how big the buffer should be */
+
+		/* Clear any part of the buffer that we won't fill. */
+		if (buf_size > size &&
+		    clear_user(_buffer, buf_size) != 0)
+			return -EFAULT;
+	}
+
+	if (buf_size > 0) {
+		params.buf_size = size;
+		params.buffer = kzalloc(size, GFP_KERNEL);
+		if (!params.buffer)
+			return -ENOMEM;
+	} else {
+		params.buf_size = 0;
+		params.buffer = NULL;
+	}
+
+	if (filename)
+		ret = vfs_fsinfo_path(dfd, filename, &params);
+	else
+		ret = vfs_fsinfo_fd(dfd, &params);
+	if (ret < 0)
+		goto error;
+
+	if (ret == 0) {
+		ret = -ENODATA;
+		goto error;
+	}
+
+	if (buf_size > ret)
+		buf_size = ret;
+
+	if (copy_to_user(_buffer, params.buffer, buf_size))
+		ret = -EFAULT;
+error:
+	kfree(params.buffer);
+	return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e661d033163..053d53861995 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -64,6 +64,8 @@ struct fscrypt_operations;
 struct fs_context;
 struct fsconfig_parser;
 struct fsconfig_param;
+struct fsinfo_kparams;
+enum fsinfo_attribute;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1849,6 +1851,7 @@ struct super_operations {
 	int (*thaw_super) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
 	int (*statfs) (struct dentry *, struct kstatfs *);
+	int (*get_fsinfo) (struct dentry *, struct fsinfo_kparams *);
 	int (*remount_fs) (struct super_block *, int *, char *, size_t);
 	int (*reconfigure) (struct super_block *, struct fs_context *);
 	void (*umount_begin) (struct super_block *);
@@ -2226,6 +2229,7 @@ extern int iterate_mounts(int (*)(struct vfsmount *, void *), void *,
 extern int vfs_statfs(const struct path *, struct kstatfs *);
 extern int user_statfs(const char __user *, struct kstatfs *);
 extern int fd_statfs(int, struct kstatfs *);
+extern int vfs_fsinfo(const struct path *, struct fsinfo_kparams *);
 extern int freeze_super(struct super_block *super);
 extern int thaw_super(struct super_block *super);
 extern bool our_mnt(struct vfsmount *mnt);
diff --git a/include/linux/fsinfo.h b/include/linux/fsinfo.h
new file mode 100644
index 000000000000..c356391b4b2a
--- /dev/null
+++ b/include/linux/fsinfo.h
@@ -0,0 +1,40 @@
+/* Filesystem information query
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_FSINFO_H
+#define _LINUX_FSINFO_H
+
+#include <uapi/linux/fsinfo.h>
+
+struct fsinfo_kparams {
+	__u32			at_flags;	/* AT_SYMLINK_NOFOLLOW and similar */
+	enum fsinfo_attribute	request;	/* What is being asked for */
+	__u32			Nth;		/* Instance of it (some may have multiple) */
+	__u32			Mth;		/* Subinstance */
+	void			*buffer;	/* Where to place the reply */
+	size_t			buf_size;	/* Size of the buffer */
+};
+
+extern int generic_fsinfo(struct dentry *, struct fsinfo_kparams *);
+
+static inline void fsinfo_set_cap(struct fsinfo_capabilities *c,
+				  enum fsinfo_capability cap)
+{
+	c->capabilities[cap / 8] |= 1 << (cap % 8);
+}
+
+static inline void fsinfo_clear_cap(struct fsinfo_capabilities *c,
+				    enum fsinfo_capability cap)
+{
+	c->capabilities[cap / 8] &= ~(1 << (cap % 8));
+}
+
+#endif /* _LINUX_FSINFO_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 701522957a12..bc7173c09f4d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -49,6 +49,7 @@ struct stat64;
 struct statfs;
 struct statfs64;
 struct statx;
+struct fsinfo_params;
 struct __sysctl_args;
 struct sysinfo;
 struct timespec;
@@ -909,6 +910,9 @@ asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key
 			     const void __user *value, int aux);
 asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
 asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
+asmlinkage long sys_fsinfo(int dfd, const char __user *path,
+			   struct fsinfo_params __user *params,
+			   void __user *buffer, size_t buf_size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
new file mode 100644
index 000000000000..abcf414dd3be
--- /dev/null
+++ b/include/uapi/linux/fsinfo.h
@@ -0,0 +1,234 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* fsinfo() definitions.
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+#ifndef _UAPI_LINUX_FSINFO_H
+#define _UAPI_LINUX_FSINFO_H
+
+#include <linux/types.h>
+#include <linux/socket.h>
+
+/*
+ * The filesystem attributes that can be requested.  Note that some attributes
+ * may have multiple instances which can be switched in the parameter block.
+ */
+enum fsinfo_attribute {
+	fsinfo_attr_statfs		= 0,	/* statfs()-style state */
+	fsinfo_attr_fsinfo		= 1,	/* Information about fsinfo() */
+	fsinfo_attr_ids			= 2,	/* Filesystem IDs */
+	fsinfo_attr_limits		= 3,	/* Filesystem limits */
+	fsinfo_attr_supports		= 4,	/* What's supported in statx, iocflags, ... */
+	fsinfo_attr_capabilities	= 5,	/* Filesystem capabilities (bits) */
+	fsinfo_attr_timestamp_info	= 6,	/* Inode timestamp info */
+	fsinfo_attr_volume_id		= 7,	/* Volume ID (string) */
+	fsinfo_attr_volume_uuid		= 8,	/* Volume UUID (LE uuid) */
+	fsinfo_attr_volume_name		= 9,	/* Volume name (string) */
+	fsinfo_attr_cell_name		= 10,	/* Cell name (string) */
+	fsinfo_attr_domain_name		= 11,	/* Domain name (string) */
+	fsinfo_attr_realm_name		= 12,	/* Realm name (string) */
+	fsinfo_attr_server_name		= 13,	/* Name of the Nth server */
+	fsinfo_attr_server_address	= 14,	/* Mth address of the Nth server */
+	fsinfo_attr_parameter		= 15,	/* Nth mount parameter (string) */
+	fsinfo_attr_source		= 16,	/* Nth mount source name (string) */
+	fsinfo_attr_name_encoding	= 17,	/* Filename encoding (string) */
+	fsinfo_attr_name_codepage	= 18,	/* Filename codepage (string) */
+	fsinfo_attr_io_size		= 19,	/* Optimal I/O sizes */
+	fsinfo_attr__nr
+};
+
+/*
+ * Optional fsinfo() parameter structure.
+ *
+ * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
+ * desired.
+ */
+struct fsinfo_params {
+	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
+	__u32	request;	/* What is being asked for (enum fsinfo_attribute) */
+	__u32	Nth;		/* Instance of it (some may have multiple) */
+	__u32	Mth;		/* Subinstance of Nth instance */
+	__u32	__reserved[6];	/* Reserved params; all must be 0 */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_statfs).
+ * - This gives extended filesystem information.
+ */
+struct fsinfo_statfs {
+	__u64	f_blocks;	/* Total number of blocks in fs */
+	__u64	f_bfree;	/* Total number of free blocks */
+	__u64	f_bavail;	/* Number of free blocks available to ordinary user */
+	__u64	f_files;	/* Total number of file nodes in fs */
+	__u64	f_ffree;	/* Number of free file nodes */
+	__u64	f_favail;	/* Number of free file nodes available to ordinary user */
+	__u32	f_bsize;	/* Optimal block size */
+	__u32	f_frsize;	/* Fragment size */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_ids).
+ *
+ * List of basic identifiers as is normally found in statfs().
+ */
+struct fsinfo_ids {
+	char	f_fs_name[15 + 1];
+	__u64	f_flags;	/* Filesystem mount flags (MS_*) */
+	__u64	f_fsid;		/* Short 64-bit Filesystem ID (as statfs) */
+	__u64	f_sb_id;	/* Internal superblock ID for sbnotify()/mntnotify() */
+	__u32	f_fstype;	/* Filesystem type from linux/magic.h [uncond] */
+	__u32	f_dev_major;	/* As st_dev_* from struct statx [uncond] */
+	__u32	f_dev_minor;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_limits).
+ *
+ * List of supported filesystem limits.
+ */
+struct fsinfo_limits {
+	__u64	max_file_size;			/* Maximum file size */
+	__u64	max_uid;			/* Maximum UID supported */
+	__u64	max_gid;			/* Maximum GID supported */
+	__u64	max_projid;			/* Maximum project ID supported */
+	__u32	max_dev_major;			/* Maximum device major representable */
+	__u32	max_dev_minor;			/* Maximum device minor representable */
+	__u32	max_hard_links;			/* Maximum number of hard links on a file */
+	__u32	max_xattr_body_len;		/* Maximum xattr content length */
+	__u32	max_xattr_name_len;		/* Maximum xattr name length */
+	__u32	max_filename_len;		/* Maximum filename length */
+	__u32	max_symlink_len;		/* Maximum symlink content length */
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_supports).
+ *
+ * What's supported in various masks, such as statx() attribute and mask bits
+ * and IOC flags.
+ */
+struct fsinfo_supports {
+	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
+	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
+	__u32	ioc_flags;		/* What FS_IOC_* flags are supported */
+	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_capabilities).
+ *
+ * Bitmask indicating filesystem capabilities where renderable as single bits.
+ */
+enum fsinfo_capability {
+	fsinfo_cap_is_kernel_fs		= 0,	/* fs is kernel-special filesystem */
+	fsinfo_cap_is_block_fs		= 1,	/* fs is block-based filesystem */
+	fsinfo_cap_is_flash_fs		= 2,	/* fs is flash filesystem */
+	fsinfo_cap_is_network_fs	= 3,	/* fs is network filesystem */
+	fsinfo_cap_is_automounter_fs	= 4,	/* fs is automounter special filesystem */
+	fsinfo_cap_automounts		= 5,	/* fs supports automounts */
+	fsinfo_cap_adv_locks		= 6,	/* fs supports advisory file locking */
+	fsinfo_cap_mand_locks		= 7,	/* fs supports mandatory file locking */
+	fsinfo_cap_leases		= 8,	/* fs supports file leases */
+	fsinfo_cap_uids			= 9,	/* fs supports numeric uids */
+	fsinfo_cap_gids			= 10,	/* fs supports numeric gids */
+	fsinfo_cap_projids		= 11,	/* fs supports numeric project ids */
+	fsinfo_cap_id_names		= 12,	/* fs supports user names */
+	fsinfo_cap_id_guids		= 13,	/* fs supports user guids */
+	fsinfo_cap_windows_attrs	= 14,	/* fs has windows attributes */
+	fsinfo_cap_user_quotas		= 15,	/* fs has per-user quotas */
+	fsinfo_cap_group_quotas		= 16,	/* fs has per-group quotas */
+	fsinfo_cap_project_quotas	= 17,	/* fs has per-project quotas */
+	fsinfo_cap_xattrs		= 18,	/* fs has xattrs */
+	fsinfo_cap_journal		= 19,	/* fs has a journal */
+	fsinfo_cap_data_is_journalled	= 20,	/* fs is using data journalling */
+	fsinfo_cap_o_sync		= 21,	/* fs supports O_SYNC */
+	fsinfo_cap_o_direct		= 22,	/* fs supports O_DIRECT */
+	fsinfo_cap_volume_id		= 23,	/* fs has a volume ID */
+	fsinfo_cap_volume_uuid		= 24,	/* fs has a volume UUID */
+	fsinfo_cap_volume_name		= 25,	/* fs has a volume name */
+	fsinfo_cap_volume_fsid		= 26,	/* fs has a volume FSID */
+	fsinfo_cap_cell_name		= 27,	/* fs has a cell name */
+	fsinfo_cap_domain_name		= 28,	/* fs has a domain name */
+	fsinfo_cap_realm_name		= 29,	/* fs has a realm name */
+	fsinfo_cap_iver_all_change	= 30,	/* i_version represents data + meta changes */
+	fsinfo_cap_iver_data_change	= 31,	/* i_version represents data changes only */
+	fsinfo_cap_iver_mono_incr	= 32,	/* i_version incremented monotonically */
+	fsinfo_cap_symlinks		= 33,	/* fs supports symlinks */
+	fsinfo_cap_hard_links		= 34,	/* fs supports hard links */
+	fsinfo_cap_hard_links_1dir	= 35,	/* fs supports hard links in same dir only */
+	fsinfo_cap_device_files		= 36,	/* fs supports bdev, cdev */
+	fsinfo_cap_unix_specials	= 37,	/* fs supports pipe, fifo, socket */
+	fsinfo_cap_resource_forks	= 38,	/* fs supports resource forks/streams */
+	fsinfo_cap_name_case_indep	= 39,	/* Filename case independence is mandatory */
+	fsinfo_cap_name_non_utf8	= 40,	/* fs has non-utf8 names */
+	fsinfo_cap_name_has_codepage	= 41,	/* fs has a filename codepage */
+	fsinfo_cap_sparse		= 42,	/* fs supports sparse files */
+	fsinfo_cap_not_persistent	= 43,	/* fs is not persistent */
+	fsinfo_cap_no_unix_mode		= 44,	/* fs does not support unix mode bits */
+	fsinfo_cap_has_atime		= 45,	/* fs supports access time */
+	fsinfo_cap_has_btime		= 46,	/* fs supports birth/creation time */
+	fsinfo_cap_has_ctime		= 47,	/* fs supports change time */
+	fsinfo_cap_has_mtime		= 48,	/* fs supports modification time */
+	fsinfo_cap__nr
+};
+
+struct fsinfo_capabilities {
+	__u8	capabilities[(fsinfo_cap__nr + 7) / 8];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_timestamp_info).
+ */
+struct fsinfo_timestamp_info {
+	__s64	minimum_timestamp;	/* Minimum timestamp value in seconds */
+	__s64	maximum_timestamp;	/* Maximum timestamp value in seconds */
+	__u16	atime_gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
+	__u16	btime_gran_mantissa;
+	__u16	ctime_gran_mantissa;
+	__u16	mtime_gran_mantissa;
+	__s8	atime_gran_exponent;
+	__s8	btime_gran_exponent;
+	__s8	ctime_gran_exponent;
+	__s8	mtime_gran_exponent;
+	__u32	__reserved[1];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_volume_uuid).
+ */
+struct fsinfo_volume_uuid {
+	__u8	uuid[16];
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_server_address).
+ *
+ * Find the Mth address of the Nth server for a network mount.
+ */
+struct fsinfo_server_address {
+	struct __kernel_sockaddr_storage address;
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_io_size).
+ *
+ * Retrieve I/O size hints for a filesystem.
+ */
+struct fsinfo_io_size {
+	__u32		dio_size_gran;	/* Size granularity for O_DIRECT */
+	__u32		dio_mem_align;	/* Memory alignment for O_DIRECT */
+};
+
+/*
+ * Information struct for fsinfo(fsinfo_attr_fsinfo).
+ *
+ * This gives information about fsinfo() itself.
+ */
+struct fsinfo_fsinfo {
+	__u32	max_attr;	/* Number of supported attributes (fsinfo_attr__nr) */
+	__u32	max_cap;	/* Number of supported capabilities (fsinfo_cap__nr) */
+};
+
+#endif /* _UAPI_LINUX_FSINFO_H */
diff --git a/samples/statx/Makefile b/samples/statx/Makefile
index 59df7c25a9d1..9cb9a88e3a10 100644
--- a/samples/statx/Makefile
+++ b/samples/statx/Makefile
@@ -1,7 +1,10 @@
 # List of programs to build
-hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
+hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx test-fsinfo
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 
 HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
+
+HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
+HOSTLOADLIBES_test-fsinfo += -lm
diff --git a/samples/statx/test-fsinfo.c b/samples/statx/test-fsinfo.c
new file mode 100644
index 000000000000..deab0081ecd1
--- /dev/null
+++ b/samples/statx/test-fsinfo.c
@@ -0,0 +1,539 @@
+/* Test the fsinfo() system call
+ *
+ * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define _GNU_SOURCE
+#define _ATFILE_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <unistd.h>
+#include <ctype.h>
+#include <errno.h>
+#include <time.h>
+#include <math.h>
+#include <fcntl.h>
+#include <sys/syscall.h>
+#include <linux/fsinfo.h>
+#include <linux/socket.h>
+#include <sys/stat.h>
+
+static __attribute__((unused))
+ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
+	       void *buffer, size_t buf_size)
+{
+	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
+}
+
+#define FSINFO_STRING(N)	 [fsinfo_attr_##N] = 0x00
+#define FSINFO_STRUCT(N)	 [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRING_N(N)	 [fsinfo_attr_##N] = 0x40
+#define FSINFO_STRUCT_N(N)	 [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+#define FSINFO_STRUCT_NM(N)	 [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
+static const __u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
+	FSINFO_STRUCT		(statfs),
+	FSINFO_STRUCT		(fsinfo),
+	FSINFO_STRUCT		(ids),
+	FSINFO_STRUCT		(limits),
+	FSINFO_STRUCT		(supports),
+	FSINFO_STRUCT		(capabilities),
+	FSINFO_STRUCT		(timestamp_info),
+	FSINFO_STRING		(volume_id),
+	FSINFO_STRUCT		(volume_uuid),
+	FSINFO_STRING		(volume_name),
+	FSINFO_STRING		(cell_name),
+	FSINFO_STRING		(domain_name),
+	FSINFO_STRING		(realm_name),
+	FSINFO_STRING_N		(server_name),
+	FSINFO_STRUCT_NM	(server_address),
+	FSINFO_STRING_N		(parameter),
+	FSINFO_STRING_N		(source),
+	FSINFO_STRING		(name_encoding),
+	FSINFO_STRING		(name_codepage),
+	FSINFO_STRUCT		(io_size),
+};
+
+#define FSINFO_NAME(N) [fsinfo_attr_##N] = #N
+static const char *fsinfo_attr_names[fsinfo_attr__nr] = {
+	FSINFO_NAME(statfs),
+	FSINFO_NAME(fsinfo),
+	FSINFO_NAME(ids),
+	FSINFO_NAME(limits),
+	FSINFO_NAME(supports),
+	FSINFO_NAME(capabilities),
+	FSINFO_NAME(timestamp_info),
+	FSINFO_NAME(volume_id),
+	FSINFO_NAME(volume_uuid),
+	FSINFO_NAME(volume_name),
+	FSINFO_NAME(cell_name),
+	FSINFO_NAME(domain_name),
+	FSINFO_NAME(realm_name),
+	FSINFO_NAME(server_name),
+	FSINFO_NAME(server_address),
+	FSINFO_NAME(parameter),
+	FSINFO_NAME(source),
+	FSINFO_NAME(name_encoding),
+	FSINFO_NAME(name_codepage),
+	FSINFO_NAME(io_size),
+};
+
+union reply {
+	char buffer[4096];
+	struct fsinfo_statfs statfs;
+	struct fsinfo_fsinfo fsinfo;
+	struct fsinfo_ids ids;
+	struct fsinfo_limits limits;
+	struct fsinfo_supports supports;
+	struct fsinfo_capabilities caps;
+	struct fsinfo_timestamp_info timestamps;
+	struct fsinfo_volume_uuid uuid;
+	struct fsinfo_server_address srv_addr;
+	struct fsinfo_io_size io_size;
+};
+
+static void dump_hex(unsigned int *data, int from, int to)
+{
+	unsigned offset, print_offset = 1, col = 0;
+
+	from /= 4;
+	to = (to + 3) / 4;
+
+	for (offset = from; offset < to; offset++) {
+		if (print_offset) {
+			printf("%04x: ", offset * 4);
+			print_offset = 0;
+		}
+		printf("%08x", data[offset]);
+		col++;
+		if ((col & 3) == 0) {
+			printf("\n");
+			print_offset = 1;
+		} else {
+			printf(" ");
+		}
+	}
+
+	if (!print_offset)
+		printf("\n");
+}
+
+static void dump_attr_statfs(union reply *r, int size)
+{
+	struct fsinfo_statfs *f = &r->statfs;
+
+	printf("\n");
+	printf("\tblocks: n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_blocks,
+	       (unsigned long long)f->f_bfree,
+	       (unsigned long long)f->f_bavail);
+
+	printf("\tfiles : n=%llu fr=%llu av=%llu\n",
+	       (unsigned long long)f->f_files,
+	       (unsigned long long)f->f_ffree,
+	       (unsigned long long)f->f_favail);
+	printf("\tbsize : %u\n", f->f_bsize);
+	printf("\tfrsize: %u\n", f->f_frsize);
+}
+
+static void dump_attr_fsinfo(union reply *r, int size)
+{
+	struct fsinfo_fsinfo *f = &r->fsinfo;
+
+	printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
+}
+
+static void dump_attr_ids(union reply *r, int size)
+{
+	struct fsinfo_ids *f = &r->ids;
+
+	printf("\n");
+	printf("\tdev   : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
+	printf("\tfs    : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
+	printf("\tflags : %llx\n", (unsigned long long)f->f_flags);
+	printf("\tfsid  : %llx\n", (unsigned long long)f->f_fsid);
+}
+
+static void dump_attr_limits(union reply *r, int size)
+{
+	struct fsinfo_limits *f = &r->limits;
+
+	printf("\n");
+	printf("\tmax file size: %llx\n", f->max_file_size);
+	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
+	       f->max_uid, f->max_gid, f->max_projid);
+	printf("\tmax dev      : maj=%x min=%x\n",
+	       f->max_dev_major, f->max_dev_minor);
+	printf("\tmax links    : %x\n", f->max_hard_links);
+	printf("\tmax xattr    : n=%x b=%x\n",
+	       f->max_xattr_name_len, f->max_xattr_body_len);
+	printf("\tmax len      : file=%x sym=%x\n",
+	       f->max_filename_len, f->max_symlink_len);
+}
+
+static void dump_attr_supports(union reply *r, int size)
+{
+	struct fsinfo_supports *f = &r->supports;
+
+	printf("\n");
+	printf("\tstx_attr=%llx\n", f->stx_attributes);
+	printf("\tstx_mask=%x\n", f->stx_mask);
+	printf("\tioc_flags=%x\n", f->ioc_flags);
+	printf("\twin_fattrs=%x\n", f->win_file_attrs);
+}
+
+#define FSINFO_CAP_NAME(C) [fsinfo_cap_##C] = #C
+static const char *fsinfo_cap_names[fsinfo_cap__nr] = {
+	FSINFO_CAP_NAME(is_kernel_fs),
+	FSINFO_CAP_NAME(is_block_fs),
+	FSINFO_CAP_NAME(is_flash_fs),
+	FSINFO_CAP_NAME(is_network_fs),
+	FSINFO_CAP_NAME(is_automounter_fs),
+	FSINFO_CAP_NAME(automounts),
+	FSINFO_CAP_NAME(adv_locks),
+	FSINFO_CAP_NAME(mand_locks),
+	FSINFO_CAP_NAME(leases),
+	FSINFO_CAP_NAME(uids),
+	FSINFO_CAP_NAME(gids),
+	FSINFO_CAP_NAME(projids),
+	FSINFO_CAP_NAME(id_names),
+	FSINFO_CAP_NAME(id_guids),
+	FSINFO_CAP_NAME(windows_attrs),
+	FSINFO_CAP_NAME(user_quotas),
+	FSINFO_CAP_NAME(group_quotas),
+	FSINFO_CAP_NAME(project_quotas),
+	FSINFO_CAP_NAME(xattrs),
+	FSINFO_CAP_NAME(journal),
+	FSINFO_CAP_NAME(data_is_journalled),
+	FSINFO_CAP_NAME(o_sync),
+	FSINFO_CAP_NAME(o_direct),
+	FSINFO_CAP_NAME(volume_id),
+	FSINFO_CAP_NAME(volume_uuid),
+	FSINFO_CAP_NAME(volume_name),
+	FSINFO_CAP_NAME(volume_fsid),
+	FSINFO_CAP_NAME(cell_name),
+	FSINFO_CAP_NAME(domain_name),
+	FSINFO_CAP_NAME(realm_name),
+	FSINFO_CAP_NAME(iver_all_change),
+	FSINFO_CAP_NAME(iver_data_change),
+	FSINFO_CAP_NAME(iver_mono_incr),
+	FSINFO_CAP_NAME(symlinks),
+	FSINFO_CAP_NAME(hard_links),
+	FSINFO_CAP_NAME(hard_links_1dir),
+	FSINFO_CAP_NAME(device_files),
+	FSINFO_CAP_NAME(unix_specials),
+	FSINFO_CAP_NAME(resource_forks),
+	FSINFO_CAP_NAME(name_case_indep),
+	FSINFO_CAP_NAME(name_non_utf8),
+	FSINFO_CAP_NAME(name_has_codepage),
+	FSINFO_CAP_NAME(sparse),
+	FSINFO_CAP_NAME(not_persistent),
+	FSINFO_CAP_NAME(no_unix_mode),
+	FSINFO_CAP_NAME(has_atime),
+	FSINFO_CAP_NAME(has_btime),
+	FSINFO_CAP_NAME(has_ctime),
+	FSINFO_CAP_NAME(has_mtime),
+};
+
+static void dump_attr_capabilities(union reply *r, int size)
+{
+	struct fsinfo_capabilities *f = &r->caps;
+	int i;
+
+	for (i = 0; i < sizeof(f->capabilities); i++)
+		printf("%02x", f->capabilities[i]);
+	printf("\n");
+	for (i = 0; i < fsinfo_cap__nr; i++)
+		if (f->capabilities[i / 8] & (1 << (i % 8)))
+			printf("\t- %s\n", fsinfo_cap_names[i]);
+}
+
+static void dump_attr_timestamp_info(union reply *r, int size)
+{
+	struct fsinfo_timestamp_info *f = &r->timestamps;
+
+	printf("range=%llx-%llx\n",
+	       (unsigned long long)f->minimum_timestamp,
+	       (unsigned long long)f->maximum_timestamp);
+
+#define print_time(G) \
+	printf("\t"#G"time : gran=%gs\n",			\
+	       (f->G##time_gran_mantissa *		\
+		pow(10., f->G##time_gran_exponent)))
+	print_time(a);
+	print_time(b);
+	print_time(c);
+	print_time(m);
+}
+
+static void dump_attr_volume_uuid(union reply *r, int size)
+{
+	struct fsinfo_volume_uuid *f = &r->uuid;
+
+	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
+	       "-%02x%02x%02x%02x%02x%02x\n",
+	       f->uuid[ 0], f->uuid[ 1],
+	       f->uuid[ 2], f->uuid[ 3],
+	       f->uuid[ 4], f->uuid[ 5],
+	       f->uuid[ 6], f->uuid[ 7],
+	       f->uuid[ 8], f->uuid[ 9],
+	       f->uuid[10], f->uuid[11],
+	       f->uuid[12], f->uuid[13],
+	       f->uuid[14], f->uuid[15]);
+}
+
+static void dump_attr_server_address(union reply *r, int size)
+{
+	struct fsinfo_server_address *f = &r->srv_addr;
+
+	printf("family=%u\n", f->address.ss_family);
+}
+
+static void dump_attr_io_size(union reply *r, int size)
+{
+	struct fsinfo_io_size *f = &r->io_size;
+
+	printf("dio_size=%u\n", f->dio_size_gran);
+}
+
+/*
+ * Dispatch table mapping each struct-type attribute to its dump function.
+ */
+typedef void (*dumper_t)(union reply *r, int size);
+
+#define FSINFO_DUMPER(N) [fsinfo_attr_##N] = dump_attr_##N
+static const dumper_t fsinfo_attr_dumper[fsinfo_attr__nr] = {
+	FSINFO_DUMPER(statfs),
+	FSINFO_DUMPER(fsinfo),
+	FSINFO_DUMPER(ids),
+	FSINFO_DUMPER(limits),
+	FSINFO_DUMPER(supports),
+	FSINFO_DUMPER(capabilities),
+	FSINFO_DUMPER(timestamp_info),
+	FSINFO_DUMPER(volume_uuid),
+	FSINFO_DUMPER(server_address),
+	FSINFO_DUMPER(io_size),
+};
+
+static void dump_fsinfo(enum fsinfo_attribute attr, __u8 about,
+			union reply *r, int size)
+{
+	dumper_t dumper = fsinfo_attr_dumper[attr];
+	unsigned int len;
+
+	if (!dumper) {
+		printf("<no dumper>\n");
+		return;
+	}
+
+	len = (about & 0x3f) * sizeof(__u32);
+	if (size < len) {
+		printf("<short data %u/%u>\n", size, len);
+		return;
+	}
+
+	dumper(r, size);
+}
+
+/*
+ * Try one subinstance of an attribute.
+ */
+static int try_one(const char *file, struct fsinfo_params *params, bool raw)
+{
+	union reply r;
+	char *p;
+	int ret;
+	__u8 about;
+
+	memset(&r.buffer, 0xbd, sizeof(r.buffer));
+
+	errno = 0;
+	ret = fsinfo(AT_FDCWD, file, params, r.buffer, sizeof(r.buffer));
+	if (params->request >= fsinfo_attr__nr) {
+		if (ret == -1 && errno == EOPNOTSUPP)
+			exit(0);
+		fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
+			params->request);
+		exit(1);
+	}
+
+	//printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
+	//       file, fsinfo_attr_names[params->request],
+	//       params->Nth, params->Mth, ret);
+
+	about = fsinfo_buffer_sizes[params->request];
+	if (ret == -1) {
+		if (errno == ENODATA) {
+			switch (about & 0xc0) {
+			case 0x00:
+				if (params->Nth == 0 && params->Mth == 0) {
+					fprintf(stderr,
+						"Unexpected ENODATA1 (%u[%u][%u])\n",
+						params->request, params->Nth, params->Mth);
+					exit(1);
+				}
+				break;
+			case 0x40:
+				if (params->Nth == 0 && params->Mth == 0) {
+					fprintf(stderr,
+						"Unexpected ENODATA2 (%u[%u][%u])\n",
+						params->request, params->Nth, params->Mth);
+					exit(1);
+				}
+				break;
+			}
+			return (params->Mth == 0) ? 2 : 1;
+		}
+		if (errno == EOPNOTSUPP) {
+			if (params->Nth > 0 || params->Mth > 0) {
+				fprintf(stderr,
+					"Should return -ENODATA (%u[%u][%u])\n",
+					params->request, params->Nth, params->Mth);
+				exit(1);
+			}
+			//printf("\e[33m%s\e[m: <not supported>\n",
+			//       fsinfo_attr_names[attr]);
+			return 2;
+		}
+		perror(file);
+		exit(1);
+	}
+
+	if (raw) {
+		if (ret > 4096)
+			ret = 4096;
+		dump_hex((unsigned int *)&r.buffer, 0, ret);
+		return 0;
+	}
+
+	switch (about & 0xc0) {
+	case 0x00:
+		printf("\e[33m%s\e[m: ",
+		       fsinfo_attr_names[params->request]);
+		break;
+	case 0x40:
+		printf("\e[33m%s[%u]\e[m: ",
+		       fsinfo_attr_names[params->request],
+		       params->Nth);
+		break;
+	case 0x80:
+		printf("\e[33m%s[%u][%u]\e[m: ",
+		       fsinfo_attr_names[params->request],
+		       params->Nth, params->Mth);
+		break;
+	}
+
+	switch (about) {
+		/* Struct */
+	case 0x01 ... 0x3f:
+	case 0x41 ... 0x7f:
+	case 0x81 ... 0xbf:
+		dump_fsinfo(params->request, about, &r, ret);
+		return 0;
+
+		/* String */
+	case 0x00:
+	case 0x40:
+	case 0x80:
+		if (ret >= 4096) {
+			ret = 4096;
+			r.buffer[4092] = '.';
+			r.buffer[4093] = '.';
+			r.buffer[4094] = '.';
+			r.buffer[4095] = 0;
+		} else {
+			r.buffer[ret] = 0;
+		}
+		for (p = r.buffer; *p; p++) {
+			if (!isprint((unsigned char)*p)) {
+				printf("<non-printable>\n");
+				return 0;
+			}
+		}
+		printf("%s\n", r.buffer);
+		return 0;
+
+	default:
+		fprintf(stderr, "Fishy about %u %02x\n", params->request, about);
+		exit(1);
+	}
+}
+
+/*
+ * Query every attribute, and every Nth/Mth value of each, for one file.
+ */
+int main(int argc, char **argv)
+{
+	struct fsinfo_params params = {
+		.at_flags = AT_SYMLINK_NOFOLLOW,
+	};
+	unsigned int attr;
+	int raw = 0, opt, Nth, Mth;
+
+	while ((opt = getopt(argc, argv, "alr")) != -1) {
+		switch (opt) {
+		case 'a':
+			params.at_flags |= AT_NO_AUTOMOUNT;
+			continue;
+		case 'l':
+			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
+			continue;
+		case 'r':
+			raw = 1;
+			continue;
+		}
+		break;
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1) {
+		printf("Format: test-fsinfo [-alr] <file>\n");
+		exit(2);
+	}
+
+	for (attr = 0; attr <= fsinfo_attr__nr; attr++) {
+		Nth = 0;
+		do {
+			Mth = 0;
+			do {
+				params.request = attr;
+				params.Nth = Nth;
+				params.Mth = Mth;
+
+				switch (try_one(argv[0], &params, raw)) {
+				case 0:
+					continue;
+				case 1:
+					goto done_M;
+				case 2:
+					goto done_N;
+				}
+			} while (++Mth < 100);
+
+		done_M:
+			if (Mth >= 100) {
+				fprintf(stderr, "Fishy: Mth == %u\n", Mth);
+				break;
+			}
+
+		} while (++Nth < 100);
+
+	done_N:
+		if (Nth >= 100) {
+			fprintf(stderr, "Fishy: Nth == %u\n", Nth);
+			break;
+		}
+	}
+
+	return 0;
+}

* Re: [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #10]
  2018-07-27 17:34 ` [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
@ 2018-07-27 19:27   ` Andy Lutomirski
  2018-07-27 19:43     ` Andy Lutomirski
  2018-07-27 22:09     ` David Howells
  2018-07-27 22:06   ` David Howells
  1 sibling, 2 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-27 19:27 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel



> On Jul 27, 2018, at 10:34 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Provide a system call by which a filesystem opened with fsopen() and
> configured by a series of writes can be mounted:
> 
>    int ret = fsmount(int fsfd, unsigned int flags,
>              unsigned int ms_flags);
> 
> where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
> FSMOUNT_CLOEXEC.  ms_flags is a bitwise-OR of the following flags:

I have a potentially silly objection. For the old timers, “mount” means to stick a reel of tape or some similar object onto a reader, which seems to imply that “mount” means to start up the filesystem. For younguns, this meaning is probably lost, and the more obvious meaning is to “mount” it into some location in the VFS hierarchy a la vfsmount. The patch description doesn’t disambiguate it, and obviously people used to mount(2)/mount(8) are just likely to be confused.

At the very least, your description should make it absolutely clear what you mean. Even better IMO would be to drop the use of the word “mount” entirely and maybe rename the syscall.

From a very brief reading, I think you are giving it the meaning that would be implied by fsstart(2).

> 
>    MS_RDONLY
>    MS_NOSUID
>    MS_NODEV
>    MS_NOEXEC
>    MS_NOATIME
>    MS_NODIRATIME
>    MS_RELATIME
>    MS_STRICTATIME
> 
>    MS_UNBINDABLE
>    MS_PRIVATE
>    MS_SLAVE
>    MS_SHARED
> 
> In the event that fsmount() fails, it may be possible to get an error
> message by calling read() on fsfd.  If no message is available, ENODATA
> will be reported.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: linux-api@vger.kernel.org
> ---
> 
> arch/x86/entry/syscalls/syscall_32.tbl |    1 
> arch/x86/entry/syscalls/syscall_64.tbl |    1 
> fs/namespace.c                         |  140 +++++++++++++++++++++++++++++++-
> include/linux/syscalls.h               |    1 
> include/uapi/linux/fs.h                |    2 
> 5 files changed, 141 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index f9970310c126..c78b68256f8a 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -402,3 +402,4 @@
> 388    i386    move_mount        sys_move_mount            __ia32_sys_move_mount
> 389    i386    fsopen            sys_fsopen            __ia32_sys_fsopen
> 390    i386    fsconfig        sys_fsconfig            __ia32_sys_fsconfig
> +391    i386    fsmount            sys_fsmount            __ia32_sys_fsmount
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 4185d36e03bb..d44ead5d4368 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -347,6 +347,7 @@
> 336    common    move_mount        __x64_sys_move_mount
> 337    common    fsopen            __x64_sys_fsopen
> 338    common    fsconfig        __x64_sys_fsconfig
> +339    common    fsmount            __x64_sys_fsmount
> 
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ea07066a2731..b1661b90256d 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2503,7 +2503,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
> 
>    attached = mnt_has_parent(old);
>    /*
> -     * We need to allow open_tree(OPEN_TREE_CLONE) followed by
> +     * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
>     * move_mount(), but mustn't allow "/" to be moved.
>     */
>    if (old->mnt_ns && !attached)
> @@ -3348,9 +3348,141 @@ struct vfsmount *kern_mount(struct file_system_type *type)
> EXPORT_SYMBOL_GPL(kern_mount);
> 
> /*
> - * Move a mount from one place to another.
> - * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
> - * used to copy a mount subtree.
> + * Create a kernel mount representation for a new, prepared superblock
> + * (specified by fs_fd) and attach to an open_tree-like file descriptor.
> + */
> +SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
> +{
> +    struct fs_context *fc;
> +    struct file *file;
> +    struct path newmount;
> +    struct fd f;
> +    unsigned int mnt_flags = 0;
> +    long ret;
> +
> +    if (!may_mount())
> +        return -EPERM;
> +
> +    if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
> +        return -EINVAL;
> +
> +    if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
> +             MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
> +             MS_STRICTATIME))
> +        return -EINVAL;
> +
> +    if (ms_flags & MS_RDONLY)
> +        mnt_flags |= MNT_READONLY;
> +    if (ms_flags & MS_NOSUID)
> +        mnt_flags |= MNT_NOSUID;
> +    if (ms_flags & MS_NODEV)
> +        mnt_flags |= MNT_NODEV;
> +    if (ms_flags & MS_NOEXEC)
> +        mnt_flags |= MNT_NOEXEC;
> +    if (ms_flags & MS_NODIRATIME)
> +        mnt_flags |= MNT_NODIRATIME;
> +
> +    if (ms_flags & MS_STRICTATIME) {
> +        if (ms_flags & MS_NOATIME)
> +            return -EINVAL;
> +    } else if (ms_flags & MS_NOATIME) {
> +        mnt_flags |= MNT_NOATIME;
> +    } else {
> +        mnt_flags |= MNT_RELATIME;
> +    }
> +
> +    f = fdget(fs_fd);
> +    if (!f.file)
> +        return -EBADF;
> +
> +    ret = -EINVAL;
> +    if (f.file->f_op != &fscontext_fops)
> +        goto err_fsfd;
> +
> +    fc = f.file->private_data;
> +
> +    /* There must be a valid superblock or we can't mount it */
> +    ret = -EINVAL;
> +    if (!fc->root)
> +        goto err_fsfd;
> +
> +    ret = -EPERM;
> +    if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
> +        pr_warn("VFS: Mount too revealing\n");
> +        goto err_fsfd;
> +    }
> +
> +    ret = mutex_lock_interruptible(&fc->uapi_mutex);
> +    if (ret < 0)
> +        goto err_fsfd;
> +
> +    ret = -EBUSY;
> +    if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
> +        goto err_unlock;
> +
> +    ret = -EPERM;
> +    if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
> +        goto err_unlock;
> +
> +    newmount.mnt = vfs_create_mount(fc, mnt_flags);
> +    if (IS_ERR(newmount.mnt)) {
> +        ret = PTR_ERR(newmount.mnt);
> +        goto err_unlock;
> +    }
> +    newmount.dentry = dget(fc->root);
> +
> +    /* We've done the mount bit - now move the file context into more or
> +     * less the same state as if we'd done an fspick().  We don't want to
> +     * do any memory allocation or anything like that at this point as we
> +     * don't want to have to handle any errors incurred.
> +     */
> +    if (fc->ops && fc->ops->free)
> +        fc->ops->free(fc);
> +    fc->fs_private = NULL;
> +    fc->s_fs_info = NULL;
> +    fc->sb_flags = 0;
> +    fc->sloppy = false;
> +    fc->silent = false;
> +    security_fs_context_free(fc);
> +    fc->security = NULL;
> +    kfree(fc->subtype);
> +    fc->subtype = NULL;
> +    kfree(fc->source);
> +    fc->source = NULL;
> +
> +    fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
> +    fc->phase = FS_CONTEXT_AWAITING_RECONF;
> +
> +    /* Attach to an apparent O_PATH fd with a note that we need to unmount
> +     * it, not just simply put it.
> +     */
> +    file = dentry_open(&newmount, O_PATH, fc->cred);
> +    if (IS_ERR(file)) {
> +        ret = PTR_ERR(file);
> +        goto err_path;
> +    }
> +    file->f_mode |= FMODE_NEED_UNMOUNT;
> +
> +    ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
> +    if (ret >= 0)
> +        fd_install(ret, file);
> +    else
> +        fput(file);
> +
> +err_path:
> +    path_put(&newmount);
> +err_unlock:
> +    mutex_unlock(&fc->uapi_mutex);
> +err_fsfd:
> +    fdput(f);
> +    return ret;
> +}
> +
> +/*
> + * Move a mount from one place to another.  In combination with
> + * fsopen()/fsmount() this is used to install a new mount and in combination
> + * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
> + * a mount subtree.
>  *
>  * Note the flags value is a combination of MOVE_MOUNT_* flags.
>  */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 9628d14a7ede..65db661cc2da 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -907,6 +907,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
> asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
> asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
>                 const void __user *value, int aux);
> +asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
> 
> /*
>  * Architecture-specific system calls
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 7c9e165e8689..297362908d01 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -349,6 +349,8 @@ typedef int __bitwise __kernel_rwf_t;
>  */
> #define FSOPEN_CLOEXEC        0x00000001
> 
> +#define FSMOUNT_CLOEXEC        0x00000001
> +
> /*
>  * The type of fsconfig() call made.
>  */
> 
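
For reference, the ms_flags handling in the quoted fsmount() hunk can be
modelled in userspace.  This is only a sketch: the SK_MNT_* values are made-up
stand-ins for the kernel-internal MNT_* bits and sketch_ms_to_mnt() is a
hypothetical name.  It also shows that the mask in the hunk rejects the
propagation flags (MS_SHARED and friends) even though the description lists
them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>	/* MS_* values */

/* Illustrative stand-ins for the kernel-internal MNT_* bits. */
enum {
	SK_MNT_NOSUID		= 0x01,
	SK_MNT_NODEV		= 0x02,
	SK_MNT_NOEXEC		= 0x04,
	SK_MNT_NOATIME		= 0x08,
	SK_MNT_NODIRATIME	= 0x10,
	SK_MNT_RELATIME		= 0x20,
	SK_MNT_READONLY		= 0x40,
};

/* Returns 0 and fills *mnt_flags, or -EINVAL for a bad combination. */
static int sketch_ms_to_mnt(unsigned long ms_flags, unsigned int *mnt_flags)
{
	const unsigned long allowed =
		MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
		MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_STRICTATIME;
	unsigned int out = 0;

	assert(mnt_flags != NULL);

	/* Anything outside the mask, e.g. MS_SHARED, is refused. */
	if (ms_flags & ~allowed)
		return -EINVAL;

	if (ms_flags & MS_RDONLY)
		out |= SK_MNT_READONLY;
	if (ms_flags & MS_NOSUID)
		out |= SK_MNT_NOSUID;
	if (ms_flags & MS_NODEV)
		out |= SK_MNT_NODEV;
	if (ms_flags & MS_NOEXEC)
		out |= SK_MNT_NOEXEC;
	if (ms_flags & MS_NODIRATIME)
		out |= SK_MNT_NODIRATIME;

	/* strictatime and noatime are mutually exclusive; relatime is the default. */
	if (ms_flags & MS_STRICTATIME) {
		if (ms_flags & MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & MS_NOATIME) {
		out |= SK_MNT_NOATIME;
	} else {
		out |= SK_MNT_RELATIME;
	}

	*mnt_flags = out;
	return 0;
}
```

With no atime flag at all the model falls through to relatime, and
MS_RELATIME itself is accepted without any further effect, matching the final
else branch of the hunk.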

* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
@ 2018-07-27 19:35   ` Andy Lutomirski
  2018-07-27 22:12   ` David Howells
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-27 19:35 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Linux API, Linus Torvalds, Linux FS Devel, LKML

On Fri, Jul 27, 2018 at 10:35 AM, David Howells <dhowells@redhat.com> wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.

Has anyone seriously reviewed this?  It might make sense to defer this
to a followup patch set.  Also:

> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
>
>         fsinfo_attr_statfs              - statfs-style info
>         fsinfo_attr_fsinfo              - Information about fsinfo()

Constants are almost always all caps.  Is there any reason these are lowercase?

* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
@ 2018-07-27 19:42   ` Andy Lutomirski
  2018-07-27 21:51   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-27 19:42 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Linux API, Linus Torvalds, Linux FS Devel, LKML

On Fri, Jul 27, 2018 at 10:34 AM, David Howells <dhowells@redhat.com> wrote:
>  (*) fsconfig_set_path: A non-empty path is specified.  The parameter must
>      be expecting a path object.  value points to a NUL-terminated string
>      that is the path and aux is a file descriptor at which to start a
>      relative lookup or AT_FDCWD.
>
>  (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
>      implied.
>
>  (*) fsconfig_set_fd: An open file descriptor is specified.  value must
>      be NULL and aux indicates the file descriptor.

Unless I'm rather confused, you have two or possibly three ways to
pass in an open fd.  Can you clarify what the difference is and/or
remove all but one of them?

* Re: [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #10]
  2018-07-27 19:27   ` Andy Lutomirski
@ 2018-07-27 19:43     ` Andy Lutomirski
  2018-07-27 22:09     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-27 19:43 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Linux API, Linus Torvalds, Linux FS Devel, LKML

On Fri, Jul 27, 2018 at 12:27 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
>
>> On Jul 27, 2018, at 10:34 AM, David Howells <dhowells@redhat.com> wrote:
>>
>> Provide a system call by which a filesystem opened with fsopen() and
>> configured by a series of writes can be mounted:
>>
>>    int ret = fsmount(int fsfd, unsigned int flags,
>>              unsigned int ms_flags);
>>
>> where fsfd is the file descriptor returned by fsopen().  flags can be 0 or
>> FSMOUNT_CLOEXEC.  ms_flags is a bitwise-OR of the following flags:
>
> I have a potentially silly objection. For the old timers, “mount” means to stick a reel of tape or some similar object onto a reader, which seems to imply that “mount” means to start up the filesystem. For younguns, this meaning is probably lost, and the more obvious meaning is to “mount” it into some location in the VFS hierarchy a la vfsmount. The patch description doesn’t disambiguate it, and obviously people used to mount(2)/mount(8) are just likely to be confused.
>
> At the very least, your description should make it absolutely clear what you mean. Even better IMO would be to drop the use of the word “mount” entirely and maybe rename the syscall.
>
> From a very brief reading, I think you are giving it the meaning that would be implied by fsstart(2).
>

After further reading, maybe what you actually mean is:

int mfd = fsmount(...);

where you pass in an fscontext fd and get out an fd referring to the
root of the filesystem?  In this case, maybe fs_open_root(2) would be
a better name.

This *definitely* needs to be clearer in the description.

* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
  2018-07-27 19:42   ` Andy Lutomirski
@ 2018-07-27 21:51   ` David Howells
  2018-07-27 21:57     ` Andy Lutomirski
  2018-07-27 22:27     ` David Howells
  2018-07-27 22:32   ` Jann Horn
  2018-07-29  8:50   ` David Howells
  3 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 21:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, Linux FS Devel,
	LKML

Andy Lutomirski <luto@amacapital.net> wrote:

> Unless I'm rather confused, you have two or possibly three ways to
> pass in an open fd.  Can you clarify what the difference is and/or
> remove all but one of them?

No, they're not equivalent.

> >  (*) fsconfig_set_path: A non-empty path is specified.  The parameter must
> >      be expecting a path object.  value points to a NUL-terminated string
> >      that is the path and aux is a file descriptor at which to start a
> >      relative lookup or AT_FDCWD.

So, an example:

	fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);

I don't want to require that the caller open /dev/sda1 and pass in an fd as
that might prevent the filesystem from "holding" it exclusively.

> >  (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
> >      implied.

You can't do:

	fsconfig(fd, fsconfig_set_path, "source", "", dir_fd);

because AT_EMPTY_PATH cannot be specified directly[*].  What you do instead is:

	fsconfig(fd, fsconfig_set_path_empty, "source", "", dir_fd);

[*] Not without a 6-arg syscall or some other way of passing it.

I *could* require that the caller must call open(O_PATH) or openat(O_PATH)
before calling fsconfig() - so you don't pass a string, but only a path-fd.

> >  (*) fsconfig_set_fd: An open file descriptor is specified.  value must
> >      be NULL and aux indicates the file descriptor.

See fd=%u on fuse.  I think it's cleaner to do:

	fsconfig(fd, fsconfig_set_fd, "source", NULL, control_fd);

saying explicitly that there's an open file to be passed rather than:

	fsconfig(fd, fsconfig_set_path, "source", NULL, control_fd);

which indicates that you are actually providing a path.

David

* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 21:51   ` David Howells
@ 2018-07-27 21:57     ` Andy Lutomirski
  2018-07-27 22:27     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Andy Lutomirski @ 2018-07-27 21:57 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Linux API, Linus Torvalds, Linux FS Devel, LKML

On Fri, Jul 27, 2018 at 2:51 PM, David Howells <dhowells@redhat.com> wrote:
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> Unless I'm rather confused, you have two or possibly three ways to
>> pass in an open fd.  Can you clarify what the difference is and/or
>> remove all but one of them?
>
> No, they're not equivalent.
>
>> >  (*) fsconfig_set_path: A non-empty path is specified.  The parameter must
>> >      be expecting a path object.  value points to a NUL-terminated string
>> >      that is the path and aux is a file descriptor at which to start a
>> >      relative lookup or AT_FDCWD.
>
> So, an example:
>
>         fsconfig(fd, fsconfig_set_path, "source", "/dev/sda1", AT_FDCWD);
>
> I don't want to require that the caller open /dev/sda1 and pass in an fd as
> that might prevent the filesystem from "holding" it exclusively.
>
>> >  (*) fsconfig_set_path_empty: As fsconfig_set_path, but with AT_EMPTY_PATH
>> >      implied.
>
> You can't do:
>
>         fsconfig(fd, fsconfig_set_path, "source", "", dir_fd);
>
> because AT_EMPTY_PATH cannot be specified directly[*].  What you do instead is:
>
>         fsconfig(fd, fsconfig_set_path_empty, "source", "", dir_fd);
>
> [*] Not without a 6-arg syscall or some other way of passing it.

Are there still architectures that have problems with 6-arg syscalls?

>
> I *could* require that the caller must call open(O_PATH) or openat(O_PATH)
> before calling fsconfig() - so you don't pass a string, but only a path-fd.
>
>> >  (*) fsconfig_set_fd: An open file descriptor is specified.  value must
>> >      be NULL and aux indicates the file descriptor.
>
> See fd=%u on fuse.  I think it's cleaner to do:
>
>         fsconfig(fd, fsconfig_set_fd, "source", NULL, control_fd);
>
> saying explicitly that there's an open file to be passed rather than:
>
>         fsconfig(fd, fsconfig_set_path, "source", NULL, control_fd);

Hmm.  That should probably be clearly documented.  I suppose that, as
long as there is never a case where fsconfig_set_path and
fsconfig_set_fd both succeed, then it's not a big deal.

* Re: [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #10]
  2018-07-27 17:34 ` [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
  2018-07-27 19:27   ` Andy Lutomirski
@ 2018-07-27 22:06   ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 22:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, viro, linux-api, torvalds, linux-fsdevel, linux-kernel

Andy Lutomirski <luto@amacapital.net> wrote:

> I have a potentially silly objection. For the old timers, "mount" means to
> stick a reel of tape or some similar object onto a reader, which seems to
> imply that "mount" means to start up the filesystem. For younguns, this
> meaning is probably lost, and the more obvious meaning is to "mount" it into
> some location in the VFS hierarchy a la vfsmount. The patch description
> doesn't disambiguate it, and obviously people used to mount(2)/mount(8) are
> just likely to be confused.

The problem is that inside the kernel it *is* a "mount".

How about I change the first paragraph to:

	Provide a system call by which a filesystem opened with fsopen() and
	configured by a series of fsconfig() calls can have a detached mount
	object created for it.  This mount object can then be attached to the
	VFS mount hierarchy using move_mount() by passing the returned file
	descriptor as the from directory fd.

> At the very least, your description should make it absolutely clear what you
> mean. Even better IMO would be to drop the use of the word "mount" entirely

I'm not sure that's a reasonable idea, given the "mounting" is how this is
done.

Can you suggest a word that encapsulates what it is that fsmount() returns?
It's almost, but not quite, identical to what open(O_PATH) returns, since it
has to be torn down if not actually mounted somewhere when the fd is closed.

> and maybe rename the syscall.
> 
> From a very brief reading, I think you are giving it the meaning that would
> be implied by fsstart(2).

Do you have a reference for the manpage for that?  Google doesn't seem to find
it.

David

* Re: [PATCH 30/38] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #10]
  2018-07-27 19:27   ` Andy Lutomirski
  2018-07-27 19:43     ` Andy Lutomirski
@ 2018-07-27 22:09     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 22:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, Linux FS Devel,
	LKML

Andy Lutomirski <luto@amacapital.net> wrote:

> int mfd = fsmount(...);
> 
> where you pass in an fscontext fd and get out an fd referring to the
> root of the filesystem?  In this case, maybe fs_open_root(2) would be
> a better name.

It's not necessarily the root of the filesystem in the sense of sb->s_root.
It might be a subset of that, or it might be a part of a filesystem that might
have multiple roots because it doesn't know where the real root is (NFS2, for
example).

> This *definitely* needs to be clearer in the description.

I'm open to suggestions of better wording.  It's a bit hard to explain
because, as you pointed out, the terminology is overloaded.

David

* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
  2018-07-27 19:35   ` Andy Lutomirski
@ 2018-07-27 22:12   ` David Howells
  2018-07-27 23:14   ` Jann Horn
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 22:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, Linux FS Devel,
	LKML

Andy Lutomirski <luto@amacapital.net> wrote:

> > Add a system call to allow filesystem information to be queried.  A request
> > value can be given to indicate the desired attribute.  Support is provided
> > for enumerating multi-value attributes.
> 
> Has anyone seriously reviewed this?

I don't know.  I've certainly posted it before.

> > params->request indicates the attribute/attributes to be queried.  This can
> > be one of:
> >
> >         fsinfo_attr_statfs              - statfs-style info
> >         fsinfo_attr_fsinfo              - Information about fsinfo()
> 
> Constants are almost always all caps.  Is there any reason these are
> lowercase?

It looks better IMO, particularly for enum constants.  I'm not sure if there
are any rules about this in system vs user definitions.

David

* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 21:51   ` David Howells
  2018-07-27 21:57     ` Andy Lutomirski
@ 2018-07-27 22:27     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 22:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, Linux FS Devel,
	LKML

Andy Lutomirski <luto@amacapital.net> wrote:

> > [*] Not without a 6-arg syscall or some other way of passing it.
> 
> Are there still architectures that have problems with 6-arg syscalls?

As I understand it, 6-arg syscalls are frowned upon.

> I suppose that, as long as there is never a case where fsconfig_set_path and
> fsconfig_set_fd both succeed, then it's not a big deal.

fsconfig_set_path/path_empty requires the 'value' argument to point to a
string, possibly "", and fsconfig_set_fd requires it to be NULL.
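
As a rough illustration of those rules (a userspace model, not the patch's
code): the enum values below are arbitrary, sketch_fsconfig_check() is a
hypothetical name, and the specific error codes are assumptions rather than
something stated in this thread.

```c
#include <assert.h>	/* for the example check in the note below */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* Linux value, in case <fcntl.h> hides it */
#endif

/* Command names as used in the thread; numeric values are arbitrary. */
enum sketch_fsconfig_cmd {
	fsconfig_set_path,
	fsconfig_set_path_empty,
	fsconfig_set_fd,
};

/* Returns 0 if the argument shape is acceptable, else a negative errno. */
static int sketch_fsconfig_check(enum sketch_fsconfig_cmd cmd, const char *key,
				 const char *value, int aux)
{
	if (!key)
		return -EINVAL;

	switch (cmd) {
	case fsconfig_set_path:
		/* value is a path string; aux is a dirfd or AT_FDCWD. */
		if (!value || (aux != AT_FDCWD && aux < 0))
			return -EINVAL;
		if (value[0] == '\0')
			return -ENOENT;	/* assumed: "" needs set_path_empty */
		return 0;
	case fsconfig_set_path_empty:
		/* As above, but "" is allowed (AT_EMPTY_PATH implied). */
		if (!value || (aux != AT_FDCWD && aux < 0))
			return -EINVAL;
		return 0;
	case fsconfig_set_fd:
		/* value must be NULL; aux carries the file descriptor. */
		if (value || aux < 0)
			return -EINVAL;
		return 0;
	}
	return -EOPNOTSUPP;
}
```

Because set_path and set_path_empty insist on a string and set_fd insists on
NULL, no single argument tuple is accepted by both families in this model.
For instance:

	assert(sketch_fsconfig_check(fsconfig_set_fd, "source", NULL, 3) == 0);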

I can't stop you from doing:

	fd = open("/some/path", O_PATH);
	fsconfig(fsfd, fsconfig_set_fd, "fd", NULL, fd);

or:

	fd = open("/dev/sda6", O_RDWR);
	fsconfig(fsfd, fsconfig_set_path_empty, "foo", "", fd);

The first should fail because I'm using fget(), not fget_raw(), and fget()
refuses O_PATH descriptors.  The second will pass the string and fd number to
the filesystem, which will presumably then call fs_lookup_param() to invoke
pathwalk upon it - which will likely also fail.

David

* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
  2018-07-27 19:42   ` Andy Lutomirski
  2018-07-27 21:51   ` David Howells
@ 2018-07-27 22:32   ` Jann Horn
  2018-07-29  8:50   ` David Howells
  3 siblings, 0 replies; 34+ messages in thread
From: Jann Horn @ 2018-07-27 22:32 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linux API, Linus Torvalds, linux-fsdevel, kernel list

On Fri, Jul 27, 2018 at 7:34 PM David Howells <dhowells@redhat.com> wrote:
>
> Add a syscall for configuring a filesystem creation context and triggering
> actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
>
>     long fsconfig(int fs_fd, unsigned int cmd, const char *key,
>                   const void *value, int aux);
>
> Where fs_fd indicates the context, cmd indicates the action to take, key
> indicates the parameter name for parameter-setting actions and, if needed,
> value points to a buffer containing the value and aux can give more
> information for the value.
[...]
> +SYSCALL_DEFINE5(fsconfig,
> +               int, fd,
> +               unsigned int, cmd,
> +               const char __user *, _key,
> +               const void __user *, _value,
> +               int, aux)
> +{
[...]
> +       switch (cmd) {
[...]
> +       case fsconfig_set_binary:
> +               if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
> +                       return -EINVAL;
> +               break;
[...]
> +       }
> +
> +       f = fdget(fd);
> +       if (!f.file)
> +               return -EBADF;
> +       ret = -EINVAL;
> +       if (f.file->f_op != &fscontext_fops)
> +               goto out_f;

We should probably add an fdget_typed(fd, fops) helper, or something
like that, to file.h at some point... there are probably dozens of
such invocations across the kernel at this point, each one with a
couple lines of boilerplate to deal with the two separate error paths.
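
A userspace toy model of what such a helper might look like.  Everything here
(the sk_ prefix, the fixed-size fd table, returning the error through a
pointer) is invented for illustration and is not an existing kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins for struct file_operations and struct file. */
struct sk_file_operations {
	const char *name;
};

struct sk_file {
	const struct sk_file_operations *f_op;
	void *private_data;
};

#define SK_MAX_FDS 8
static struct sk_file *sk_fd_table[SK_MAX_FDS];

/*
 * Look up fd and verify its operations table in one step.  Returns the file
 * with *err set to 0, or NULL with *err set to -EBADF (no such fd) or
 * -EINVAL (fd is open but is of the wrong type).
 */
static struct sk_file *sk_fdget_typed(int fd,
				      const struct sk_file_operations *fops,
				      int *err)
{
	struct sk_file *f;

	assert(err != NULL);

	if (fd < 0 || fd >= SK_MAX_FDS || !sk_fd_table[fd]) {
		*err = -EBADF;
		return NULL;
	}

	f = sk_fd_table[fd];
	if (f->f_op != fops) {
		*err = -EINVAL;
		return NULL;
	}

	*err = 0;
	return f;
}
```

The model ignores reference counting entirely; a real helper would also have
to drop the reference on the wrong-type path, which is exactly the boilerplate
being complained about.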

[...]
> +       case fsconfig_set_binary:
> +               param.type = fs_value_is_blob;
> +               param.size = aux;
> +               param.blob = memdup_user_nul(_value, aux);
> +               if (IS_ERR(param.blob)) {
> +                       ret = PTR_ERR(param.blob);
> +                       goto out_key;
> +               }
> +               break;

This means that a namespace admin (iow, an unprivileged user) can
allocate 1MB of unswappable kmalloc memory per userspace task, right?
Using userfaultfd or FUSE, you can then stall the task as long as you
want while it has that allocation. Is that problematic, or is that
normal?

* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
  2018-07-27 19:35   ` Andy Lutomirski
  2018-07-27 22:12   ` David Howells
@ 2018-07-27 23:14   ` Jann Horn
  2018-07-27 23:49   ` David Howells
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jann Horn @ 2018-07-27 23:14 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linux API, Linus Torvalds, linux-fsdevel, kernel list

On Fri, Jul 27, 2018 at 7:36 PM David Howells <dhowells@redhat.com> wrote:
>
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
[...]
> +static int fsinfo_generic_ids(struct dentry *dentry,
> +                             struct fsinfo_ids *p)
> +{
[...]
> +       strcpy(p->f_fs_name, dentry->d_sb->s_type->name);

Can you use strlcpy() instead? From a quick look, I don't see anything
that actually limits the size of filesystem names, even though
everything in-kernel probably fits into the 16 bytes you've allocated
for the name.
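
A minimal userspace sketch of the bounded copy being asked for.  strlcpy() is
not available in every userspace libc, so the copy is open-coded here as
sk_strlcpy(); struct sk_fsinfo_ids and sk_fill_fs_name() are stand-ins, and
only the 16-byte field size is taken from the remark above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * strlcpy-like copy: writes at most size - 1 bytes plus a terminating NUL
 * and returns strlen(src), so the caller can detect truncation.
 */
static size_t sk_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	assert(dst != NULL);

	if (size != 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Stand-in for the name field of struct fsinfo_ids. */
struct sk_fsinfo_ids {
	char f_fs_name[16];
};

/* Returns 1 if the name had to be truncated to fit, 0 otherwise. */
static int sk_fill_fs_name(struct sk_fsinfo_ids *p, const char *name)
{
	size_t len = sk_strlcpy(p->f_fs_name, name, sizeof(p->f_fs_name));

	return len >= sizeof(p->f_fs_name);
}
```

Unlike strcpy(), an over-long filesystem name is cut to 15 characters plus
the NUL instead of running past the end of the field.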

[...]
> +static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
> +{
> +       static const char encoding[] = "utf8";
> +
> +       if (buf)
> +               memcpy(buf, encoding, sizeof(encoding) - 1);
> +       return sizeof(encoding) - 1;
> +}

Is this meant to be "encoding to be used by userspace" or "encoding of
on-disk filenames"? If the former: That's always utf8, right? Are
there any plans to create filesystems that behave differently? If the
latter: This is wrong for e.g. a vfat mount that uses a codepage,
right? Should the default in that case not be "I don't know"?


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (2 preceding siblings ...)
  2018-07-27 23:14   ` Jann Horn
@ 2018-07-27 23:49   ` David Howells
  2018-07-28  0:14     ` Anton Altaparmakov
  2018-07-27 23:51   ` David Howells
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2018-07-27 23:49 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
	kernel list

Jann Horn <jannh@google.com> wrote:

> > +static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
> > +{
> > +       static const char encoding[] = "utf8";
> > +
> > +       if (buf)
> > +               memcpy(buf, encoding, sizeof(encoding) - 1);
> > +       return sizeof(encoding) - 1;
> > +}
> 
> Is this meant to be "encoding to be used by userspace" or "encoding of
> on-disk filenames"?

The latter.

> Are there any plans to create filesystems that behave differently?

isofs, fat, ntfs, cifs for example.

> If the latter: This is wrong for e.g. a vfat mount that uses a codepage,
> right?  Should the default in that case not be "I don't know"?

Quite possibly.  Note that it could also be what you're interpreting it as
because the codepage got overridden by a mount parameter rather than what's on
the disk (assuming the medium actually records this).

One thing I'm confused about is that fat has both a codepage and a charset and
I'm not sure of the difference.

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (3 preceding siblings ...)
  2018-07-27 23:49   ` David Howells
@ 2018-07-27 23:51   ` David Howells
  2018-07-27 23:58     ` Jann Horn
  2018-07-28  0:08     ` David Howells
  2018-07-30 14:48   ` David Howells
                     ` (5 subsequent siblings)
  10 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2018-07-27 23:51 UTC (permalink / raw)
  Cc: dhowells, Jann Horn, Al Viro, Linux API, Linus Torvalds,
	linux-fsdevel, kernel list

David Howells <dhowells@redhat.com> wrote:

> One thing I'm confused about is that fat has both a codepage and a charset and
> I'm not sure of the difference.

In fact, it's not clear that the codepage is actually used.

	warthog>git grep '[.>]codepage'
	fs/fat/inode.c: opts->codepage = fat_default_codepage;
	fs/fat/inode.c:                 opts->codepage = option;
	fs/fat/inode.c: sprintf(buf, "cp%d", sbi->options.codepage);

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 23:51   ` David Howells
@ 2018-07-27 23:58     ` Jann Horn
  2018-07-28  0:08     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Jann Horn @ 2018-07-27 23:58 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linux API, Linus Torvalds, linux-fsdevel, kernel list

On Sat, Jul 28, 2018 at 1:51 AM David Howells <dhowells@redhat.com> wrote:
> David Howells <dhowells@redhat.com> wrote:
>
> > One thing I'm confused about is that fat has both a codepage and a charset and
> > I'm not sure of the difference.
>
> In fact, it's not clear that the codepage is actually used.
>
>         warthog>git grep '[.>]codepage'
>         fs/fat/inode.c: opts->codepage = fat_default_codepage;
>         fs/fat/inode.c:                 opts->codepage = option;
>         fs/fat/inode.c: sprintf(buf, "cp%d", sbi->options.codepage);

        sprintf(buf, "cp%d", sbi->options.codepage);
        sbi->nls_disk = load_nls(buf);
        if (!sbi->nls_disk) {
                fat_msg(sb, KERN_ERR, "codepage %s not found", buf);
                goto out_fail;
        }


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 23:51   ` David Howells
  2018-07-27 23:58     ` Jann Horn
@ 2018-07-28  0:08     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-28  0:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
	kernel list

Jann Horn <jannh@google.com> wrote:

> >         fs/fat/inode.c: sprintf(buf, "cp%d", sbi->options.codepage);
> 
>         sprintf(buf, "cp%d", sbi->options.codepage);
>         sbi->nls_disk = load_nls(buf);
>         if (!sbi->nls_disk) {
>                 fat_msg(sb, KERN_ERR, "codepage %s not found", buf);
>                 goto out_fail;
>         }

Sorry, yes.  I was reading the print as the part of the display of superblock
parameters.

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 23:49   ` David Howells
@ 2018-07-28  0:14     ` Anton Altaparmakov
  0 siblings, 0 replies; 34+ messages in thread
From: Anton Altaparmakov @ 2018-07-28  0:14 UTC (permalink / raw)
  To: David Howells
  Cc: Jann Horn, Al Viro, Linux API, Linus Torvalds,
	linux-fsdevel@vger.kernel.org, kernel list

Hi David,

> On 28 Jul 2018, at 00:49, David Howells <dhowells@redhat.com> wrote:
> Jann Horn <jannh@google.com> wrote:
>>> +static int fsinfo_generic_name_encoding(struct dentry *dentry, char *buf)
>>> +{
>>> +       static const char encoding[] = "utf8";
>>> +
>>> +       if (buf)
>>> +               memcpy(buf, encoding, sizeof(encoding) - 1);
>>> +       return sizeof(encoding) - 1;
>>> +}
>> 
>> Is this meant to be "encoding to be used by userspace" or "encoding of
>> on-disk filenames"?
> 
> The latter.
> 
>> Are there any plans to create filesystems that behave differently?
> 
> isofs, fat, ntfs, cifs for example.
> 
>> If the latter: This is wrong for e.g. a vfat mount that uses a codepage,
>> right?  Should the default in that case not be "I don't know"?
> 
> Quite possibly.  Note that it could also be what you're interpreting it as
> because the codepage got overridden by a mount parameter rather than what's on
> the disk (assuming the medium actually records this).

No, nothing like that is recorded on disk.  That would have been way too helpful!  (-;  The only place Windows records such information is, you may have guessed this: in the registry which of course is local to the computer and unrelated to what removable media is attached...

> One thing I'm confused about is that fat has both a codepage and a charset and
> I'm not sure of the difference.

Oh that is quite simple.  (-:

The codepage is what is used to translate from/to the on-disk DOS 8.3 style names into the kernel's Unicode character representation.  The correct codepage for a particular volume is not stored on disk, so it can lead to all sorts of fun if you, for example, create some names on a Japanese Windows on a FAT formatted USB stick and then plug that into a US or European Windows where the default code pages are completely different - all your filenames will appear totally corrupt.  (Note this ONLY affects 8.3 style/DOS/short names or whatever you want to call them.)

The charset on the other hand is what is used to convert strings coming in from/going out to userspace into the kernel's Unicode character representation.

The one nice thing about VFAT (and there aren't many nice things about it!) is that for long names (i.e. not the 8.3 style/DOS/short names), it actually stores on-disk little-endian UTF-16 (since Windows 2000, before that it used little-endian UCS-2 - the change was needed to support things like emoji and some languages that go outside the UCS-2 range of fixed 16-bit Unicode).
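
As a rough illustration of the UCS-2 versus UTF-16 difference for long names:
a UCS-2 reader treats every 16-bit unit as one character, whereas a UTF-16
reader has to combine surrogate pairs.  The helper below is a hypothetical
userspace sketch, not taken from fs/fat.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one code point from little-endian UTF-16 bytes.
 * Returns the number of bytes consumed (2 or 4), or 0 if the input is
 * truncated or malformed.  Hypothetical helper, not kernel code.
 */
static size_t utf16le_decode(const uint8_t *p, size_t len, uint32_t *cp)
{
	uint32_t hi, lo;

	if (len < 2)
		return 0;
	hi = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
	if (hi < 0xD800 || hi > 0xDFFF) {
		*cp = hi;	/* BMP character: same value as in UCS-2 */
		return 2;
	}
	if (hi > 0xDBFF || len < 4)
		return 0;	/* lone low surrogate or truncated pair */
	lo = (uint32_t)p[2] | ((uint32_t)p[3] << 8);
	if (lo < 0xDC00 || lo > 0xDFFF)
		return 0;	/* high surrogate not followed by a low one */
	*cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
	return 4;
}
```

For example, the bytes 3D D8 00 DE form one surrogate pair that decodes to
U+1F600, which a pure UCS-2 reader could not represent.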

Hope this clears that up.

Best regards,

	Anton

> David

-- 
Anton Altaparmakov <anton at tuxera.com> (replace at with @)
Lead in File System Development, Tuxera Inc., http://www.tuxera.com/
Linux NTFS maintainer


* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-27 17:34 ` [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
                     ` (2 preceding siblings ...)
  2018-07-27 22:32   ` Jann Horn
@ 2018-07-29  8:50   ` David Howells
  2018-07-29 11:14     ` Jann Horn
  2018-07-30 12:32     ` David Howells
  3 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2018-07-29  8:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
	kernel list

Jann Horn <jannh@google.com> wrote:

> [...]
> > +       case fsconfig_set_binary:
> > +               param.type = fs_value_is_blob;
> > +               param.size = aux;
> > +               param.blob = memdup_user_nul(_value, aux);
> > +               if (IS_ERR(param.blob)) {
> > +                       ret = PTR_ERR(param.blob);
> > +                       goto out_key;
> > +               }
> > +               break;
> 
> This means that a namespace admin (iow, an unprivileged user) can
> allocate 1MB of unswappable kmalloc memory per userspace task, right?
> Using userfaultfd or FUSE, you can then stall the task as long as you
> want while it has that allocation. Is that problematic, or is that
> normal?

That's not exactly the case.  A userspace task can make a temporary
allocation, but unless the filesystem grabs it, it's released again on exit
from the system call.

Note that I should probably use vmalloc() rather than kmalloc(), but that
doesn't really affect your point.  I could also pass the user pointer through
to the filesystem instead - I wanted to avoid that for this interface, but it
makes sense in this instance.

David


* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-29  8:50   ` David Howells
@ 2018-07-29 11:14     ` Jann Horn
  2018-07-30 12:32     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Jann Horn @ 2018-07-29 11:14 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Linux API, Linus Torvalds, linux-fsdevel, kernel list

On Sun, Jul 29, 2018 at 10:50 AM David Howells <dhowells@redhat.com> wrote:
>
> Jann Horn <jannh@google.com> wrote:
>
> > [...]
> > > +       case fsconfig_set_binary:
> > > +               param.type = fs_value_is_blob;
> > > +               param.size = aux;
> > > +               param.blob = memdup_user_nul(_value, aux);
> > > +               if (IS_ERR(param.blob)) {
> > > +                       ret = PTR_ERR(param.blob);
> > > +                       goto out_key;
> > > +               }
> > > +               break;
> >
> > This means that a namespace admin (iow, an unprivileged user) can
> > allocate 1MB of unswappable kmalloc memory per userspace task, right?
> > Using userfaultfd or FUSE, you can then stall the task as long as you
> > want while it has that allocation. Is that problematic, or is that
> > normal?
>
> That's not exactly the case.  A userspace task can make a temporary
> allocation, but unless the filesystem grabs it, it's released again on exit
> from the system call.

That's what I said. Each userspace task can make a 1MB allocation by
calling this syscall, and this temporary allocation stays allocated
until the end of the syscall. But the runtime of the syscall is
unbounded - even just the memdup_user_nul() can stall forever if the
copy_from_user() call inside it faults on e.g. a userfault region or a
memory-mapped file from a FUSE filesystem.
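
To make the lifetime concrete, here is a userspace analogue of the copy step.
dup_blob_nul() and BLOB_MAX are hypothetical names, and the 1MB cap is only
assumed from the figure above; the real code path is memdup_user_nul() in the
kernel.  The buffer exists from the allocation until the caller frees it,
which corresponds to the end of the syscall.

```c
#include <stdlib.h>
#include <string.h>

#define BLOB_MAX (1024 * 1024)	/* assumed cap, from the 1MB figure above */

/*
 * Userspace analogue of the copy step: duplicate len bytes and append a
 * NUL.  Returns NULL if len exceeds the cap or the allocation fails.
 * The caller owns the buffer until it calls free().
 */
static void *dup_blob_nul(const void *src, size_t len)
{
	char *p;

	if (len > BLOB_MAX)
		return NULL;
	p = malloc(len + 1);
	if (!p)
		return NULL;
	memcpy(p, src, len);	/* kernel equivalent may fault and stall here */
	p[len] = '\0';
	return p;
}
```

The point is that the allocation is already held when the copy starts, so
anything that delays the copy also extends how long the memory is pinned.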

> Note that I should probably use vmalloc() rather than kmalloc(), but that
> doesn't really affect your point.  I could also pass the user pointer through
> to the filesystem instead - I wanted to avoid that for this interface, but it
> make sense in this instance.


* Re: [PATCH 29/38] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #10]
  2018-07-29  8:50   ` David Howells
  2018-07-29 11:14     ` Jann Horn
@ 2018-07-30 12:32     ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-30 12:32 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
	kernel list

Jann Horn <jannh@google.com> wrote:

> > > This means that a namespace admin (iow, an unprivileged user) can
> > > allocate 1MB of unswappable kmalloc memory per userspace task, right?
> > > Using userfaultfd or FUSE, you can then stall the task as long as you
> > > want while it has that allocation. Is that problematic, or is that
> > > normal?
> >
> > That's not exactly the case.  A userspace task can make a temporary
> > allocation, but unless the filesystem grabs it, it's released again on exit
> > from the system call.
> 
> That's what I said.

Sorry, I wasn't clear what you meant.  I assumed you were thinking it was then
automatically attached to the context, say:

	fd = fsopen("fuse", 0);
	fsconfig(fd, fsconfig_set_binary, "foo", buffer, size);

> Each userspace task can make a 1MB allocation by calling this syscall, and
> this temporary allocation stays allocated until the end of the syscall. But
> the runtime of the syscall is unbounded - even just the memdup_user_nul()
> can stall forever if the copy_from_user() call inside it faults on e.g. a
> userfault region or a memory-mapped file from a FUSE filesystem.

Okay, I see what you're getting at.  Note that this affects other syscalls
too, keyctl, module loading and read() with readahead for example.  Not sure
what the answer should be.

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (4 preceding siblings ...)
  2018-07-27 23:51   ` David Howells
@ 2018-07-30 14:48   ` David Howells
  2018-07-31  4:16   ` Al Viro
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-30 14:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, Linux FS Devel,
	LKML

Andy Lutomirski <luto@amacapital.net> wrote:

> Constants are almost always all caps.  Is there any reason these are
> lowercase?

It's also easier to create macros that pair structs or functions with the
constants by name.

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (5 preceding siblings ...)
  2018-07-30 14:48   ` David Howells
@ 2018-07-31  4:16   ` Al Viro
  2018-07-31 12:39   ` David Howells
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Al Viro @ 2018-07-31  4:16 UTC (permalink / raw)
  To: David Howells; +Cc: linux-api, torvalds, linux-fsdevel, linux-kernel

On Fri, Jul 27, 2018 at 06:35:10PM +0100, David Howells wrote:
> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	fsinfo_attr_statfs		- statfs-style info
> 	fsinfo_attr_fsinfo		- Information about fsinfo()
> 	fsinfo_attr_ids			- Filesystem IDs
> 	fsinfo_attr_limits		- Filesystem limits
> 	fsinfo_attr_supports		- What's supported in statx(), IOC flags
> 	fsinfo_attr_capabilities	- Filesystem capabilities
> 	fsinfo_attr_timestamp_info	- Inode timestamp info
> 	fsinfo_attr_volume_id		- Volume ID (string)
> 	fsinfo_attr_volume_uuid		- Volume UUID
> 	fsinfo_attr_volume_name		- Volume name (string)
> 	fsinfo_attr_cell_name		- Cell name (string)
> 	fsinfo_attr_domain_name		- Domain name (string)
> 	fsinfo_attr_realm_name		- Realm name (string)
> 	fsinfo_attr_server_name		- Name of the Nth server (string)
> 	fsinfo_attr_server_address	- Mth address of the Nth server
> 	fsinfo_attr_parameter		- Nth mount parameter (string)
> 	fsinfo_attr_source		- Nth mount source name (string)
> 	fsinfo_attr_name_encoding	- Filename encoding (string)
> 	fsinfo_attr_name_codepage	- Filename codepage (string)
> 	fsinfo_attr_io_size		- I/O size hints

Umm...  What's so special about cell/volume/domain/realm?  And
what do we do when a random filesystem gets added - should its
parameters go into catch-all pile (attr_parameter), or should they
get classes of their own?

For Cthulhu sake, who's going to maintain that enum in face of
random out-of-tree filesystems, each wanting a class or two its own?
We'd tried that with device numbers; ask hpa how well has that
worked and how much did he love the whole experience...


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (6 preceding siblings ...)
  2018-07-31  4:16   ` Al Viro
@ 2018-07-31 12:39   ` David Howells
  2018-07-31 13:20   ` David Howells
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-31 12:39 UTC (permalink / raw)
  To: Al Viro; +Cc: dhowells, linux-api, torvalds, linux-fsdevel, linux-kernel

Al Viro <viro@ZenIV.linux.org.uk> wrote:

> Umm...  What's so special about cell/volume/domain/realm?

Nothing particularly.  But they're something various network filesystems might
find useful.  cell for AFS, domain for CIFS, realm for things that use
kerberos.

volume_id/uuid/name would be usable by ext4 too, for example.

> And what do we do when a random filesystem gets added - should its
> parameters go into catch-all pile (attr_parameter),

FSINFO_ATTR_PARAMETER is a way to enumerate the configuration parameters
passed to mount, as an alternative to parsing /proc/mounts.  So, for example,
afs has:

	enum afs_param {
		Opt_autocell,
		Opt_dyn,
		Opt_source,
		nr__afs_params
	};

	static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
		[Opt_autocell]	= { fs_param_takes_no_value },
		[Opt_dyn]	= { fs_param_takes_no_value },
		[Opt_source]	= { fs_param_is_string },
	};

	static const struct constant_table afs_param_keys[] = {
		{ "autocell",	Opt_autocell },
		{ "dyn",	Opt_dyn },
		{ "source",	Opt_source },
	};

My thought is that calling fsinfo(..., "/some/afs/file", &params, ...) with:

	struct fsinfo_params params = {
		.request = FSINFO_ATTR_PARAMETER,
		.Nth	 = <parameter-number>,
	};

would get you back, for example:

	Nth	Result
	=======	==========================================
	0	"autocell" (or "" if not set)
	1	"dyn" (or "" if not set)
	2	"source=%#grand.central.org:root.cell."
	3+	-ENODATA (ie. there are no more)

where Nth corresponds to the parameter specified by
FSINFO_ATTR_PARAM_DESCRIPTION and Nth.
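
The enumeration pattern from userspace could look roughly like the sketch
below.  Since fsinfo() is only proposed in this series, mock_get_param() is a
hypothetical stand-in that returns the afs strings from the table above; only
the loop shape (increment Nth until -ENODATA) is the point.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mock standing in for an fsinfo() parameter query. */
static const char *const mock_params[] = {
	"autocell",
	"",		/* "dyn" not set */
	"source=%#grand.central.org:root.cell.",
};

/* Returns the full string length, or -ENODATA past the last parameter. */
static int mock_get_param(unsigned int nth, char *buf, size_t size)
{
	if (nth >= sizeof(mock_params) / sizeof(mock_params[0]))
		return -ENODATA;
	if (buf && size > 0)
		snprintf(buf, size, "%s", mock_params[nth]);
	return (int)strlen(mock_params[nth]);
}

/* Walk Nth = 0, 1, ... until -ENODATA; returns how many slots were seen. */
static unsigned int count_params(void)
{
	char buf[64];
	unsigned int nth = 0;

	while (mock_get_param(nth, buf, sizeof(buf)) != -ENODATA)
		nth++;
	return nth;
}
```

Note that an unset parameter still occupies a slot and comes back as an empty
string, so the caller cannot stop at the first "".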

Now for some filesystems, cgroups-v1 for example, there are parameters beyond
the list (the subsystem name) and these can be listed after the predefined
parameters, eg.:

	Nth	Result
	=======	==========================================
	0	"all" or ""
	1	"clone_children" or ""
	2	"cpuset_v2_mode" or ""
	3	"name" or ""
	4	"none" or ""
	5	"noprefix" or ""
	6	"release_agent" or ""
	7	"xattr" or ""
	8	"<subsys0>" or ""
	9	"<subsys1>" or ""
	10	"<subsys2>" or ""
	...	-ENODATA

> or should they get classes of their own?

Yes.

> For Cthulhu sake, who's going to maintain that enum in face of
> random out-of-tree filesystems, each wanting a class or two its own?

They don't get their own numbers unless they're in-tree.  Full stop.  We have
the same issue with system calls and not-yet-upstream new syscalls.

Note that, as I have the code now, the "type" of return value for each
attribute must also be declared to the fsinfo() core, and the fsinfo core does
the copy to/from userspace.

> We'd tried that with device numbers; ask hpa how well has that
> worked and how much did he love the whole experience...

What would you do instead?  I would prefer to avoid using text strings as keys
because then I need a big lookup table, and possibly this gets devolved to
each filesystem to handle - which ends up even more of a mess because then
there's nothing to hold consistency.

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (7 preceding siblings ...)
  2018-07-31 12:39   ` David Howells
@ 2018-07-31 13:20   ` David Howells
  2018-07-31 23:49   ` Darrick J. Wong
  2018-08-01  1:07   ` David Howells
  10 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-07-31 13:20 UTC (permalink / raw)
  To: Jann Horn
  Cc: dhowells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
	kernel list

Jann Horn <jannh@google.com> wrote:

> > +       strcpy(p->f_fs_name, dentry->d_sb->s_type->name);
> 
> Can you use strlcpy() instead? From a quick look, I don't see anything
> that actually limits the size of filesystem names, even though
> everything in-kernel probably fits into the 16 bytes you've allocated
> for the name.

Sure.  Should I increase the field size to 32, I wonder?

David


* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (8 preceding siblings ...)
  2018-07-31 13:20   ` David Howells
@ 2018-07-31 23:49   ` Darrick J. Wong
  2018-08-01  1:07   ` David Howells
  10 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-07-31 23:49 UTC (permalink / raw)
  To: David Howells; +Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel

On Fri, Jul 27, 2018 at 06:35:10PM +0100, David Howells wrote:
> Add a system call to allow filesystem information to be queried.  A request
> value can be given to indicate the desired attribute.  Support is provided
> for enumerating multi-value attributes.
> 
> ===============
> NEW SYSTEM CALL
> ===============
> 
> The new system call looks like:
> 
> 	int ret = fsinfo(int dfd,
> 			 const char *filename,
> 			 const struct fsinfo_params *params,
> 			 void *buffer,
> 			 size_t buf_size);
> 
> The params parameter optionally points to a block of parameters:
> 
> 	struct fsinfo_params {
> 		__u32	at_flags;
> 		__u32	request;
> 		__u32	Nth;
> 		__u32	Mth;
> 		__u32	__reserved[6];
> 	};
> 
> If params is NULL, it is assumed params->request should be
> fsinfo_attr_statfs, params->Nth should be 0, params->Mth should be 0 and
> params->at_flags should be 0.
> 
> If params is given, all of params->__reserved[] must be 0.
> 
> dfd, filename and params->at_flags indicate the file to query.  There is no
> equivalent of lstat() as that can be emulated with fsinfo() by setting
> AT_SYMLINK_NOFOLLOW in params->at_flags.  There is also no equivalent of
> fstat() as that can be emulated by passing a NULL filename to fsinfo() with
> the fd of interest in dfd.  AT_NO_AUTOMOUNT can also be used to allow an
> automount point to be queried without triggering it.
> 
> params->request indicates the attribute/attributes to be queried.  This can
> be one of:
> 
> 	fsinfo_attr_statfs		- statfs-style info
> 	fsinfo_attr_fsinfo		- Information about fsinfo()
> 	fsinfo_attr_ids			- Filesystem IDs
> 	fsinfo_attr_limits		- Filesystem limits
> 	fsinfo_attr_supports		- What's supported in statx(), IOC flags
> 	fsinfo_attr_capabilities	- Filesystem capabilities
> 	fsinfo_attr_timestamp_info	- Inode timestamp info
> 	fsinfo_attr_volume_id		- Volume ID (string)
> 	fsinfo_attr_volume_uuid		- Volume UUID
> 	fsinfo_attr_volume_name		- Volume name (string)
> 	fsinfo_attr_cell_name		- Cell name (string)
> 	fsinfo_attr_domain_name		- Domain name (string)
> 	fsinfo_attr_realm_name		- Realm name (string)
> 	fsinfo_attr_server_name		- Name of the Nth server (string)
> 	fsinfo_attr_server_address	- Mth address of the Nth server
> 	fsinfo_attr_parameter		- Nth mount parameter (string)
> 	fsinfo_attr_source		- Nth mount source name (string)
> 	fsinfo_attr_name_encoding	- Filename encoding (string)
> 	fsinfo_attr_name_codepage	- Filename codepage (string)
> 	fsinfo_attr_io_size		- I/O size hints
> 
> Some attributes (such as the servers backing a network filesystem) can have
> multiple values.  These can be enumerated by setting params->Nth and
> params->Mth to 0, 1, ... until ENODATA is returned.
> 
> buffer and buf_size point to the reply buffer.  The buffer is filled up to
> the specified size, even if this means truncating the reply.  The full size
> of the reply is returned.  In future versions, this will allow extra fields
> to be tacked on to the end of the reply, but anyone not expecting them will
> only get the subset they're expecting.  If either buffer or buf_size is 0,
> no copy will take place and the data size will be returned.
> 
> At the moment, this will only work on x86_64 and i386 as it requires the
> system call to be wired up.
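
The reply-buffer contract quoted above (fill up to buf_size, truncate if
needed, always return the full reply size) can be sketched as follows.
copy_reply() is a hypothetical helper written for illustration, not the
patch's code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy at most buf_size bytes of a reply into buffer and return the full
 * reply size.  If buffer is NULL or buf_size is 0, nothing is copied and
 * only the size is reported.  Hypothetical helper, not the patch's code.
 */
static size_t copy_reply(void *buffer, size_t buf_size,
			 const void *reply, size_t reply_size)
{
	if (buffer && buf_size > 0)
		memcpy(buffer, reply,
		       buf_size < reply_size ? buf_size : reply_size);
	return reply_size;
}
```

A caller can therefore probe with a NULL buffer to learn the size, or pass an
older, shorter struct and simply not see fields appended later.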

<snip> I only have time today to review the user interface bits...

> diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
> new file mode 100644
> index 000000000000..abcf414dd3be
> --- /dev/null
> +++ b/include/uapi/linux/fsinfo.h
> @@ -0,0 +1,234 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* fsinfo() definitions.
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +#ifndef _UAPI_LINUX_FSINFO_H
> +#define _UAPI_LINUX_FSINFO_H
> +
> +#include <linux/types.h>
> +#include <linux/socket.h>
> +
> +/*
> + * The filesystem attributes that can be requested.  Note that some attributes
> + * may have multiple instances which can be switched in the parameter block.
> + */
> +enum fsinfo_attribute {
> +	fsinfo_attr_statfs		= 0,	/* statfs()-style state */
> +	fsinfo_attr_fsinfo		= 1,	/* Information about fsinfo() */
> +	fsinfo_attr_ids			= 2,	/* Filesystem IDs */
> +	fsinfo_attr_limits		= 3,	/* Filesystem limits */
> +	fsinfo_attr_supports		= 4,	/* What's supported in statx, iocflags, ... */
> +	fsinfo_attr_capabilities	= 5,	/* Filesystem capabilities (bits) */
> +	fsinfo_attr_timestamp_info	= 6,	/* Inode timestamp info */
> +	fsinfo_attr_volume_id		= 7,	/* Volume ID (string) */
> +	fsinfo_attr_volume_uuid		= 8,	/* Volume UUID (LE uuid) */
> +	fsinfo_attr_volume_name		= 9,	/* Volume name (string) */

What's the difference between a volume name and a volume string?

XFS has a uuid and a label that can be set by userspace (sort of);
should we return the label for volume_id and volume_name?

Hmmm, I see that the default implementations set volume_id from s_id,
and s_id (for block device filesystems anyway) tends to be the device, I
guess?

So if blkid told me that:
/dev/sda1: LABEL="music" UUID="8d9e5b1e-a094-49e5-a179-6d94f7fd8399" TYPE="xfs"

volume_id == sda1, volume_uuid == 8d9e5b1e-a094-49e5-a179-6d94f7fd8399,
and volume_name == "music" ?

> +	fsinfo_attr_cell_name		= 10,	/* Cell name (string) */
> +	fsinfo_attr_domain_name		= 11,	/* Domain name (string) */
> +	fsinfo_attr_realm_name		= 12,	/* Realm name (string) */
> +	fsinfo_attr_server_name		= 13,	/* Name of the Nth server */
> +	fsinfo_attr_server_address	= 14,	/* Mth address of the Nth server */
> +	fsinfo_attr_parameter		= 15,	/* Nth mount parameter (string) */
> +	fsinfo_attr_source		= 16,	/* Nth mount source name (string) */

Hmm, so I guess external log devices and realtime device(s) go here?

> +	fsinfo_attr_name_encoding	= 17,	/* Filename encoding (string) */
> +	fsinfo_attr_name_codepage	= 18,	/* Filename codepage (string) */
> +	fsinfo_attr_io_size		= 19,	/* Optimal I/O sizes */

Are we tied to this enum forever, or do you plan to split up the number
space to allow filesystems to define their own attributes without having
to add them here?

For example, say you let the upper 8 bits be some sort of per-fs code
(like how _IO{,R,W} work) and the lower 24 bits can be the subcommand.
0x00 would be the generic space; XFS could (say) reserve 0x58000000 -
0x58ffffff for XFS (0x58 is the prefix code used for xfs ioctls).  If
there ever are subdivisions of the number space it might be nice to have
fsinfo_fsinfo return prefix number of the fs-specific subcommands, and
how many fs-specific subcommands there are.
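
A rough sketch of the split I have in mind, with made-up macro and function
names (nothing here exists in the patch); only the 8/24 bit layout and the
0x58 XFS prefix come from the paragraph above:

```c
#include <stdint.h>

/* Hypothetical layout: upper 8 bits = per-fs prefix, lower 24 = subcommand. */
#define FSINFO_PREFIX_SHIFT	24
#define FSINFO_SUBCMD_MASK	0x00ffffffu
#define FSINFO_PREFIX_GENERIC	0x00u
#define FSINFO_PREFIX_XFS	0x58u	/* prefix code used for xfs ioctls */

/* Build an attribute number from a prefix and a subcommand. */
static inline uint32_t fsinfo_make_attr(uint32_t prefix, uint32_t subcmd)
{
	return ((prefix & 0xffu) << FSINFO_PREFIX_SHIFT) |
	       (subcmd & FSINFO_SUBCMD_MASK);
}

/* Extract the per-filesystem prefix (0 means the generic space). */
static inline uint32_t fsinfo_attr_prefix(uint32_t attr)
{
	return attr >> FSINFO_PREFIX_SHIFT;
}

/* Extract the subcommand within that prefix. */
static inline uint32_t fsinfo_attr_subcmd(uint32_t attr)
{
	return attr & FSINFO_SUBCMD_MASK;
}
```

With this layout the existing generic attributes keep their current values,
since prefix 0x00 leaves the low 24 bits unchanged.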

I mean, I guess each fs' ->fsinfo function can do that privately but I
suggest having some mechanism in mind to handle these things.  XFS's
geometry ioctl structure is nearly out of space and (some day soon) we
will have to expand and maybe we can use fsinfo instead.

> +	fsinfo_attr__nr
> +};
> +
> +/*
> + * Optional fsinfo() parameter structure.
> + *
> + * If this is not given, it is assumed that fsinfo_attr_statfs instance 0,0 is
> + * desired.
> + */
> +struct fsinfo_params {
> +	__u32	at_flags;	/* AT_SYMLINK_NOFOLLOW and similar flags */
> +	__u32	request;	/* What is being asked for (enum fsinfo_attribute) */
> +	__u32	Nth;		/* Instance of it (some may have multiple) */
> +	__u32	Mth;		/* Subinstance of Nth instance */
> +	__u32	__reserved[6];	/* Reserved params; all must be 0 */
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_statfs).
> + * - This gives extended filesystem information.
> + */
> +struct fsinfo_statfs {
> +	__u64	f_blocks;	/* Total number of blocks in fs */
> +	__u64	f_bfree;	/* Total number of free blocks */
> +	__u64	f_bavail;	/* Number of free blocks available to ordinary user */
> +	__u64	f_files;	/* Total number of file nodes in fs */
> +	__u64	f_ffree;	/* Number of free file nodes */
> +	__u64	f_favail;	/* Number of free file nodes available to ordinary user */
> +	__u32	f_bsize;	/* Optimal block size */
> +	__u32	f_frsize;	/* Fragment size */
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_ids).
> + *
> + * List of basic identifiers as is normally found in statfs().
> + */
> +struct fsinfo_ids {
> +	char	f_fs_name[15 + 1];
> +	__u64	f_flags;	/* Filesystem mount flags (MS_*) */
> +	__u64	f_fsid;		/* Short 64-bit Filesystem ID (as statfs) */
> +	__u64	f_sb_id;	/* Internal superblock ID for sbnotify()/mntnotify() */
> +	__u32	f_fstype;	/* Filesystem type from linux/magic.h [uncond] */
> +	__u32	f_dev_major;	/* As st_dev_* from struct statx [uncond] */
> +	__u32	f_dev_minor;
> +};

This structure doesn't end on a 64-bit boundary and may cause padding
problems...
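
To spell out the concern, here is a userspace mirror of the struct
(ids_mirror is a made-up name, with <stdint.h> types substituted for the
__u32/__u64 ones).  The named fields add up to 52 bytes; whether the compiler
then appends 4 bytes of tail padding depends on how the ABI aligns 64-bit
members, which I'd expect to differ between x86_64 and i386.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the quoted struct fsinfo_ids (hypothetical name). */
struct ids_mirror {
	char		f_fs_name[15 + 1];
	uint64_t	f_flags;
	uint64_t	f_fsid;
	uint64_t	f_sb_id;
	uint32_t	f_fstype;
	uint32_t	f_dev_major;
	uint32_t	f_dev_minor;
};

/* Bytes of named fields: 16 + 3*8 + 3*4 = 52, not a multiple of 8. */
static size_t ids_payload_size(void)
{
	return offsetof(struct ids_mirror, f_dev_minor) + sizeof(uint32_t);
}

/* Tail padding appended by the compiler; depends on the ABI's alignment. */
static size_t ids_tail_padding(void)
{
	return sizeof(struct ids_mirror) - ids_payload_size();
}
```

An explicit trailing __u32 reserved field would make the size the same
everywhere.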

> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_limits).
> + *
> + * List of supported filesystem limits.
> + */
> +struct fsinfo_limits {
> +	__u64	max_file_size;			/* Maximum file size */
> +	__u64	max_uid;			/* Maximum UID supported */
> +	__u64	max_gid;			/* Maximum GID supported */
> +	__u64	max_projid;			/* Maximum project ID supported */
> +	__u32	max_dev_major;			/* Maximum device major representable */
> +	__u32	max_dev_minor;			/* Maximum device minor representable */
> +	__u32	max_hard_links;			/* Maximum number of hard links on a file */
> +	__u32	max_xattr_body_len;		/* Maximum xattr content length */
> +	__u32	max_xattr_name_len;		/* Maximum xattr name length */
> +	__u32	max_filename_len;		/* Maximum filename length */
> +	__u32	max_symlink_len;		/* Maximum symlink content length */
> +	__u32	__reserved[1];

Maximum inode number possible, for filesystems that can allocate inodes
dynamically?

Granted, XFS will probably only ever advertise "0xffffffffffffffff"...

> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_supports).
> + *
> + * What's supported in various masks, such as statx() attribute and mask bits
> + * and IOC flags.
> + */
> +struct fsinfo_supports {
> +	__u64	stx_attributes;		/* What statx::stx_attributes are supported */
> +	__u32	stx_mask;		/* What statx::stx_mask bits are supported */
> +	__u32	ioc_flags;		/* What FS_IOC_* flags are supported */
> +	__u32	win_file_attrs;		/* What DOS/Windows FILE_* attributes are supported */
> +	__u32	__reserved[1];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_capabilities).
> + *
> + * Bitmask indicating filesystem capabilities where renderable as single bits.
> + */
> +enum fsinfo_capability {
> +	fsinfo_cap_is_kernel_fs		= 0,	/* fs is kernel-special filesystem */
> +	fsinfo_cap_is_block_fs		= 1,	/* fs is block-based filesystem */
> +	fsinfo_cap_is_flash_fs		= 2,	/* fs is flash filesystem */
> +	fsinfo_cap_is_network_fs	= 3,	/* fs is network filesystem */
> +	fsinfo_cap_is_automounter_fs	= 4,	/* fs is automounter special filesystem */
> +	fsinfo_cap_automounts		= 5,	/* fs supports automounts */
> +	fsinfo_cap_adv_locks		= 6,	/* fs supports advisory file locking */
> +	fsinfo_cap_mand_locks		= 7,	/* fs supports mandatory file locking */
> +	fsinfo_cap_leases		= 8,	/* fs supports file leases */
> +	fsinfo_cap_uids			= 9,	/* fs supports numeric uids */
> +	fsinfo_cap_gids			= 10,	/* fs supports numeric gids */
> +	fsinfo_cap_projids		= 11,	/* fs supports numeric project ids */
> +	fsinfo_cap_id_names		= 12,	/* fs supports user names */
> +	fsinfo_cap_id_guids		= 13,	/* fs supports user guids */
> +	fsinfo_cap_windows_attrs	= 14,	/* fs has windows attributes */
> +	fsinfo_cap_user_quotas		= 15,	/* fs has per-user quotas */
> +	fsinfo_cap_group_quotas		= 16,	/* fs has per-group quotas */
> +	fsinfo_cap_project_quotas	= 17,	/* fs has per-project quotas */
> +	fsinfo_cap_xattrs		= 18,	/* fs has xattrs */
> +	fsinfo_cap_journal		= 19,	/* fs has a journal */
> +	fsinfo_cap_data_is_journalled	= 20,	/* fs is using data journalling */
> +	fsinfo_cap_o_sync		= 21,	/* fs supports O_SYNC */
> +	fsinfo_cap_o_direct		= 22,	/* fs supports O_DIRECT */
> +	fsinfo_cap_volume_id		= 23,	/* fs has a volume ID */
> +	fsinfo_cap_volume_uuid		= 24,	/* fs has a volume UUID */
> +	fsinfo_cap_volume_name		= 25,	/* fs has a volume name */
> +	fsinfo_cap_volume_fsid		= 26,	/* fs has a volume FSID */
> +	fsinfo_cap_cell_name		= 27,	/* fs has a cell name */
> +	fsinfo_cap_domain_name		= 28,	/* fs has a domain name */
> +	fsinfo_cap_realm_name		= 29,	/* fs has a realm name */
> +	fsinfo_cap_iver_all_change	= 30,	/* i_version represents data + meta changes */
> +	fsinfo_cap_iver_data_change	= 31,	/* i_version represents data changes only */
> +	fsinfo_cap_iver_mono_incr	= 32,	/* i_version incremented monotonically */
> +	fsinfo_cap_symlinks		= 33,	/* fs supports symlinks */
> +	fsinfo_cap_hard_links		= 34,	/* fs supports hard links */
> +	fsinfo_cap_hard_links_1dir	= 35,	/* fs supports hard links in same dir only */
> +	fsinfo_cap_device_files		= 36,	/* fs supports bdev, cdev */
> +	fsinfo_cap_unix_specials	= 37,	/* fs supports pipe, fifo, socket */
> +	fsinfo_cap_resource_forks	= 38,	/* fs supports resource forks/streams */
> +	fsinfo_cap_name_case_indep	= 39,	/* Filename case independence is mandatory */
> +	fsinfo_cap_name_non_utf8	= 40,	/* fs has non-utf8 names */
> +	fsinfo_cap_name_has_codepage	= 41,	/* fs has a filename codepage */
> +	fsinfo_cap_sparse		= 42,	/* fs supports sparse files */
> +	fsinfo_cap_not_persistent	= 43,	/* fs is not persistent */
> +	fsinfo_cap_no_unix_mode		= 44,	/* fs does not support unix mode bits */
> +	fsinfo_cap_has_atime		= 45,	/* fs supports access time */
> +	fsinfo_cap_has_btime		= 46,	/* fs supports birth/creation time */
> +	fsinfo_cap_has_ctime		= 47,	/* fs supports change time */
> +	fsinfo_cap_has_mtime		= 48,	/* fs supports modification time */
> +	fsinfo_cap__nr
> +};
> +
> +struct fsinfo_capabilities {
> +	__u8	capabilities[(fsinfo_cap__nr + 7) / 8];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_timestamp_info).
> + */
> +struct fsinfo_timestamp_info {
> +	__s64	minimum_timestamp;	/* Minimum timestamp value in seconds */
> +	__s64	maximum_timestamp;	/* Maximum timestamp value in seconds */
> +	__u16	atime_gran_mantissa;	/* Granularity(secs) = mant * 10^exp */
> +	__u16	btime_gran_mantissa;
> +	__u16	ctime_gran_mantissa;
> +	__u16	mtime_gran_mantissa;
> +	__s8	atime_gran_exponent;
> +	__s8	btime_gran_exponent;
> +	__s8	ctime_gran_exponent;
> +	__s8	mtime_gran_exponent;
> +	__u32	__reserved[1];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_volume_uuid).
> + */
> +struct fsinfo_volume_uuid {
> +	__u8	uuid[16];
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_server_addresses).
> + *
> + * Find the Mth address of the Nth server for a network mount.
> + */
> +struct fsinfo_server_address {
> +	struct __kernel_sockaddr_storage address;
> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_io_size).
> + *
> + * Retrieve I/O size hints for a filesystem.
> + */
> +struct fsinfo_io_size {
> +	__u32		dio_size_gran;	/* Size granularity for O_DIRECT */
> +	__u32		dio_mem_align;	/* Memory alignment for O_DIRECT */

max io size too?

64-bit too, in case we ever get that insane?

--D

> +};
> +
> +/*
> + * Information struct for fsinfo(fsinfo_attr_fsinfo).
> + *
> + * This gives information about fsinfo() itself.
> + */
> +struct fsinfo_fsinfo {
> +	__u32	max_attr;	/* Number of supported attributes (fsinfo_attr__nr) */
> +	__u32	max_cap;	/* Number of supported capabilities (fsinfo_cap__nr) */
> +};
> +
> +#endif /* _UAPI_LINUX_FSINFO_H */
> diff --git a/samples/statx/Makefile b/samples/statx/Makefile
> index 59df7c25a9d1..9cb9a88e3a10 100644
> --- a/samples/statx/Makefile
> +++ b/samples/statx/Makefile
> @@ -1,7 +1,10 @@
>  # List of programs to build
> -hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx
> +hostprogs-$(CONFIG_SAMPLE_STATX) := test-statx test-fsinfo
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
>  
>  HOSTCFLAGS_test-statx.o += -I$(objtree)/usr/include
> +
> +HOSTCFLAGS_test-fsinfo.o += -I$(objtree)/usr/include
> +HOSTLOADLIBES_test-fsinfo += -lm
> diff --git a/samples/statx/test-fsinfo.c b/samples/statx/test-fsinfo.c
> new file mode 100644
> index 000000000000..deab0081ecd1
> --- /dev/null
> +++ b/samples/statx/test-fsinfo.c
> @@ -0,0 +1,539 @@
> +/* Test the fsinfo() system call
> + *
> + * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define _GNU_SOURCE
> +#define _ATFILE_SOURCE
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <ctype.h>
> +#include <errno.h>
> +#include <time.h>
> +#include <math.h>
> +#include <fcntl.h>
> +#include <sys/syscall.h>
> +#include <linux/fsinfo.h>
> +#include <linux/socket.h>
> +#include <sys/stat.h>
> +
> +static __attribute__((unused))
> +ssize_t fsinfo(int dfd, const char *filename, struct fsinfo_params *params,
> +	       void *buffer, size_t buf_size)
> +{
> +	return syscall(__NR_fsinfo, dfd, filename, params, buffer, buf_size);
> +}
> +
> +#define FSINFO_STRING(N)	 [fsinfo_attr_##N] = 0x00
> +#define FSINFO_STRUCT(N)	 [fsinfo_attr_##N] = sizeof(struct fsinfo_##N)/sizeof(__u32)
> +#define FSINFO_STRING_N(N)	 [fsinfo_attr_##N] = 0x40
> +#define FSINFO_STRUCT_N(N)	 [fsinfo_attr_##N] = 0x40 | sizeof(struct fsinfo_##N)/sizeof(__u32)
> +#define FSINFO_STRUCT_NM(N)	 [fsinfo_attr_##N] = 0x80 | sizeof(struct fsinfo_##N)/sizeof(__u32)
> +static const __u8 fsinfo_buffer_sizes[fsinfo_attr__nr] = {
> +	FSINFO_STRUCT		(statfs),
> +	FSINFO_STRUCT		(fsinfo),
> +	FSINFO_STRUCT		(ids),
> +	FSINFO_STRUCT		(limits),
> +	FSINFO_STRUCT		(supports),
> +	FSINFO_STRUCT		(capabilities),
> +	FSINFO_STRUCT		(timestamp_info),
> +	FSINFO_STRING		(volume_id),
> +	FSINFO_STRUCT		(volume_uuid),
> +	FSINFO_STRING		(volume_name),
> +	FSINFO_STRING		(cell_name),
> +	FSINFO_STRING		(domain_name),
> +	FSINFO_STRING		(realm_name),
> +	FSINFO_STRING_N		(server_name),
> +	FSINFO_STRUCT_NM	(server_address),
> +	FSINFO_STRING_N		(parameter),
> +	FSINFO_STRING_N		(source),
> +	FSINFO_STRING		(name_encoding),
> +	FSINFO_STRING		(name_codepage),
> +	FSINFO_STRUCT		(io_size),
> +};
> +
> +#define FSINFO_NAME(N) [fsinfo_attr_##N] = #N
> +static const char *fsinfo_attr_names[fsinfo_attr__nr] = {
> +	FSINFO_NAME(statfs),
> +	FSINFO_NAME(fsinfo),
> +	FSINFO_NAME(ids),
> +	FSINFO_NAME(limits),
> +	FSINFO_NAME(supports),
> +	FSINFO_NAME(capabilities),
> +	FSINFO_NAME(timestamp_info),
> +	FSINFO_NAME(volume_id),
> +	FSINFO_NAME(volume_uuid),
> +	FSINFO_NAME(volume_name),
> +	FSINFO_NAME(cell_name),
> +	FSINFO_NAME(domain_name),
> +	FSINFO_NAME(realm_name),
> +	FSINFO_NAME(server_name),
> +	FSINFO_NAME(server_address),
> +	FSINFO_NAME(parameter),
> +	FSINFO_NAME(source),
> +	FSINFO_NAME(name_encoding),
> +	FSINFO_NAME(name_codepage),
> +	FSINFO_NAME(io_size),
> +};
> +
> +union reply {
> +	char buffer[4096];
> +	struct fsinfo_statfs statfs;
> +	struct fsinfo_fsinfo fsinfo;
> +	struct fsinfo_ids ids;
> +	struct fsinfo_limits limits;
> +	struct fsinfo_supports supports;
> +	struct fsinfo_capabilities caps;
> +	struct fsinfo_timestamp_info timestamps;
> +	struct fsinfo_volume_uuid uuid;
> +	struct fsinfo_server_address srv_addr;
> +	struct fsinfo_io_size io_size;
> +};
> +
> +static void dump_hex(unsigned int *data, int from, int to)
> +{
> +	unsigned offset, print_offset = 1, col = 0;
> +
> +	from /= 4;
> +	to = (to + 3) / 4;
> +
> +	for (offset = from; offset < to; offset++) {
> +		if (print_offset) {
> +			printf("%04x: ", offset * 4);
> +			print_offset = 0;
> +		}
> +		printf("%08x", data[offset]);
> +		col++;
> +		if ((col & 3) == 0) {
> +			printf("\n");
> +			print_offset = 1;
> +		} else {
> +			printf(" ");
> +		}
> +	}
> +
> +	if (!print_offset)
> +		printf("\n");
> +}
> +
> +static void dump_attr_statfs(union reply *r, int size)
> +{
> +	struct fsinfo_statfs *f = &r->statfs;
> +
> +	printf("\n");
> +	printf("\tblocks: n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_blocks,
> +	       (unsigned long long)f->f_bfree,
> +	       (unsigned long long)f->f_bavail);
> +
> +	printf("\tfiles : n=%llu fr=%llu av=%llu\n",
> +	       (unsigned long long)f->f_files,
> +	       (unsigned long long)f->f_ffree,
> +	       (unsigned long long)f->f_favail);
> +	printf("\tbsize : %u\n", f->f_bsize);
> +	printf("\tfrsize: %u\n", f->f_frsize);
> +}
> +
> +static void dump_attr_fsinfo(union reply *r, int size)
> +{
> +	struct fsinfo_fsinfo *f = &r->fsinfo;
> +
> +	printf("max_attr=%u max_cap=%u\n", f->max_attr, f->max_cap);
> +}
> +
> +static void dump_attr_ids(union reply *r, int size)
> +{
> +	struct fsinfo_ids *f = &r->ids;
> +
> +	printf("\n");
> +	printf("\tdev   : %02x:%02x\n", f->f_dev_major, f->f_dev_minor);
> +	printf("\tfs    : type=%x name=%s\n", f->f_fstype, f->f_fs_name);
> +	printf("\tflags : %llx\n", (unsigned long long)f->f_flags);
> +	printf("\tfsid  : %llx\n", (unsigned long long)f->f_fsid);
> +}
> +
> +static void dump_attr_limits(union reply *r, int size)
> +{
> +	struct fsinfo_limits *f = &r->limits;
> +
> +	printf("\n");
> +	printf("\tmax file size: %llx\n", f->max_file_size);
> +	printf("\tmax ids      : u=%llx g=%llx p=%llx\n",
> +	       f->max_uid, f->max_gid, f->max_projid);
> +	printf("\tmax dev      : maj=%x min=%x\n",
> +	       f->max_dev_major, f->max_dev_minor);
> +	printf("\tmax links    : %x\n", f->max_hard_links);
> +	printf("\tmax xattr    : n=%x b=%x\n",
> +	       f->max_xattr_name_len, f->max_xattr_body_len);
> +	printf("\tmax len      : file=%x sym=%x\n",
> +	       f->max_filename_len, f->max_symlink_len);
> +}
> +
> +static void dump_attr_supports(union reply *r, int size)
> +{
> +	struct fsinfo_supports *f = &r->supports;
> +
> +	printf("\n");
> +	printf("\tstx_attr=%llx\n", f->stx_attributes);
> +	printf("\tstx_mask=%x\n", f->stx_mask);
> +	printf("\tioc_flags=%x\n", f->ioc_flags);
> +	printf("\twin_fattrs=%x\n", f->win_file_attrs);
> +}
> +
> +#define FSINFO_CAP_NAME(C) [fsinfo_cap_##C] = #C
> +static const char *fsinfo_cap_names[fsinfo_cap__nr] = {
> +	FSINFO_CAP_NAME(is_kernel_fs),
> +	FSINFO_CAP_NAME(is_block_fs),
> +	FSINFO_CAP_NAME(is_flash_fs),
> +	FSINFO_CAP_NAME(is_network_fs),
> +	FSINFO_CAP_NAME(is_automounter_fs),
> +	FSINFO_CAP_NAME(automounts),
> +	FSINFO_CAP_NAME(adv_locks),
> +	FSINFO_CAP_NAME(mand_locks),
> +	FSINFO_CAP_NAME(leases),
> +	FSINFO_CAP_NAME(uids),
> +	FSINFO_CAP_NAME(gids),
> +	FSINFO_CAP_NAME(projids),
> +	FSINFO_CAP_NAME(id_names),
> +	FSINFO_CAP_NAME(id_guids),
> +	FSINFO_CAP_NAME(windows_attrs),
> +	FSINFO_CAP_NAME(user_quotas),
> +	FSINFO_CAP_NAME(group_quotas),
> +	FSINFO_CAP_NAME(project_quotas),
> +	FSINFO_CAP_NAME(xattrs),
> +	FSINFO_CAP_NAME(journal),
> +	FSINFO_CAP_NAME(data_is_journalled),
> +	FSINFO_CAP_NAME(o_sync),
> +	FSINFO_CAP_NAME(o_direct),
> +	FSINFO_CAP_NAME(volume_id),
> +	FSINFO_CAP_NAME(volume_uuid),
> +	FSINFO_CAP_NAME(volume_name),
> +	FSINFO_CAP_NAME(volume_fsid),
> +	FSINFO_CAP_NAME(cell_name),
> +	FSINFO_CAP_NAME(domain_name),
> +	FSINFO_CAP_NAME(realm_name),
> +	FSINFO_CAP_NAME(iver_all_change),
> +	FSINFO_CAP_NAME(iver_data_change),
> +	FSINFO_CAP_NAME(iver_mono_incr),
> +	FSINFO_CAP_NAME(symlinks),
> +	FSINFO_CAP_NAME(hard_links),
> +	FSINFO_CAP_NAME(hard_links_1dir),
> +	FSINFO_CAP_NAME(device_files),
> +	FSINFO_CAP_NAME(unix_specials),
> +	FSINFO_CAP_NAME(resource_forks),
> +	FSINFO_CAP_NAME(name_case_indep),
> +	FSINFO_CAP_NAME(name_non_utf8),
> +	FSINFO_CAP_NAME(name_has_codepage),
> +	FSINFO_CAP_NAME(sparse),
> +	FSINFO_CAP_NAME(not_persistent),
> +	FSINFO_CAP_NAME(no_unix_mode),
> +	FSINFO_CAP_NAME(has_atime),
> +	FSINFO_CAP_NAME(has_btime),
> +	FSINFO_CAP_NAME(has_ctime),
> +	FSINFO_CAP_NAME(has_mtime),
> +};
> +
> +static void dump_attr_capabilities(union reply *r, int size)
> +{
> +	struct fsinfo_capabilities *f = &r->caps;
> +	int i;
> +
> +	for (i = 0; i < sizeof(f->capabilities); i++)
> +		printf("%02x", f->capabilities[i]);
> +	printf("\n");
> +	for (i = 0; i < fsinfo_cap__nr; i++)
> +		if (f->capabilities[i / 8] & (1 << (i % 8)))
> +			printf("\t- %s\n", fsinfo_cap_names[i]);
> +}
> +
> +static void dump_attr_timestamp_info(union reply *r, int size)
> +{
> +	struct fsinfo_timestamp_info *f = &r->timestamps;
> +
> +	printf("range=%llx-%llx\n",
> +	       (unsigned long long)f->minimum_timestamp,
> +	       (unsigned long long)f->maximum_timestamp);
> +
> +#define print_time(G) \
> +	printf("\t"#G"time : gran=%gs\n",			\
> +	       (f->G##time_gran_mantissa *		\
> +		pow(10., f->G##time_gran_exponent)))
> +	print_time(a);
> +	print_time(b);
> +	print_time(c);
> +	print_time(m);
> +}
> +
> +static void dump_attr_volume_uuid(union reply *r, int size)
> +{
> +	struct fsinfo_volume_uuid *f = &r->uuid;
> +
> +	printf("%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x"
> +	       "-%02x%02x%02x%02x%02x%02x\n",
> +	       f->uuid[ 0], f->uuid[ 1],
> +	       f->uuid[ 2], f->uuid[ 3],
> +	       f->uuid[ 4], f->uuid[ 5],
> +	       f->uuid[ 6], f->uuid[ 7],
> +	       f->uuid[ 8], f->uuid[ 9],
> +	       f->uuid[10], f->uuid[11],
> +	       f->uuid[12], f->uuid[13],
> +	       f->uuid[14], f->uuid[15]);
> +}
> +
> +static void dump_attr_server_address(union reply *r, int size)
> +{
> +	struct fsinfo_server_address *f = &r->srv_addr;
> +
> +	printf("family=%u\n", f->address.ss_family);
> +}
> +
> +static void dump_attr_io_size(union reply *r, int size)
> +{
> +	struct fsinfo_io_size *f = &r->io_size;
> +
> +	printf("dio_size=%u\n", f->dio_size_gran);
> +}
> +
> +/*
> + *
> + */
> +typedef void (*dumper_t)(union reply *r, int size);
> +
> +#define FSINFO_DUMPER(N) [fsinfo_attr_##N] = dump_attr_##N
> +static const dumper_t fsinfo_attr_dumper[fsinfo_attr__nr] = {
> +	FSINFO_DUMPER(statfs),
> +	FSINFO_DUMPER(fsinfo),
> +	FSINFO_DUMPER(ids),
> +	FSINFO_DUMPER(limits),
> +	FSINFO_DUMPER(supports),
> +	FSINFO_DUMPER(capabilities),
> +	FSINFO_DUMPER(timestamp_info),
> +	FSINFO_DUMPER(volume_uuid),
> +	FSINFO_DUMPER(server_address),
> +	FSINFO_DUMPER(io_size),
> +};
> +
> +static void dump_fsinfo(enum fsinfo_attribute attr, __u8 about,
> +			union reply *r, int size)
> +{
> +	dumper_t dumper = fsinfo_attr_dumper[attr];
> +	unsigned int len;
> +
> +	if (!dumper) {
> +		printf("<no dumper>\n");
> +		return;
> +	}
> +
> +	len = (about & 0x3f) * sizeof(__u32);
> +	if (size < len) {
> +		printf("<short data %u/%u>\n", size, len);
> +		return;
> +	}
> +
> +	dumper(r, size);
> +}
> +
> +/*
> + * Try one subinstance of an attribute.
> + */
> +static int try_one(const char *file, struct fsinfo_params *params, bool raw)
> +{
> +	union reply r;
> +	char *p;
> +	int ret;
> +	__u8 about;
> +
> +	memset(&r.buffer, 0xbd, sizeof(r.buffer));
> +
> +	errno = 0;
> +	ret = fsinfo(AT_FDCWD, file, params, r.buffer, sizeof(r.buffer));
> +	if (params->request >= fsinfo_attr__nr) {
> +		if (ret == -1 && errno == EOPNOTSUPP)
> +			exit(0);
> +		fprintf(stderr, "Unexpected error for too-large command %u: %m\n",
> +			params->request);
> +		exit(1);
> +	}
> +
> +	//printf("fsinfo(%s,%s,%u,%u) = %d: %m\n",
> +	//       file, fsinfo_attr_names[params->request],
> +	//       params->Nth, params->Mth, ret);
> +
> +	about = fsinfo_buffer_sizes[params->request];
> +	if (ret == -1) {
> +		if (errno == ENODATA) {
> +			switch (about & 0xc0) {
> +			case 0x00:
> +				if (params->Nth == 0 && params->Mth == 0) {
> +					fprintf(stderr,
> +						"Unexpected ENODATA1 (%u[%u][%u])\n",
> +						params->request, params->Nth, params->Mth);
> +					exit(1);
> +				}
> +				break;
> +			case 0x40:
> +				if (params->Nth == 0 && params->Mth == 0) {
> +					fprintf(stderr,
> +						"Unexpected ENODATA2 (%u[%u][%u])\n",
> +						params->request, params->Nth, params->Mth);
> +					exit(1);
> +				}
> +				break;
> +			}
> +			return (params->Mth == 0) ? 2 : 1;
> +		}
> +		if (errno == EOPNOTSUPP) {
> +			if (params->Nth > 0 || params->Mth > 0) {
> +				fprintf(stderr,
> +					"Should return -ENODATA (%u[%u][%u])\n",
> +					params->request, params->Nth, params->Mth);
> +				exit(1);
> +			}
> +			//printf("\e[33m%s\e[m: <not supported>\n",
> +			//       fsinfo_attr_names[attr]);
> +			return 2;
> +		}
> +		perror(file);
> +		exit(1);
> +	}
> +
> +	if (raw) {
> +		if (ret > 4096)
> +			ret = 4096;
> +		dump_hex((unsigned int *)&r.buffer, 0, ret);
> +		return 0;
> +	}
> +
> +	switch (about & 0xc0) {
> +	case 0x00:
> +		printf("\e[33m%s\e[m: ",
> +		       fsinfo_attr_names[params->request]);
> +		break;
> +	case 0x40:
> +		printf("\e[33m%s[%u]\e[m: ",
> +		       fsinfo_attr_names[params->request],
> +		       params->Nth);
> +		break;
> +	case 0x80:
> +		printf("\e[33m%s[%u][%u]\e[m: ",
> +		       fsinfo_attr_names[params->request],
> +		       params->Nth, params->Mth);
> +		break;
> +	}
> +
> +	switch (about) {
> +		/* Struct */
> +	case 0x01 ... 0x3f:
> +	case 0x41 ... 0x7f:
> +	case 0x81 ... 0xbf:
> +		dump_fsinfo(params->request, about, &r, ret);
> +		return 0;
> +
> +		/* String */
> +	case 0x00:
> +	case 0x40:
> +	case 0x80:
> +		if (ret >= 4096) {
> +			ret = 4096;
> +			r.buffer[4092] = '.';
> +			r.buffer[4093] = '.';
> +			r.buffer[4094] = '.';
> +			r.buffer[4095] = 0;
> +		} else {
> +			r.buffer[ret] = 0;
> +		}
> +		for (p = r.buffer; *p; p++) {
> +			if (!isprint((unsigned char)*p)) {
> +				printf("<non-printable>\n");
> +				return 0;
> +			}
> +		}
> +		printf("%s\n", r.buffer);
> +		return 0;
> +
> +	default:
> +		fprintf(stderr, "Fishy about %u %02x\n", params->request, about);
> +		exit(1);
> +	}
> +}
> +
> +/*
> + *
> + */
> +int main(int argc, char **argv)
> +{
> +	struct fsinfo_params params = {
> +		.at_flags = AT_SYMLINK_NOFOLLOW,
> +	};
> +	unsigned int attr;
> +	int raw = 0, opt, Nth, Mth;
> +
> +	while ((opt = getopt(argc, argv, "alr")) != -1) {
> +		switch (opt) {
> +		case 'a':
> +			params.at_flags |= AT_NO_AUTOMOUNT;
> +			continue;
> +		case 'l':
> +			params.at_flags &= ~AT_SYMLINK_NOFOLLOW;
> +			continue;
> +		case 'r':
> +			raw = 1;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	argc -= optind;
> +	argv += optind;
> +
> +	if (argc != 1) {
> +		printf("Format: test-fsinfo [-alr] <file>\n");
> +		exit(2);
> +	}
> +
> +	for (attr = 0; attr <= fsinfo_attr__nr; attr++) {
> +		Nth = 0;
> +		do {
> +			Mth = 0;
> +			do {
> +				params.request = attr;
> +				params.Nth = Nth;
> +				params.Mth = Mth;
> +
> +				switch (try_one(argv[0], &params, raw)) {
> +				case 0:
> +					continue;
> +				case 1:
> +					goto done_M;
> +				case 2:
> +					goto done_N;
> +				}
> +			} while (++Mth < 100);
> +
> +		done_M:
> +			if (Mth >= 100) {
> +				fprintf(stderr, "Fishy: Mth == %u\n", Mth);
> +				break;
> +			}
> +
> +		} while (++Nth < 100);
> +
> +	done_N:
> +		if (Nth >= 100) {
> +			fprintf(stderr, "Fishy: Nth == %u\n", Nth);
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> 

* Re: [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information [ver #10]
  2018-07-27 17:35 ` [PATCH 34/38] vfs: syscall: Add fsinfo() to query filesystem information " David Howells
                     ` (9 preceding siblings ...)
  2018-07-31 23:49   ` Darrick J. Wong
@ 2018-08-01  1:07   ` David Howells
  10 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2018-08-01  1:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: dhowells, viro, linux-api, torvalds, linux-fsdevel, linux-kernel

Darrick J. Wong <darrick.wong@oracle.com> wrote:

> <snip> I only have time today to review the user interface bits...

Thanks:-)

> > +	fsinfo_attr_volume_id		= 7,	/* Volume ID (string) */
> > +	fsinfo_attr_volume_uuid		= 8,	/* Volume UUID (LE uuid) */
> > +	fsinfo_attr_volume_name		= 9,	/* Volume name (string) */
> 
> What's the difference between a volume name and a volume string?

Um?  There is no "volume string" defined.

What the parentheses in the comment mean is that fsinfo_attr_volume_name
returns a variable-length string rather than a fixed structure, i.e. it's type
information.

> XFS has a uuid and a label that can be set by userspace (sort of);
> should we return the label for volume_id and volume_name?
> 
> Hmmm, I see that the default implementations set volume_id from s_id,
> and s_id (for block device filesystems anyway) tends to be the device, I
> guess?
> 
> So if blkid told me that:
> /dev/sda1: LABEL="music" UUID="8d9e5b1e-a094-49e5-a179-6d94f7fd8399" TYPE="xfs"
> 
> volume_id == sda1, volume_uuid == 8d9e5b1e-a094-49e5-a179-6d94f7fd8399,
> and volume_name == "music" ?

I would do it like that.  Note that these things are described in the manual
page that I posted previously.  I'll attach that here (note that it needs
updating).

> > +	fsinfo_attr_source		= 16,	/* Nth mount source name (string) */
> 
> Hmm, so I guess external log devices and realtime device(s) go here?

Ummm...  Not sure.  I feel like they should, but they can also go in
FSINFO_ATTR_PARAMETER if they're already described by a mount parameter.

I was thinking more of bcachefs where the "source" parameter to mount(2) looks
something like "/dev/sda1:/dev/sda2".
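
As a rough userspace illustration of how the Nth source could be picked out of
such a string (this is only a sketch; nth_source() is a made-up helper, not
something in the patches, and a real filesystem may not keep its sources as a
colon-separated list at all):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the Nth colon-separated component of 'source'
 * into 'out'.  Returns the number of bytes copied (excluding the NUL), or
 * -ENODATA once 'nth' runs past the last component.
 */
static int nth_source(const char *source, unsigned int nth,
		      char *out, size_t out_size)
{
	const char *p = source;
	size_t len;

	assert(source && out && out_size > 0);

	while (nth--) {
		p = strchr(p, ':');
		if (!p)
			return -ENODATA;
		p++;			/* step over the separator */
	}

	len = strcspn(p, ":");
	if (len >= out_size)
		len = out_size - 1;	/* truncate rather than fail */
	memcpy(out, p, len);
	out[len] = '\0';
	return (int)len;
}
```

A caller would bump nth from 0 until -ENODATA comes back, which is the same
iteration pattern the test program uses for params->Nth.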

One of the important considerations for setting up the parser is that we still
have to handle mount(2) for existing filesystems.

> Are we tied to this enum forever, or do you plan to split up the number
> space to allow filesystems to define their own attributes without having
> to add them here?
> 
> For example, say you let the upper 8 bits be some sort of per-fs code
> (like how _IO{,R,W} work) and the lower 24 bits can be the subcommand.
> 0x00 would be the generic space; XFS could (say) reserve 0x58000000 -
> 0x58ffffff for XFS (0x58 is the prefix code used for xfs ioctls).  If
> there ever are subdivisions of the number space it might be nice to have
> fsinfo_fsinfo return prefix number of the fs-specific subcommands, and
> how many fs-specific subcommands there are.
> 
> I mean, I guess each fs' ->fsinfo function can do that privately but I
> suggest having some mechanism in mind to handle these things.  XFS's
> geometry ioctl structure is nearly out of space and (some day soon) we
> will have to expand and maybe we can use fsinfo instead.
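
To make the suggested split concrete, here is a minimal sketch of an 8-bit
prefix / 24-bit subcommand encoding.  All of the FSATTR_* names and the
fsattr_*() helpers are hypothetical; nothing like this exists in the posted
patches, and the only values taken from the discussion are the 0x00 generic
space and the 0x58 XFS prefix.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: top 8 bits pick the filesystem, low 24 the subcommand. */
#define FSATTR_PREFIX_SHIFT	24
#define FSATTR_SUBCMD_MASK	0x00ffffffu
#define FSATTR_PREFIX_GENERIC	0x00u	/* generic attribute space */
#define FSATTR_PREFIX_XFS	0x58u	/* 'X', as used for the XFS ioctls */

/* Combine a per-fs prefix and a subcommand into one attribute number. */
static inline uint32_t fsattr_make(uint32_t prefix, uint32_t subcmd)
{
	assert(prefix <= 0xffu && subcmd <= FSATTR_SUBCMD_MASK);
	return (prefix << FSATTR_PREFIX_SHIFT) | subcmd;
}

/* Extract the per-fs prefix (0 means the generic space). */
static inline uint32_t fsattr_prefix(uint32_t attr)
{
	return attr >> FSATTR_PREFIX_SHIFT;
}

/* Extract the subcommand within that prefix. */
static inline uint32_t fsattr_subcmd(uint32_t attr)
{
	return attr & FSATTR_SUBCMD_MASK;
}
```

With this layout the existing attribute numbers (0, 1, 2, ...) stay valid as
generic-space values, since their top 8 bits are already zero.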

I was planning on requiring them to be added here and also listed in:

	static const u16 fsinfo_buffer_sizes[FSINFO_ATTR__NR] = {
		FSINFO_STRUCT		(STATFS,		statfs),
		FSINFO_STRUCT		(FSINFO,		fsinfo),
		...
	};

in fs/statfs.c.

> > +	__u32	f_dev_minor;
> > +};
> 
> This structure doesn't end on a 64-bit boundary and may cause padding
> problems...

I've fixed that, thanks.
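
For anyone wanting to check this sort of thing from userspace, a small sketch
follows.  The first struct mirrors fsinfo_ids as posted, using <stdint.h>
types; the second struct, its reserved field and the helper names are
assumptions for illustration and not necessarily how the next version of the
patch lays things out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct fsinfo_ids as posted: 52 bytes of named members. */
struct ids_as_posted {
	char		f_fs_name[15 + 1];
	uint64_t	f_flags;
	uint64_t	f_fsid;
	uint64_t	f_sb_id;
	uint32_t	f_fstype;
	uint32_t	f_dev_major;
	uint32_t	f_dev_minor;
};

/* Hypothetical variant: pad explicitly so the size is a multiple of 8. */
struct ids_padded {
	char		f_fs_name[15 + 1];
	uint64_t	f_flags;
	uint64_t	f_fsid;
	uint64_t	f_sb_id;
	uint32_t	f_fstype;
	uint32_t	f_dev_major;
	uint32_t	f_dev_minor;
	uint32_t	reserved[1];
};

/* Bytes the compiler adds after the last named member. */
#define TAIL_PADDING(type, last) \
	(sizeof(type) - (offsetof(type, last) + sizeof(((type *)0)->last)))

static size_t ids_as_posted_tail_padding(void)
{
	return TAIL_PADDING(struct ids_as_posted, f_dev_minor);
}

static size_t ids_padded_tail_padding(void)
{
	return TAIL_PADDING(struct ids_padded, reserved);
}
```

The posted layout gets implicit tail padding wherever 64-bit integers are
8-byte aligned and none where they are only 4-byte aligned, so its size can
differ between ABIs; the explicitly padded variant leaves the compiler nothing
to add.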

> Maximum inode number possible, for filesystems that can allocate inodes
> dynamically?
> 
> Granted, XFS will probably only ever advertise "0xffffffffffffffff"...

Is that possible with any of our current interfaces?

It's something I can add, but I can imagine circumstances where the inode
number space has holes in it that can't be allocated (say inode numbers
correspond to particular blocks on disk).  I wonder if that's something I need
worry about.

Btw, note that the fsinfo() interface is constructed such that it's practical
to expand any particular struct in the future, provided any new fields are
tagged on the end and don't mind defaulting to 0.  fsinfo() returns just the
data you asked for, truncating the returned data to fit your request.  If you
ask for more than it has, then it clears the excess space (hence the
defaulting to 0 condition above).
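
The copy-out rule described above can be modelled in userspace roughly as
follows.  This is a sketch of the semantics only, not the kernel code;
fsinfo_copy_model() and its parameter names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the rule: 'data'/'data_size' stand for what the kernel
 * has, 'buffer'/'buf_size' for what the caller asked for.
 */
static size_t fsinfo_copy_model(void *buffer, size_t buf_size,
				const void *data, size_t data_size)
{
	size_t n = data_size < buf_size ? data_size : buf_size;

	assert(data != NULL);
	assert(buffer != NULL || buf_size == 0);

	if (n)
		memcpy(buffer, data, n);	/* truncate to fit */
	if (buf_size > n)
		memset((char *)buffer + n, 0, buf_size - n); /* clear excess */
	return data_size;			/* always the real size */
}
```

A caller built against a larger struct than the kernel knows about sees the
extra fields as 0, and a caller built against a smaller one simply never sees
the newer fields; in both cases the return value tells it how much data there
really was.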

> > +struct fsinfo_io_size {
> > +	__u32		dio_size_gran;	/* Size granularity for O_DIRECT */
> > +	__u32		dio_mem_align;	/* Memory alignment for O_DIRECT */
> 
> max io size too?

That needs more discussion, I think, particularly involving Dave Chinner.

> 64-bit too, in case we ever get that insane?

If you really want.  4GiB alignment and granularity is a bit insane, though.

David
---
'\" t
.\" Copyright (c) 2018 David Howells <dhowells@redhat.com>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH FSINFO 2 2018-06-06 "Linux" "Linux Programmer's Manual"
.SH NAME
fsinfo \- Get filesystem information
.SH SYNOPSIS
.nf
.B #include <sys/types.h>
.br
.B #include <sys/fsinfo.h>
.br
.B #include <unistd.h>
.br
.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
.PP
.BI "int fsinfo(int " dirfd ", const char *" pathname ","
.BI "           struct fsinfo_params *" params ","
.BI "           void *" buffer ", size_t " buf_size );
.fi
.PP
.IR Note :
There is no glibc wrapper for
.BR fsinfo ();
see NOTES.
.SH DESCRIPTION
.PP
fsinfo() retrieves the desired filesystem attribute, as selected by the
parameters pointed to by
.IR params ,
and stores its value in the buffer pointed to by
.IR buffer .
.PP
The parameter structure is optional, defaulting to all the parameters being 0
if the pointer is NULL.  The structure looks like the following:
.PP
.in +4n
.nf
struct fsinfo_params {
    __u32 at_flags;     /* AT_SYMLINK_NOFOLLOW and similar flags */
    __u32 request;      /* Requested attribute */
    __u32 Nth;          /* Instance of attribute */
    __u32 Mth;          /* Subinstance of Nth instance */
    __u32 __reserved[6]; /* Reserved params; all must be 0 */
};
.fi
.in
.PP
The filesystem to be queried is looked up using a combination of
.IR dirfd ", " pathname " and " params->at_flags .
This is discussed in more detail below.
.PP
The desired attribute is indicated by
.IR params->request .
If
.I params
is NULL, this will default to
.BR fsinfo_attr_statfs ,
which retrieves some of the information returned by
.BR statfs ().
The available attributes are described below in the "THE ATTRIBUTES" section.
.PP
Some attributes can have multiple values and some can even have multiple
instances with multiple values.  For example, a network filesystem might use
multiple servers.  The names of each of these servers can be retrieved by
using
.I params->Nth
to iterate through all the instances until error
.B ENODATA
occurs, indicating the end of the list.  Further, each server might have
multiple addresses available; these can be enumerated using
.I params->Nth
to iterate the servers and
.I params->Mth
to iterate the addresses of the Nth server.
.PP
The amount of data written into the buffer depends on the attribute selected.
Some attributes return variable-length strings and some return fixed-size
structures.  If either
.IR buffer " is NULL or " buf_size " is 0"
then the size of the attribute value will be returned and nothing will be
written into the buffer.
.PP
The
.I params->__reserved
parameters must all be 0.
.\"_______________________________________________________
.SS
Allowance for Future Attribute Expansion
.PP
To allow for the future expansion and addition of fields to any fixed-size
structure attribute,
.BR fsinfo ()
makes the following guarantees:
.RS 4m
.IP (1) 4m
It will always clear any excess space in the buffer.
.IP (2) 4m
It will always return the actual size of the data.
.IP (3) 4m
It will truncate the data to fit it into the buffer rather than giving an
error.
.IP (4) 4m
Any new version of a structure will incorporate all the fields from the old
version at the same offsets.
.RE
.PP
So, for example, if the caller is running on an older version of the kernel
with an older, smaller version of the structure than was asked for, the kernel
will write the smaller version into the buffer and will clear the remainder of
the buffer to make sure any additional fields are set to 0.  The function will
return the actual size of the data.
.PP
On the other hand, if the caller is running on a newer version of the kernel
with a newer version of the structure that is larger than the buffer, the write
to the buffer will be truncated to fit as necessary and the actual size of the
data will be returned.
.PP
Note that this doesn't apply to variable-length string attributes.

.\"_______________________________________________________
.SS
Invoking \fBfsinfo\fR():
.PP
To query filesystem information, no permissions are required on the file itself, but
in the case of
.BR fsinfo ()
with a path, execute (search) permission is required on all of the directories
in
.I pathname
that lead to the file.
.PP
.BR fsinfo ()
uses
.IR pathname ", " dirfd " and " params->at_flags
to locate the target file in one of a variety of ways:
.TP
[*] By absolute path.
.I pathname
points to an absolute path and
.I dirfd
is ignored.  The file is looked up by name, starting from the root of the
filesystem as seen by the calling process.
.TP
[*] By cwd-relative path.
.I pathname
points to a relative path and
.IR dirfd " is " AT_FDCWD .
The file is looked up by name, starting from the current working directory.
.TP
[*] By dir-relative path.
.I pathname
points to a relative path and
.I dirfd
indicates a file descriptor pointing to a directory.  The file is looked up by
name, starting from the directory specified by
.IR dirfd .
.TP
[*] By file descriptor.
.IR pathname " is " NULL " and " dirfd
indicates a file descriptor.  The file attached to the file descriptor is
queried directly.  The file descriptor may point to any type of file, not just
a directory.
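.PP
The four cases above can be summarised as a small decision function.  This is
a sketch for illustration only: the enum and function names are hypothetical,
and the empty-pathname case governed by AT_EMPTY_PATH is deliberately left
out.
.PP
.RS
.in +4n
.nf
```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stddef.h>

enum lookup_mode {          /* hypothetical names, for illustration */
    LOOKUP_ABSOLUTE,
    LOOKUP_CWD_RELATIVE,
    LOOKUP_DIR_RELATIVE,
    LOOKUP_BY_FD,
};

/* Decide which of the four documented lookup styles applies. */
static enum lookup_mode classify_lookup(const char *pathname, int dirfd)
{
    if (!pathname)
        return LOOKUP_BY_FD;            /* file attached to dirfd */
    if (pathname[0] == '/')
        return LOOKUP_ABSOLUTE;         /* dirfd is ignored */
    if (dirfd == AT_FDCWD)
        return LOOKUP_CWD_RELATIVE;
    return LOOKUP_DIR_RELATIVE;         /* dirfd must be a directory */
}
```
.fi
.in
.RE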
.PP
.I params->at_flags
can be used to influence a path-based lookup.  A value for
.I params->at_flags
is constructed by OR'ing together zero or more of the following constants:
.TP
.BR AT_EMPTY_PATH
.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
If
.I pathname
is an empty string, operate on the file referred to by
.IR dirfd
(which may have been obtained using the
.BR open (2)
.B O_PATH
flag).
If
.I dirfd
is
.BR AT_FDCWD ,
the call operates on the current working directory.
In this case,
.I dirfd
can refer to any type of file, not just a directory.
This flag is Linux-specific; define
.B _GNU_SOURCE
.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
to obtain its definition.
.TP
.BR AT_NO_AUTOMOUNT
Don't automount the terminal ("basename") component of
.I pathname
if it is a directory that is an automount point.  This allows the caller to
gather attributes of the filesystem holding an automount point (rather than
the filesystem it would mount).  This flag can be used in tools that scan
directories to prevent mass-automounting of a directory of automount points.
The
.B AT_NO_AUTOMOUNT
flag has no effect if the mount point has already been mounted over.
This flag is Linux-specific; define
.B _GNU_SOURCE
.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
to obtain its definition.
.TP
.B AT_SYMLINK_NOFOLLOW
If
.I pathname
is a symbolic link, do not dereference it:
instead return information about the link itself, like
.BR lstat ().
.SH THE ATTRIBUTES
.PP
There is a range of attributes that can be selected from.  These are:

.\" __________________ fsinfo_attr_statfs __________________
.TP
.B fsinfo_attr_statfs
This retrieves the "dynamic"
.B statfs
information, such as block and file counts, that are expected to change whilst
a filesystem is being used.  This fills in the following structure:
.PP
.RS
.in +4n
.nf
struct fsinfo_statfs {
    __u64 f_blocks;	/* Total number of blocks in fs */
    __u64 f_bfree;	/* Total number of free blocks */
    __u64 f_bavail;	/* Number of free blocks available to ordinary user */
    __u64 f_files;	/* Total number of file nodes in fs */
    __u64 f_ffree;	/* Number of free file nodes */
    __u64 f_favail;	/* Number of free file nodes available to ordinary user */
    __u32 f_bsize;	/* Optimal block size */
    __u32 f_frsize;	/* Fragment size */
};
.fi
.in
.RE
.IP
The fields correspond to those of the same name returned by
.BR statfs ().

.\" __________________ fsinfo_attr_fsinfo __________________
.TP
.B fsinfo_attr_fsinfo
This retrieves information about the
.BR fsinfo ()
system call itself.  This fills in the following structure:
.PP
.RS
.in +4n
.nf
struct fsinfo_fsinfo {
    __u32 max_attr;
    __u32 max_cap;
};
.fi
.in
.RE
.IP
The
.I max_attr
value indicates the number of attributes supported by the
.BR fsinfo ()
system call, and
.I max_cap
indicates the number of capability bits supported by the
.B fsinfo_attr_capabilities
attribute.  The first corresponds to
.I fsinfo_attr__nr
and the second to
.I fsinfo_cap__nr
in the header file.

.\" __________________ fsinfo_attr_ids __________________
.TP
.B fsinfo_attr_ids
This retrieves a number of fixed IDs and other static information otherwise
available through
.BR statfs ().
The following structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_ids {
    char  f_fs_name[15 + 1]; /* Filesystem name */
    __u64 f_flags;	/* Filesystem mount flags (MS_*) */
    __u64 f_fsid;	/* Short 64-bit Filesystem ID */
    __u64 f_sb_id;	/* Internal superblock ID */
    __u32 f_fstype;	/* Filesystem type from linux/magic.h */
    __u32 f_dev_major;	/* As st_dev_* from struct statx */
    __u32 f_dev_minor;
};
.fi
.in
.RE
.IP
Most of these are filled in as for
.BR statfs (),
with the addition of the filesystem's symbolic name in
.I f_fs_name
and an identifier for use in notifications in
.IR f_sb_id .

.\" __________________ fsinfo_attr_limits __________________
.TP
.B fsinfo_attr_limits
This retrieves information about the limits of what a filesystem can support.
The following structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_limits {
    __u64 max_file_size;
    __u64 max_uid;
    __u64 max_gid;
    __u64 max_projid;
    __u32 max_dev_major;
    __u32 max_dev_minor;
    __u32 max_hard_links;
    __u32 max_xattr_body_len;
    __u16 max_xattr_name_len;
    __u16 max_filename_len;
    __u16 max_symlink_len;
    __u16 __reserved[1];
};
.fi
.in
.RE
.IP
These indicate the maximum supported sizes for a variety of filesystem objects,
including the file size, the extended attribute name length and body length,
the filename length and the symlink body length.
.IP
It also indicates the maximum representable values for a User ID, a Group ID,
a Project ID, a device major number and a device minor number.
.IP
And finally, it indicates the maximum number of hard links that can be made to
a file.
.IP
Note that some of these values may be zero if the underlying object or concept
is not supported by the filesystem or the medium.

.\" __________________ fsinfo_attr_supports __________________
.TP
.B fsinfo_attr_supports
This retrieves information about what bits a filesystem supports in various
masks.  The following structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_supports {
    __u64 stx_attributes;
    __u32 stx_mask;
    __u32 ioc_flags;
    __u32 win_file_attrs;
    __u32 __reserved[1];
};
.fi
.in
.RE
.IP
The
.IR stx_attributes " and " stx_mask
fields indicate what bits in the struct statx fields of the matching names
are supported by the filesystem.
.IP
The
.I ioc_flags
field indicates what FS_*_FL flag bits as used through the FS_IOC_GET/SETFLAGS
ioctls are supported by the filesystem.
.IP
The
.I win_file_attrs
indicates what DOS/Windows file attributes a filesystem supports, if any.

.\" __________________ fsinfo_attr_capabilities __________________
.TP
.B fsinfo_attr_capabilities
This retrieves information about what features a filesystem supports as a
series of single bit indicators.  The following structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_capabilities {
    __u8 capabilities[(fsinfo_cap__nr + 7) / 8];
};
.fi
.in
.RE
.IP
where the bit of interest can be found by:
.PP
.RS
.in +4n
.nf
	p->capabilities[bit / 8] & (1 << (bit % 8))
.fi
.in
.RE
.IP
The bits are listed by
.I enum fsinfo_capability
and
.B fsinfo_cap__nr
is one more than the last capability bit listed in the header file.
.IP
Note that the number of capability bits actually supported by the kernel can be
found using the
.B fsinfo_attr_fsinfo
attribute.
.IP
The capability bits and their meanings are listed below in the "THE
CAPABILITIES" section.
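.IP
As a minimal sketch, the bit test can be wrapped in a helper that also refuses
to read past the bytes actually returned.  The name cap_is_set is hypothetical,
and uint8_t is used here in place of __u8 so that the fragment builds without
kernel headers.
.PP
.RS
.in +4n
.nf
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Test one capability bit in the byte array returned for
 * fsinfo_attr_capabilities.  nbytes is the number of valid bytes in
 * caps; bits beyond that are treated as unset.
 */
static bool cap_is_set(const uint8_t *caps, size_t nbytes, unsigned int bit)
{
    if (bit / 8 >= nbytes)
        return false;
    return caps[bit / 8] & (1u << (bit % 8));
}
```
.fi
.in
.RE
.IP
Treating out-of-range bits as unset means that a program compiled against a
newer header still behaves sensibly on a kernel that returns a shorter array.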

.\" __________________ fsinfo_attr_timestamp_info __________________
.TP
.B fsinfo_attr_timestamp_info
This retrieves information about what timestamp resolution and scope is
supported by a filesystem for each of the file timestamps.  The following
structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_timestamp_info {
	__s64 minimum_timestamp;
	__s64 maximum_timestamp;
	__u16 atime_gran_mantissa;
	__u16 btime_gran_mantissa;
	__u16 ctime_gran_mantissa;
	__u16 mtime_gran_mantissa;
	__s8  atime_gran_exponent;
	__s8  btime_gran_exponent;
	__s8  ctime_gran_exponent;
	__s8  mtime_gran_exponent;
	__u32 __reserved[1];
};
.fi
.in
.RE
.IP
where
.IR minimum_timestamp " and " maximum_timestamp
are the limits on the timestamps that the filesystem supports and
.IR *time_gran_mantissa " and " *time_gran_exponent
indicate the granularity of each timestamp in terms of seconds, using the
formula:
.PP
.RS
.in +4n
.nf
mantissa * pow(10, exponent) Seconds
.fi
.in
.RE
.IP
where exponent may be negative and the result may be a fraction of a second.
.IP
Four timestamps are detailed: \fBA\fPccess time, \fBB\fPirth/creation time,
\fBC\fPhange time and \fBM\fPodification time.  Capability bits are defined
that specify whether or not each of these exists in the filesystem.
.IP
Note that the timestamp description may be approximated or inaccurate if the
file is actually remote or is the union of multiple objects.
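.IP
The granularity formula can be evaluated without floating point by rescaling
to nanoseconds.  The helper below is a sketch under that assumption; the name
gran_to_ns is hypothetical, and the fixed-width types stand in for __u16 and
__s8.
.PP
.RS
.in +4n
.nf
```c
#include <stdint.h>

/*
 * Convert a (mantissa, exponent) granularity pair, meaning
 * mantissa * 10^exponent seconds, into whole nanoseconds.
 * Granularities finer than 1ns round down, possibly to 0; values too
 * large for 64 bits saturate at UINT64_MAX.
 */
static uint64_t gran_to_ns(uint16_t mantissa, int8_t exponent)
{
    uint64_t ns = mantissa;
    int e = exponent + 9;       /* rescale from seconds to nanoseconds */

    if (e > 14)
        return UINT64_MAX;      /* 65535 * 10^15 would overflow */
    for (; e > 0; e--)
        ns *= 10;
    for (; e < 0; e++)
        ns /= 10;
    return ns;
}
```
.fi
.in
.RE
.IP
For instance, a mantissa of 1 with an exponent of -7 gives 100 nanoseconds,
and a mantissa of 2 with an exponent of 0 gives two seconds.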

.\" __________________ fsinfo_attr_volume_id __________________
.TP
.B fsinfo_attr_volume_id
This retrieves the system's superblock volume identifier as a variable-length
string.  This does not necessarily represent a value stored in the medium but
might be constructed on the fly.
.IP
For instance, for a block device this is the block device identifier
(e.g., "sdb2"); for AFS this would be the numeric volume identifier.

.\" __________________ fsinfo_attr_volume_uuid __________________
.TP
.B fsinfo_attr_volume_uuid
This retrieves the volume UUID, if there is one, as a little-endian binary
UUID.  This fills in the following structure:
.PP
.RS
.in +4n
.nf
struct fsinfo_volume_uuid {
    __u8 uuid[16];
};
.fi
.in
.RE
.IP

.\" __________________ fsinfo_attr_volume_name __________________
.TP
.B fsinfo_attr_volume_name
This retrieves the filesystem's volume name as a variable-length string.  This
is expected to represent a name stored in the medium.
.IP
For a block device, this might be a label stored in the superblock.  For a
network filesystem, this might be a logical volume name of some sort.

.\" __________________ fsinfo_attr_cell/domain __________________
.PP
.B fsinfo_attr_cell_name
.br
.B fsinfo_attr_domain_name
.br
.IP
These two attributes are variable-length string attributes that may be used to
obtain information about network filesystems.  An AFS volume, for instance,
belongs to a named cell.  CIFS shares may belong to a domain.

.\" __________________ fsinfo_attr_realm_name __________________
.TP
.B fsinfo_attr_realm_name
This attribute is a variable-length string that indicates the Kerberos realm that
a filesystem's authentication tokens should come from.

.\" __________________ fsinfo_attr_server_name __________________
.TP
.B fsinfo_attr_server_name
This attribute is a multiple-value attribute that lists the names of the
servers that are backing a network filesystem.  Each value is a variable-length
string.  The values are enumerated by calling
.BR fsinfo ()
multiple times, incrementing
.I params->Nth
each time until an ENODATA error occurs, thereby indicating the end of the
list.

.\" __________________ fsinfo_attr_server_address __________________
.TP
.B fsinfo_attr_server_address
This attribute is a multiple-instance, multiple-value attribute that lists the
addresses of the servers that are backing a network filesystem.  Each value is
a structure of the following type:
.PP
.RS
.in +4n
.nf
struct fsinfo_server_address {
    struct __kernel_sockaddr_storage address;
};
.fi
.in
.RE
.IP
Where the address may be AF_INET, AF_INET6, AF_RXRPC or any other type as
appropriate to the filesystem.
.IP
The values are enumerated by calling
.BR fsinfo ()
multiple times, incrementing
.I params->Nth
to step through the servers and
.I params->Mth
to step through the addresses of the Nth server each time until ENODATA errors
occur, thereby indicating either the end of a server's address list or the end
of the server list.
.IP
Barring the server list changing whilst being accessed, it is expected that a
given value of
.I params->Nth
here will refer to the same server as the same value of
.I params->Nth
for
.BR fsinfo_attr_server_name .
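.IP
The nested enumeration might be structured as in the following sketch.  The
query_fn callback is a hypothetical stand-in for an
.BR fsinfo ()
call that requests this attribute with the given Nth and Mth, and fake_query
is a made-up backend included only so that the fragment is self-contained.
The sketch assumes that ENODATA at Mth 0 marks the end of the server list.
.PP
.RS
.in +4n
.nf
```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for fsinfo(): returns the value size, or -1 with errno set. */
typedef long (*query_fn)(unsigned int nth, unsigned int mth,
                         void *buf, size_t buf_size);

/* Count every address of every server; returns -1 on unexpected error. */
static long count_server_addresses(query_fn query)
{
    long total = 0;

    for (unsigned int nth = 0; ; nth++) {
        unsigned int mth;

        for (mth = 0; ; mth++) {
            if (query(nth, mth, NULL, 0) >= 0) {
                total++;
                continue;
            }
            if (errno != ENODATA)
                return -1;
            break;              /* end of this server's addresses */
        }
        if (mth == 0)
            return total;       /* assumed: there is no Nth server */
    }
}

/* Fake backend: server 0 has 2 addresses, server 1 has 1. */
static long fake_query(unsigned int nth, unsigned int mth,
                       void *buf, size_t buf_size)
{
    static const unsigned int naddrs[] = { 2, 1 };

    (void)buf;
    (void)buf_size;
    if (nth >= 2 || mth >= naddrs[nth]) {
        errno = ENODATA;
        return -1;
    }
    return 128;                 /* arbitrary illustrative value size */
}
```
.fi
.in
.RE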

.\" __________________ fsinfo_attr_parameter __________________
.TP
.B fsinfo_attr_parameter
This attribute is a multiple-value attribute that lists the values of the mount
parameters for a filesystem as variable-length strings.
.IP
The parameters are enumerated by calling
.BR fsinfo ()
multiple times, incrementing
.I params->Nth
to step through them until error ENODATA is given.
.IP
Parameter strings are presented in a form akin to the way they're passed to the
context created by the
.BR fsopen ()
system call.  For example, straight text parameters will be rendered as
something like:
.PP
.RS
.in +4n
.nf
"o data=journal"
"o noquota"
.fi
.in
.RE
.IP
Where the initial "word" indicates the option form.
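.IP
A consumer could separate the option form from the rest of the string as
sketched below.  The helper name split_option_form is hypothetical, and the
sketch assumes that a single space follows the form word, as in the examples
above.
.PP
.RS
.in +4n
.nf
```c
#include <stddef.h>
#include <string.h>

/*
 * Split a string such as "o data=journal" into its leading option-form
 * word and the remainder.  Returns a pointer to the remainder within s
 * and stores the length of the form word in *form_len; if there is no
 * space, the remainder is the empty string at the end of s.
 */
static const char *split_option_form(const char *s, size_t *form_len)
{
    const char *sp = strchr(s, ' ');

    if (!sp) {
        *form_len = strlen(s);
        return s + *form_len;
    }
    *form_len = (size_t)(sp - s);
    return sp + 1;
}
```
.fi
.in
.RE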

.\" __________________ fsinfo_attr_source __________________
.TP
.B fsinfo_attr_source
This attribute is a multiple-value attribute that lists the mount sources for a
filesystem as variable-length strings.  Normally only one source will be
available, but the possibility of having more than one is allowed for.
.IP
The sources are enumerated by calling
.BR fsinfo ()
multiple times, incrementing
.I params->Nth
to step through them until error ENODATA is given.
.IP
Source strings are presented in a form akin to the way they're passed to the
context created by the
.BR fsopen ()
system call.  For example, they will be rendered as something like:
.PP
.RS
.in +4n
.nf
"s /dev/sda1"
"s example.com/pub/linux/"
.fi
.in
.RE
.IP
Where the initial "word" indicates the option form.

.\" __________________ fsinfo_attr_name_encoding __________________
.TP
.B fsinfo_attr_name_encoding
This attribute is a variable-length string that indicates the filename encoding
used by the filesystem.  The default is "utf8".  Note that this may indicate a
non-8-bit encoding if that's what the underlying filesystem actually supports.

.\" __________________ fsinfo_attr_name_codepage __________________
.TP
.B fsinfo_attr_name_codepage
This attribute is a variable-length string that indicates the codepage used to
translate filenames from the filesystem to the system if this is applicable to
the filesystem.

.\" __________________ fsinfo_attr_io_size __________________
.TP
.B fsinfo_attr_io_size
This retrieves information about the I/O sizes supported by the filesystem.
The following structure is filled in:
.PP
.RS
.in +4n
.nf
struct fsinfo_io_size {
    __u32 block_size;
    __u32 max_single_read_size;
    __u32 max_single_write_size;
    __u32 best_read_size;
    __u32 best_write_size;
};
.fi
.in
.RE
.IP
Where
.I block_size
indicates the fundamental I/O block size of the filesystem as something
O_DIRECT read/write sizes must be a multiple of;
.IR max_single_read_size " and " max_single_write_size
indicate the maximum sizes for individual unbuffered data transfer operations;
and
.IR best_read_size " and " best_write_size
indicate the recommended I/O sizes.
.IP
Note that any of these may be zero if inapplicable or indeterminable.
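.IP
One way an application might apply these values to an unbuffered transfer is
sketched below.  The helper name clamp_direct_io_len is hypothetical; it
treats a zero field as "no constraint known" in line with the note above.
.PP
.RS
.in +4n
.nf
```c
#include <stdint.h>

/*
 * Clamp a transfer length to max_single (for example
 * max_single_read_size) and then round it down to a multiple of
 * block_size.  A zero field means "inapplicable or indeterminable",
 * so that constraint is skipped.
 */
static uint64_t clamp_direct_io_len(uint64_t len, uint32_t block_size,
                                    uint32_t max_single)
{
    if (max_single && len > max_single)
        len = max_single;
    if (block_size)
        len -= len % block_size;
    return len;
}
```
.fi
.in
.RE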



.SH THE CAPABILITIES
.PP
There are a number of capability bits in a bit array that can be retrieved using
.BR fsinfo_attr_capabilities .
These give information about features of the filesystem driver and the specific
filesystem.

.\" __________________ fsinfo_cap_is_*_fs __________________
.PP
.B fsinfo_cap_is_kernel_fs
.br
.B fsinfo_cap_is_block_fs
.br
.B fsinfo_cap_is_flash_fs
.br
.B fsinfo_cap_is_network_fs
.br
.B fsinfo_cap_is_automounter_fs
.IP
These indicate the primary type of the filesystem.
.B kernel
filesystems are special communication interfaces that substitute files for
system calls; examples include procfs and sysfs.
.B block
filesystems require a block device on which to operate; examples include ext4
and XFS.
.B flash
filesystems require an MTD device on which to operate; examples include JFFS2.
.B network
filesystems require access to the network and contact one or more servers;
examples include NFS and AFS.
.B automounter
filesystems are kernel special filesystems that host automount points and
triggers to dynamically create automount points.  Examples include autofs and
AFS's dynamic root.

.\" __________________ fsinfo_cap_automounts __________________
.TP
.B fsinfo_cap_automounts
The filesystem may have automount points that can be triggered by pathwalk.

.\" __________________ fsinfo_cap_adv_locks __________________
.TP
.B fsinfo_cap_adv_locks
The filesystem supports advisory file locks.  For a network filesystem, this
indicates that the advisory file locks are cross-client (and also between
server and its local filesystem on something like NFS).

.\" __________________ fsinfo_cap_mand_locks __________________
.TP
.B fsinfo_cap_mand_locks
The filesystem supports mandatory file locks.  For a network filesystem, this
indicates that the mandatory file locks are cross-client (and also between
server and its local filesystem on something like NFS).

.\" __________________ fsinfo_cap_leases __________________
.TP
.B fsinfo_cap_leases
The filesystem supports leases.  For a network filesystem, this means that the
server will tell the client to clean up its state on a file before passing the
lease to another client.

.\" __________________ fsinfo_cap_*ids __________________
.PP
.B fsinfo_cap_uids
.br
.B fsinfo_cap_gids
.br
.B fsinfo_cap_projids
.IP
These indicate that the filesystem supports numeric user IDs, group IDs and
project IDs respectively.

.\" __________________ fsinfo_cap_id_* __________________
.PP
.B fsinfo_cap_id_names
.br
.B fsinfo_cap_id_guids
.IP
These indicate that the filesystem employs textual names and/or GUIDs as
identifiers.

.\" __________________ fsinfo_cap_windows_attrs __________________
.TP
.B fsinfo_cap_windows_attrs
Indicates that the filesystem supports some Windows FILE_* attributes.

.\" __________________ fsinfo_cap_*_quotas __________________
.PP
.B fsinfo_cap_user_quotas
.br
.B fsinfo_cap_group_quotas
.br
.B fsinfo_cap_project_quotas
.IP
These indicate that the filesystem supports quotas for users, groups and
projects respectively.

.\" __________________ fsinfo_cap_xattrs/filetypes __________________
.PP
.B fsinfo_cap_xattrs
.br
.B fsinfo_cap_symlinks
.br
.B fsinfo_cap_hard_links
.br
.B fsinfo_cap_hard_links_1dir
.br
.B fsinfo_cap_device_files
.br
.B fsinfo_cap_unix_specials
.IP
These indicate that the filesystem supports respectively extended attributes;
symbolic links; hard links spanning directories; hard links, but only within a
directory; block and character device files; and UNIX special files, such as
FIFO and socket.

.\" __________________ fsinfo_cap_*journal* __________________
.PP
.B fsinfo_cap_journal
.br
.B fsinfo_cap_data_is_journalled
.IP
The first of these indicates that the filesystem has a journal and the second
that the file data changes are being journalled.

.\" __________________ fsinfo_cap_o_* __________________
.PP
.B fsinfo_cap_o_sync
.br
.B fsinfo_cap_o_direct
.IP
These indicate that O_SYNC and O_DIRECT are supported respectively.

.\" __________________ fsinfo_cap_o_* __________________
.PP
.B fsinfo_cap_volume_id
.br
.B fsinfo_cap_volume_uuid
.br
.B fsinfo_cap_volume_name
.br
.B fsinfo_cap_volume_fsid
.br
.B fsinfo_cap_cell_name
.br
.B fsinfo_cap_domain_name
.br
.B fsinfo_cap_realm_name
.IP
These indicate if various attributes are supported by the filesystem, where
.B fsinfo_cap_X
here corresponds to
.BR fsinfo_attr_X .

.\" __________________ fsinfo_cap_iver_* __________________
.PP
.B fsinfo_cap_iver_all_change
.br
.B fsinfo_cap_iver_data_change
.br
.B fsinfo_cap_iver_mono_incr
.IP
These indicate if
.I i_version
on an inode in the filesystem is supported and
how it behaves.
.B all_change
indicates that i_version is incremented on metadata changes as well as data
changes.
.B data_change
indicates that i_version is only incremented on data changes, including
truncation.
.B mono_incr
indicates that i_version is incremented by exactly 1 for each change made.

.\" __________________ fsinfo_cap_resource_forks __________________
.TP
.B fsinfo_cap_resource_forks
This indicates that the filesystem supports some sort of resource fork or
alternate data stream on a file.  This isn't the same as an extended attribute.

.\" __________________ fsinfo_cap_name_* __________________
.PP
.B fsinfo_cap_name_case_indep
.br
.B fsinfo_cap_name_non_utf8
.br
.B fsinfo_cap_name_has_codepage
.IP
These indicate certain facts about the filenames in a filesystem: whether
they're case-independent; if they're not UTF-8; and if there's a codepage
employed to map the names.

.\" __________________ fsinfo_cap_sparse __________________
.TP
.B fsinfo_cap_sparse
This indicates that the filesystem supports sparse files.

.\" __________________ fsinfo_cap_not_persistent __________________
.TP
.B fsinfo_cap_not_persistent
This indicates that the filesystem is not persistent, and that any data stored
here will not be saved in the event that the filesystem is unmounted, the
machine is rebooted or the machine loses power.

.\" __________________ fsinfo_cap_no_unix_mode __________________
.TP
.B fsinfo_cap_no_unix_mode
This indicates that the filesystem doesn't support the UNIX mode permissions
bits.

.\" __________________ fsinfo_cap_has_*time __________________
.PP
.B fsinfo_cap_has_atime
.br
.B fsinfo_cap_has_btime
.br
.B fsinfo_cap_has_ctime
.br
.B fsinfo_cap_has_mtime
.IP
These indicate which timestamps a filesystem supports, including: Access
time, Birth/creation time, Change time (metadata and data) and Modification
time (data only).


.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.SH RETURN VALUE
On success, the size of the value that the kernel has available is returned,
irrespective of whether the buffer is large enough to hold that.  The data
written to the buffer will be truncated if it is not.  On error, \-1 is
returned, and
.I errno
is set appropriately.
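.PP
Because the full size is returned even when the buffer is too small, a
variable-length value can be fetched in two passes.  The following is a sketch
only: attr_query_fn is a hypothetical stand-in for an
.BR fsinfo ()
call on one attribute, and fake_volume_query is a made-up backend that returns
the string "example".  Whether the returned size counts a terminating NUL is
not assumed; the result is terminated defensively.
.PP
.RS
.in +4n
.nf
```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an fsinfo() call on one attribute. */
typedef long (*attr_query_fn)(void *buf, size_t buf_size);

/*
 * Ask for the size with a NULL buffer, allocate, then retry until the
 * returned size fits.  The caller must free the result.
 */
static char *fetch_string(attr_query_fn query)
{
    long size = query(NULL, 0);

    while (size >= 0) {
        size_t cap = (size_t)size;
        char *buf = malloc(cap + 1);

        if (!buf)
            return NULL;
        size = query(buf, cap);
        if (size >= 0 && (size_t)size <= cap) {
            buf[size] = 0;
            return buf;
        }
        free(buf);              /* value grew (or failed); try again */
    }
    return NULL;
}

/* Fake backend for demonstration only. */
static long fake_volume_query(void *buf, size_t buf_size)
{
    static const char value[] = "example";
    size_t len = strlen(value);

    if (buf && buf_size)
        memcpy(buf, value, len < buf_size ? len : buf_size);
    return (long)len;
}
```
.fi
.in
.RE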
.SH ERRORS
.TP
.B EACCES
Search permission is denied for one of the directories
in the path prefix of
.IR pathname .
(See also
.BR path_resolution (7).)
.TP
.B EBADF
.I dirfd
is not a valid open file descriptor.
.TP
.B EFAULT
.IR pathname ", " params " or " buffer
points to a location outside the process's accessible address space.
.TP
.B EINVAL
Reserved flag specified in
.IR params->at_flags " or one of " params->__reserved[]
is not 0.
.TP
.B EOPNOTSUPP
Unsupported attribute requested in
.IR params->request .
This may be beyond the limit of the supported attribute set or may just not be
one that's supported by the filesystem.
.TP
.B ENODATA
Unavailable attribute value requested by
.IR params->Nth " and/or " params->Mth .
.TP
.B ELOOP
Too many symbolic links encountered while traversing the pathname.
.TP
.B ENAMETOOLONG
.I pathname
is too long.
.TP
.B ENOENT
A component of
.I pathname
does not exist, or
.I pathname
is an empty string and
.B AT_EMPTY_PATH
was not specified in
.IR params->at_flags .
.TP
.B ENOMEM
Out of memory (i.e., kernel memory).
.TP
.B ENOTDIR
A component of the path prefix of
.I pathname
is not a directory or
.I pathname
is relative and
.I dirfd
is a file descriptor referring to a file other than a directory.
.SH VERSIONS
.BR fsinfo ()
was added to Linux in kernel 4.18.
.SH CONFORMING TO
.BR fsinfo ()
is Linux-specific.
.SH NOTES
Glibc does not (yet) provide a wrapper for the
.BR fsinfo ()
system call; call it using
.BR syscall (2).
.SH SEE ALSO
.BR ioctl_iflags (2),
.BR statx (2),
.BR statfs (2)
