* [PATCH 00/33] VFS: Introduce filesystem context [ver #11]
@ 2018-08-01 15:23 David Howells
2018-08-01 15:24 ` [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
` (8 more replies)
0 siblings, 9 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:23 UTC
To: viro
Cc: John Johansen, Tejun Heo, Eric W. Biederman, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, dhowells, linux-fsdevel, linux-kernel
Hi Al,
Here is a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount. This is also used for remount
since much of the parsing stuff is common in many filesystems.
This allows namespaces and other information to be conveyed through the
mount procedure.
This also allows Miklós Szeredi's idea of doing:
fd = fsopen("nfs");
fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).
I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.
I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.
Unconverted filesystems are handled by a legacy filesystem wrapper.
====================
WHY DO WE WANT THIS?
====================
Firstly, there's a bunch of problems with the mount(2) syscall:
(1) It's actually six or seven different interfaces rolled into one and
weird combinations of flags make it do different things beyond the
original specification of the syscall.
(2) It produces a particularly large and diverse set of errors, which have
to be mapped back to a small error code. Yes, there's dmesg - if you
have it configured - but you can't necessarily see that if you're
doing a mount inside of a container.
(3) It copies a PAGE_SIZE block of data for each of the type, device name
and options.
(4) The size of the buffers is PAGE_SIZE - and this is arch dependent.
(5) You can't mount into another mount namespace. I could, for example,
build a container without having to be in that container's namespace
if I can do it from outside.
(6) It's not really geared for the specification of multiple sources, but
some filesystems really want that - overlayfs, for example.
and some problems in the internal kernel api:
(1) There's no defined way to supply namespace configuration for the
superblock - so, for instance, I can't say that I want to create a
superblock in a particular network namespace (on automount, say).
NFS hacks around this by creating multiple shadow file_system_types
with different ->mount() ops.
(2) When calling mount internally, unless you have NFS-like hacks, you
have to generate or otherwise provide text config data which then gets
parsed, when some of the time you could bypass the parsing stage
entirely.
(3) The amount of data in the data buffer is not known, but the data
buffer might be on a kernel stack somewhere, leading to the
possibility of tripping the stack underrun guard.
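Internal-API problem (3) comes down to the callee having to guess how much data sits behind the pointer. As a rough illustration only, here is a userspace C sketch of the alternative, where the caller states the source length explicitly. copy_mount_data() and its parameters are hypothetical names made up for this example; they are not functions from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy at most src_size bytes of option data into
 * dst, always leaving dst NUL-terminated.  Returns the number of bytes
 * copied, not counting the terminator.  The caller says how big the
 * source really is rather than the callee assuming a whole page.
 */
static size_t copy_mount_data(char *dst, size_t dst_size,
			      const void *src, size_t src_size)
{
	size_t n;

	assert(dst != NULL && dst_size > 0);

	if (src == NULL || src_size == 0) {
		dst[0] = '\0';
		return 0;
	}

	/* Never read past src_size and never write past dst_size - 1. */
	n = src_size < dst_size - 1 ? src_size : dst_size - 1;
	memcpy(dst, src, n);
	dst[n] = '\0';
	return n;
}
```

With the length passed alongside the pointer, a short buffer on a stack is never read beyond its end, whatever the page size happens to be.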
and other issues too:
(1) Superblock remount in some filesystems applies options on an as-parsed
basis, so if there's a parse failure, a partial alteration with no
rollback is effected.
(2) Under some circumstances, the mount data may get copied multiple times
so that it can have multiple parsers applied to it or because it has
to be parsed multiple times - for instance, once to get the
preliminary info required to access the on-disk superblock and then
again to update the superblock record in the kernel.
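To make issue (1) above concrete, the following is a minimal userspace sketch of parsing everything into a staged copy and only committing once the whole option string has parsed. The struct, the option names ("ro", "rw", "rsize=") and the demo_* functions are all assumptions made for this example, not code from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-superblock settings, for illustration only. */
struct demo_opts {
	bool read_only;
	long rsize;
};

/* Parse one "key" or "key=value" token into *o.  Returns 0 or -EINVAL. */
static int demo_parse_one(struct demo_opts *o, const char *tok)
{
	char *end;

	if (strcmp(tok, "ro") == 0) {
		o->read_only = true;
		return 0;
	}
	if (strcmp(tok, "rw") == 0) {
		o->read_only = false;
		return 0;
	}
	if (strncmp(tok, "rsize=", 6) == 0 && tok[6] != '\0') {
		long v = strtol(tok + 6, &end, 10);

		if (*end != '\0' || v <= 0)
			return -EINVAL;
		o->rsize = v;
		return 0;
	}
	return -EINVAL;
}

/*
 * Apply a comma-separated option string to *live.  All tokens are parsed
 * into a staged copy first; *live is only overwritten if every token
 * parsed, so a bad option leaves the live settings untouched.
 */
static int demo_reconfigure(struct demo_opts *live, const char *options)
{
	struct demo_opts staged;
	char buf[256];
	char *tok, *next;
	int ret;

	assert(live != NULL && options != NULL);

	if (strlen(options) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, options);
	staged = *live;

	for (tok = buf; tok != NULL; tok = next) {
		next = strchr(tok, ',');
		if (next != NULL)
			*next++ = '\0';
		if (*tok == '\0')
			continue;	/* tolerate empty tokens such as "a,,b" */
		ret = demo_parse_one(&staged, tok);
		if (ret < 0)
			return ret;	/* *live has not been modified */
	}

	*live = staged;
	return 0;
}
```

The point of the staged copy is that "rw,rsize=bogus" fails as a whole; the "rw" that parsed successfully is discarded along with the rest.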
I want to be able to add support for a bunch of things:
(1) UID, GID and Project ID mapping/translation. I want to be able to
install a translation table of some sort on the superblock to
translate source identifiers (which may be foreign numeric UIDs/GIDs,
text names, GUIDs) into system identifiers. This needs to be done
before the superblock is published[*].
Note that this may, for example, involve using the context and the
superblock held therein to issue an RPC to a server to look up
translations.
[*] By "published" I mean made available through mount so that other
userspace processes can access it by path.
Maybe specifying a translation range element with something like:
fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);
The translation information also needs to propagate over an automount
in some circumstances.
(2) Namespace configuration. I want to be able to tell the superblock
creation process what namespaces should be applied when it is created (in
particular the userns and netns) for containerisation purposes, e.g.:
fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);
(3) Namespace propagation. I want to have a properly defined mechanism
for propagating namespace configuration over automounts within the
kernel. This will be particularly useful for network filesystems.
(4) Pre-mount attribute query. A chunk of the changes is actually the
fsinfo() syscall to query attributes of the filesystem beyond what's
available in statx() and statfs(). This will allow a created
superblock to be queried before it is published.
(5) Upcall for configuration. I would like to be able to query
configuration that's stored in userspace when an automount is made.
For instance, to look up network parameters for NFS or to find a cache
selector for fscache.
The internal fs_context could be passed to the upcall process or the
kernel could read a config file directly if named appropriately for the
superblock, perhaps:
[/etc/fscontext.d/afs/example.com/cell.cfg]
realm = EXAMPLE.COM
translation = uid,3000,4000,100
fscache = tag=fred
(6) Event notifications. I want to be able to install a watch on a
superblock before it is published to catch things like quota events
and EIO.
(7) Large and binary parameters. There might be at some point a need to
pass large/binary objects like Microsoft PACs around. If I understand
PACs correctly, you can obtain these from the Kerberos server and then
pass them to the file server when you connect.
Being able to pass large or binary objects in individual
fsconfig calls makes parsing these trivial. OTOH, some or all of this
can potentially be handled with the use of the keyrings interface - as
the afs filesystem does for passing kerberos tokens around; it's just
that that seems overkill for a parameter you may only need once.
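As a purely illustrative sketch of the translation range element floated in item (1), the userspace C below parses a "<srcuid> <nsuid> <count>" string and maps an ID through it. The struct and the demo_* function names are hypothetical, and how the kernel would actually store or apply such a table is not defined anywhere in this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One translation range: count IDs starting at src map onto dst. */
struct demo_id_range {
	uint32_t src;
	uint32_t dst;
	uint32_t count;
};

/* Parse "<srcuid> <nsuid> <count>".  Returns 0 on success, -1 on bad input. */
static int demo_parse_range(const char *s, struct demo_id_range *r)
{
	unsigned int src, dst, count;
	char trailing;

	assert(s != NULL && r != NULL);

	/* sscanf's %u quietly accepts a leading minus sign; reject it here. */
	if (strchr(s, '-') != NULL)
		return -1;

	/* The trailing %c catches junk after the third number. */
	if (sscanf(s, "%u %u %u %c", &src, &dst, &count, &trailing) != 3)
		return -1;

	/* Reject an empty range or one that would wrap past UINT32_MAX. */
	if (count == 0 || src > UINT32_MAX - (count - 1) ||
	    dst > UINT32_MAX - (count - 1))
		return -1;

	r->src = src;
	r->dst = dst;
	r->count = count;
	return 0;
}

/* Translate id through r.  Returns 1 and stores the result if id is in range. */
static int demo_map_id(const struct demo_id_range *r, uint32_t id,
		       uint32_t *out)
{
	if (id < r->src || id - r->src >= r->count)
		return 0;
	*out = r->dst + (id - r->src);
	return 1;
}
```

A real table would hold several such ranges and would need a policy for IDs that fall outside all of them; the sketch leaves that to the caller by returning 0.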
===================
SIGNIFICANT CHANGES
===================
ver #11:
(*) Fixed AppArmor.
(*) Capitalised all the UAPI constants.
(*) Explicitly numbered the FSCONFIG_* UAPI constants.
(*) Removed all the places ANON_INODES is selected.
(*) Fixed a bug whereby the context gets freed twice (which broke mounts of
procfs).
(*) Split fsinfo() off into its own patch series.
ver #10:
(*) Renamed "option" to "parameter" in a number of places.
(*) Replaced the use of write() to drive the configuration with an fsconfig()
syscall. This also allows at-style paths and fds to be presented as typed
objects.
(*) Routed the key=value parameter concept all the way through from the
fsconfig() system call to the LSM and filesystem.
(*) Added a parameter-description concept and helper functions to help
interpret a parameter and possibly convert the value.
(*) Made it possible to query the parameter description using the fsinfo()
syscall. Added a test-fs-query sample to dump the parameters used by a
filesystem.
ver #9:
(*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.
(*) Al added an open_tree() system call to allow a mount tree to be picked and
referenced or cloned into an O_PATH-style fd. This can then be used
with sys_move_mount(). Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
open() flags.
(*) Brought error logging back in, though only in the fs_context and not
in the task_struct.
(*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.
(*) Used anon_inodes for the fd returned by fsopen() and fspick(). This
requires making it unconditional.
(*) Fixed lots of bugs. Especial thanks to Al and Eric Biggers for
finding them and providing patches.
(*) Wrote manual pages, which I'll post separately.
ver #8:
(*) Changed the way fsmount() mounts into the namespace according to some
of Al's ideas.
(*) Put better typing on the fd cookie obtained from __fdget() & co.
(*) Stored the fd cookie in struct nameidata rather than the dfd number.
(*) Changed sys_fsmount() to return an O_PATH-style fd rather than
actually mounting into the mount namespace.
(*) Separated internal FMODE_* handling from O_* handling to free up
certain O_* flag numbers.
(*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.
(*) Added a new syscall, sys_move_mount(), to move a mount from a
dfd+path source to a dfd+path destination.
(*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
vfsmount attached to file->f_path needs 'unmounting' if set.
(*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.
[!] This doesn't work quite right.
(*) Added a new syscall, fsinfo(), to query information about a
filesystem. The idea being that this will, in future, work with the
fd from fsopen() too and permit querying of the parameters and
metadata before fsmount() is called.
ver #7:
(*) Undo an incorrect MS_* -> SB_* conversion.
(*) Pass the mount data buffer size to all the mount-related functions that
take the data pointer. This fixes a problem where someone (say SELinux)
tries to copy the mount data, assuming it to be a page in size, and
overruns the buffer - thereby incurring an oops by hitting a guard page.
(*) Made the AFS filesystem use them as an example. This is much easier to
deal with than NFS or Ext4 as there are very few mount options.
ver #6:
(*) Dropped the supplementary error string facility for the moment.
(*) Dropped the NFS patches for the moment.
(*) Dropped the reserved file descriptor argument from fsopen() and
replaced it with three reserved pointers that must be NULL.
ver #5:
(*) Renamed sb_config -> fs_context and adjusted variable names.
(*) Differentiated the flags in sb->s_flags (now named SB_*) from those
passed to mount(2) (named MS_*).
(*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
caller always provide a struct file_system_type pointer and the
parameters required.
(*) Got rid of vfs_submount_fc() in favour of passing
FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context(). The purpose is now
used more.
(*) Call ->validate() on the remount path.
(*) Got rid of the inode locking in sys_fsmount().
(*) Call security_sb_mountpoint() in the mount(2) path.
ver #4:
(*) Split the sb_config patch up somewhat.
(*) Made the supplementary error string facility something attached to the
task_struct rather than the sb_config so that error messages can be
obtained from NFS doing a mount-root-and-pathwalk inside the
nfs_get_tree() operation.
Further, made this managed and read by prctl rather than through the
mount fd so that it's more generally available.
ver #3:
(*) Rebased on 4.12-rc1.
(*) Split the NFS patch up somewhat.
ver #2:
(*) Removed the ->fill_super() from sb_config_operations and passed it in
directly to functions that want to call it. NFS now calls
nfs_fill_super() directly rather than jumping through a pointer to it
since there's only the one option at the moment.
(*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
proc_sb_config.
(*) Renamed create_super -> get_tree.
(*) Renamed struct mount_context to struct sb_config and amended various
variable names.
(*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
arguments.
ver #1:
(*) Split the sb_config stuff out into its own header.
(*) Support non-context aware filesystems through a special set of
sb_config operations.
(*) Stored the created superblock and root dentry into the sb_config after
creation rather than directly into a vfsmount. This allows some
arguments to be removed to various NFS functions.
(*) Added an explicit superblock-creation step. This allows a created
superblock to then be mounted multiple times.
(*) Added a flag to say that the sb_config is degraded and cannot be used
for another attempt at superblock creation, and got rid of the flag
that says it's already mounted.
Possible further developments:
(*) Implement sb reconfiguration (for now it returns ENOANO).
(*) Implement mount context support in more filesystems, ext4 being next
on my list.
(*) Move the walk-from-root stuff that nfs has to generic code so that you
can do something akin to:
mount /dev/sda1:/foo/bar /mnt
See nfs_follow_remote_path() and mount_subtree(). This is slightly
tricky in NFS as we have to prevent referral loops.
(*) Work out how to get at the error message incurred by submounts
encountered during nfs_follow_remote_path().
Should the error message be moved to task_struct and made more
general, perhaps retrieved with a prctl() function?
(*) Clean up/consolidate the security functions. Possibly add a
validation hook to be called at the same time as the mount context
validate op.
The patches can be found here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
tagged as:
mount-api-20180801
on branch:
mount-api
David
---
Al Viro (2):
vfs: syscall: Add open_tree(2) to reference or clone a mount
teach move_mount(2) to work with OPEN_TREE_CLONE
David Howells (31):
vfs: syscall: Add move_mount(2) to move mounts around
vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
vfs: Introduce the basic header for the new mount API's filesystem context
vfs: Introduce logging functions
vfs: Add configuration parser helpers
vfs: Add LSM hooks for the new mount API
selinux: Implement the new mount API LSM hooks
smack: Implement filesystem context security hooks
apparmor: Implement security hooks for the new mount API
tomoyo: Implement security hooks for the new mount API
vfs: Separate changing mount flags full remount
vfs: Implement a filesystem superblock creation/configuration context
vfs: Remove unused code after filesystem context changes
procfs: Move proc_fill_super() to fs/proc/root.c
proc: Add fs_context support to procfs
ipc: Convert mqueue fs to fs_context
cpuset: Use fs_context
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
hugetlbfs: Convert to fs_context
vfs: Remove kern_mount_data()
vfs: Provide documentation for new mount API
Make anon_inodes unconditional
vfs: syscall: Add fsopen() to prepare for superblock creation
vfs: Implement logging through fs_context
vfs: Add some logging to the core users of the fs_context log
vfs: syscall: Add fsconfig() for configuring and managing a context
vfs: syscall: Add fsmount() to create a mount for a superblock
vfs: syscall: Add fspick() to select a superblock for reconfiguration
afs: Add fs_context support
afs: Use fs_context to pass parameters over automount
vfs: Add a sample program for the new mount API
Documentation/filesystems/mount_api.txt | 706 ++++++++++++++++++++++++
arch/arc/kernel/setup.c | 1
arch/arm/kernel/atags_parse.c | 1
arch/arm/kvm/Kconfig | 1
arch/arm64/kvm/Kconfig | 1
arch/mips/kvm/Kconfig | 1
arch/powerpc/kvm/Kconfig | 1
arch/s390/kvm/Kconfig | 1
arch/sh/kernel/setup.c | 1
arch/sparc/kernel/setup_32.c | 1
arch/sparc/kernel/setup_64.c | 1
arch/x86/Kconfig | 1
arch/x86/entry/syscalls/syscall_32.tbl | 6
arch/x86/entry/syscalls/syscall_64.tbl | 6
arch/x86/kernel/cpu/intel_rdt.h | 15 +
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 184 ++++--
arch/x86/kernel/setup.c | 1
arch/x86/kvm/Kconfig | 1
drivers/base/Kconfig | 1
drivers/base/devtmpfs.c | 1
drivers/char/tpm/Kconfig | 1
drivers/dma-buf/Kconfig | 1
drivers/gpio/Kconfig | 1
drivers/iio/Kconfig | 1
drivers/infiniband/Kconfig | 1
drivers/vfio/Kconfig | 1
fs/Kconfig | 7
fs/Makefile | 5
fs/afs/internal.h | 9
fs/afs/mntpt.c | 148 +++--
fs/afs/super.c | 470 +++++++++-------
fs/afs/volume.c | 4
fs/f2fs/super.c | 2
fs/file_table.c | 9
fs/filesystems.c | 4
fs/fs_context.c | 779 +++++++++++++++++++++++++++
fs/fs_parser.c | 476 ++++++++++++++++
fs/fsopen.c | 491 +++++++++++++++++
fs/hugetlbfs/inode.c | 392 ++++++++------
fs/internal.h | 13
fs/kernfs/mount.c | 87 +--
fs/libfs.c | 19 +
fs/namei.c | 4
fs/namespace.c | 867 +++++++++++++++++++++++-------
fs/notify/fanotify/Kconfig | 1
fs/notify/inotify/Kconfig | 1
fs/pnode.c | 1
fs/proc/inode.c | 51 --
fs/proc/internal.h | 6
fs/proc/root.c | 245 ++++++--
fs/super.c | 368 ++++++++++---
fs/sysfs/mount.c | 67 ++
include/linux/cgroup.h | 3
include/linux/fs.h | 21 +
include/linux/fs_context.h | 208 +++++++
include/linux/fs_parser.h | 116 ++++
include/linux/kernfs.h | 39 +
include/linux/lsm_hooks.h | 70 ++
include/linux/module.h | 6
include/linux/mount.h | 5
include/linux/security.h | 61 ++
include/linux/syscalls.h | 9
include/uapi/linux/fcntl.h | 2
include/uapi/linux/fs.h | 82 +--
include/uapi/linux/mount.h | 75 +++
init/Kconfig | 10
init/do_mounts.c | 1
init/do_mounts_initrd.c | 1
ipc/mqueue.c | 121 +++-
kernel/cgroup/cgroup-internal.h | 50 +-
kernel/cgroup/cgroup-v1.c | 347 +++++++-----
kernel/cgroup/cgroup.c | 256 ++++++---
kernel/cgroup/cpuset.c | 68 ++
samples/Kconfig | 6
samples/Makefile | 2
samples/mount_api/Makefile | 7
samples/mount_api/test-fsmount.c | 118 ++++
security/apparmor/include/mount.h | 11
security/apparmor/lsm.c | 108 ++++
security/apparmor/mount.c | 47 ++
security/security.c | 51 ++
security/selinux/hooks.c | 311 ++++++++++-
security/smack/smack.h | 11
security/smack/smack_lsm.c | 370 ++++++++++++-
security/tomoyo/common.h | 3
security/tomoyo/mount.c | 46 ++
security/tomoyo/tomoyo.c | 15 +
87 files changed, 6637 insertions(+), 1484 deletions(-)
create mode 100644 Documentation/filesystems/mount_api.txt
create mode 100644 fs/fs_context.c
create mode 100644 fs/fs_parser.c
create mode 100644 fs/fsopen.c
create mode 100644 include/linux/fs_context.h
create mode 100644 include/linux/fs_parser.h
create mode 100644 include/uapi/linux/mount.h
create mode 100644 samples/mount_api/Makefile
create mode 100644 samples/mount_api/test-fsmount.c
* [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
@ 2018-08-01 15:24 ` David Howells
2018-08-02 17:31 ` Alan Jenkins
2018-08-02 21:51 ` David Howells
2018-08-01 15:24 ` [PATCH 02/33] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
` (7 subsequent siblings)
8 siblings, 2 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:24 UTC
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
From: Al Viro <viro@zeniv.linux.org.uk>
open_tree(dfd, pathname, flags)
Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken. The detached tree will be
dissolved on the final close of the obtained file. Creation of such
detached trees requires the same capabilities as doing mount --bind.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/file_table.c | 9 +-
fs/internal.h | 1
fs/namespace.c | 132 +++++++++++++++++++++++++++-----
include/linux/fs.h | 3 +
include/linux/syscalls.h | 1
include/uapi/linux/fcntl.h | 2
include/uapi/linux/mount.h | 10 ++
9 files changed, 135 insertions(+), 25 deletions(-)
create mode 100644 include/uapi/linux/mount.h
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..ea1b413afd47 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,4 @@
384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
+387 i386 open_tree sys_open_tree __ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..0545bed581dc 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
332 common statx __x64_sys_statx
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
+335 common open_tree __x64_sys_open_tree
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..7480271a0d21 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -189,6 +189,7 @@ static void __fput(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
+ fmode_t mode = file->f_mode;
might_sleep();
@@ -209,14 +210,14 @@ static void __fput(struct file *file)
file->f_op->release(inode, file);
security_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
- !(file->f_mode & FMODE_PATH))) {
+ !(mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
}
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
+ if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
i_readcount_dec(inode);
- if (file->f_mode & FMODE_WRITER) {
+ if (mode & FMODE_WRITER) {
put_write_access(inode);
__mnt_drop_write(mnt);
}
@@ -224,6 +225,8 @@ static void __fput(struct file *file)
file->f_path.mnt = NULL;
file->f_inode = NULL;
file_free(file);
+ if (unlikely(mode & FMODE_NEED_UNMOUNT))
+ dissolve_on_fput(mnt);
dput(dentry);
mntput(mnt);
}
diff --git a/fs/internal.h b/fs/internal.h
index 56533b08532e..383ee4724f77 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,6 +85,7 @@ extern void __mnt_drop_write(struct vfsmount *);
extern void __mnt_drop_write_file(struct file *);
extern void mnt_drop_write_file_path(struct file *);
+extern void dissolve_on_fput(struct vfsmount *);
/*
* fs_struct.c
*/
diff --git a/fs/namespace.c b/fs/namespace.c
index 03cc3b5bcf00..a4a01ecbcacd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -20,12 +20,14 @@
#include <linux/init.h> /* init_rootfs */
#include <linux/fs_struct.h> /* get_fs_root et.al. */
#include <linux/fsnotify.h> /* fsnotify_vfsmount_delete */
+#include <linux/file.h>
#include <linux/uaccess.h>
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/bootmem.h>
#include <linux/task_work.h>
#include <linux/sched/task.h>
+#include <uapi/linux/mount.h>
#include "pnode.h"
#include "internal.h"
@@ -1840,6 +1842,16 @@ struct vfsmount *collect_mounts(const struct path *path)
return &tree->mnt;
}
+void dissolve_on_fput(struct vfsmount *mnt)
+{
+ namespace_lock();
+ lock_mount_hash();
+ mntget(mnt);
+ umount_tree(real_mount(mnt), UMOUNT_SYNC);
+ unlock_mount_hash();
+ namespace_unlock();
+}
+
void drop_collected_mounts(struct vfsmount *mnt)
{
namespace_lock();
@@ -2199,6 +2211,30 @@ static bool has_locked_children(struct mount *mnt, struct dentry *dentry)
return false;
}
+static struct mount *__do_loopback(struct path *old_path, int recurse)
+{
+ struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
+
+ if (IS_MNT_UNBINDABLE(old))
+ return mnt;
+
+ if (!check_mnt(old) && old_path->dentry->d_op != &ns_dentry_operations)
+ return mnt;
+
+ if (!recurse && has_locked_children(old, old_path->dentry))
+ return mnt;
+
+ if (recurse)
+ mnt = copy_tree(old, old_path->dentry, CL_COPY_MNT_NS_FILE);
+ else
+ mnt = clone_mnt(old, old_path->dentry, 0);
+
+ if (!IS_ERR(mnt))
+ mnt->mnt.mnt_flags &= ~MNT_LOCKED;
+
+ return mnt;
+}
+
/*
* do loopback mount.
*/
@@ -2206,7 +2242,7 @@ static int do_loopback(struct path *path, const char *old_name,
int recurse)
{
struct path old_path;
- struct mount *mnt = NULL, *old, *parent;
+ struct mount *mnt = NULL, *parent;
struct mountpoint *mp;
int err;
if (!old_name || !*old_name)
@@ -2220,38 +2256,21 @@ static int do_loopback(struct path *path, const char *old_name,
goto out;
mp = lock_mount(path);
- err = PTR_ERR(mp);
- if (IS_ERR(mp))
+ if (IS_ERR(mp)) {
+ err = PTR_ERR(mp);
goto out;
+ }
- old = real_mount(old_path.mnt);
parent = real_mount(path->mnt);
-
- err = -EINVAL;
- if (IS_MNT_UNBINDABLE(old))
- goto out2;
-
if (!check_mnt(parent))
goto out2;
- if (!check_mnt(old) && old_path.dentry->d_op != &ns_dentry_operations)
- goto out2;
-
- if (!recurse && has_locked_children(old, old_path.dentry))
- goto out2;
-
- if (recurse)
- mnt = copy_tree(old, old_path.dentry, CL_COPY_MNT_NS_FILE);
- else
- mnt = clone_mnt(old, old_path.dentry, 0);
-
+ mnt = __do_loopback(&old_path, recurse);
if (IS_ERR(mnt)) {
err = PTR_ERR(mnt);
goto out2;
}
- mnt->mnt.mnt_flags &= ~MNT_LOCKED;
-
err = graft_tree(mnt, parent, mp);
if (err) {
lock_mount_hash();
@@ -2265,6 +2284,75 @@ static int do_loopback(struct path *path, const char *old_name,
return err;
}
+SYSCALL_DEFINE3(open_tree, int, dfd, const char *, filename, unsigned, flags)
+{
+ struct file *file;
+ struct path path;
+ int lookup_flags = LOOKUP_AUTOMOUNT | LOOKUP_FOLLOW;
+ bool detached = flags & OPEN_TREE_CLONE;
+ int error;
+ int fd;
+
+ BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+
+ if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
+ AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
+ OPEN_TREE_CLOEXEC))
+ return -EINVAL;
+
+ if ((flags & (AT_RECURSIVE | OPEN_TREE_CLONE)) == AT_RECURSIVE)
+ return -EINVAL;
+
+ if (flags & AT_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & AT_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & AT_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+
+ if (detached && !may_mount())
+ return -EPERM;
+
+ fd = get_unused_fd_flags(flags & O_CLOEXEC);
+ if (fd < 0)
+ return fd;
+
+ error = user_path_at(dfd, filename, lookup_flags, &path);
+ if (error)
+ goto out;
+
+ if (detached) {
+ struct mount *mnt = __do_loopback(&path, flags & AT_RECURSIVE);
+ if (IS_ERR(mnt)) {
+ error = PTR_ERR(mnt);
+ goto out2;
+ }
+ mntput(path.mnt);
+ path.mnt = &mnt->mnt;
+ }
+
+ file = dentry_open(&path, O_PATH, current_cred());
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
+ goto out3;
+ }
+
+ if (detached)
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+ path_put(&path);
+ fd_install(fd, file);
+ return fd;
+
+out3:
+ if (detached)
+ dissolve_on_fput(path.mnt);
+out2:
+ path_put(&path);
+out:
+ put_unused_fd(fd);
+ return error;
+}
+
static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
{
int error = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3a18cddb74e..067f0e31aec7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -154,6 +154,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+/* File represents mount that needs unmounting */
+#define FMODE_NEED_UNMOUNT ((__force fmode_t)0x10000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 73810808cdf2..3cc6b8f8bd2f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -900,6 +900,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..594b85f7cb86 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -90,5 +90,7 @@
#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */
+#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
+
#endif /* _UAPI_LINUX_FCNTL_H */
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
new file mode 100644
index 000000000000..e8db2911adca
--- /dev/null
+++ b/include/uapi/linux/mount.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MOUNT_H
+#define _UAPI_LINUX_MOUNT_H
+
+/*
+ * open_tree() flags.
+ */
+#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
+#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
+
+#endif /* _UAPI_LINUX_MOUNT_H */
* [PATCH 02/33] vfs: syscall: Add move_mount(2) to move mounts around [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
2018-08-01 15:24 ` [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-08-01 15:24 ` David Howells
2018-08-01 15:26 ` [PATCH 25/33] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
` (6 subsequent siblings)
8 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:24 UTC
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow an unattached mount tree to be attached.
The new system call looks like the following:
int move_mount(int from_dfd, const char *from_path,
int to_dfd, const char *to_path,
unsigned int flags);
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 102 ++++++++++++++++++++++++++------
include/linux/lsm_hooks.h | 6 ++
include/linux/security.h | 7 ++
include/linux/syscalls.h | 3 +
include/uapi/linux/mount.h | 11 +++
security/security.c | 5 ++
8 files changed, 118 insertions(+), 18 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ea1b413afd47..76d092b7d1b0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -399,3 +399,4 @@
385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents
386 i386 rseq sys_rseq __ia32_sys_rseq
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
+388 i386 move_mount sys_move_mount __ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0545bed581dc..37ba4e65eee6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -344,6 +344,7 @@
333 common io_pgetevents __x64_sys_io_pgetevents
334 common rseq __x64_sys_rseq
335 common open_tree __x64_sys_open_tree
+336 common move_mount __x64_sys_move_mount
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index a4a01ecbcacd..e2934a4f342b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2447,43 +2447,37 @@ static inline int tree_contains_unbindable(struct mount *mnt)
return 0;
}
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
{
- struct path old_path, parent_path;
+ struct path parent_path = {.mnt = NULL, .dentry = NULL};
struct mount *p;
struct mount *old;
struct mountpoint *mp;
int err;
- if (!old_name || !*old_name)
- return -EINVAL;
- err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
- if (err)
- return err;
- mp = lock_mount(path);
+ mp = lock_mount(new_path);
err = PTR_ERR(mp);
if (IS_ERR(mp))
goto out;
- old = real_mount(old_path.mnt);
- p = real_mount(path->mnt);
+ old = real_mount(old_path->mnt);
+ p = real_mount(new_path->mnt);
err = -EINVAL;
if (!check_mnt(p) || !check_mnt(old))
goto out1;
- if (old->mnt.mnt_flags & MNT_LOCKED)
+ if (!mnt_has_parent(old))
goto out1;
- err = -EINVAL;
- if (old_path.dentry != old_path.mnt->mnt_root)
+ if (old->mnt.mnt_flags & MNT_LOCKED)
goto out1;
- if (!mnt_has_parent(old))
+ if (old_path->dentry != old_path->mnt->mnt_root)
goto out1;
- if (d_is_dir(path->dentry) !=
- d_is_dir(old_path.dentry))
+ if (d_is_dir(new_path->dentry) !=
+ d_is_dir(old_path->dentry))
goto out1;
/*
* Don't move a mount residing in a shared parent.
@@ -2501,7 +2495,8 @@ static int do_move_mount(struct path *path, const char *old_name)
if (p == old)
goto out1;
- err = attach_recursive_mnt(old, real_mount(path->mnt), mp, &parent_path);
+ err = attach_recursive_mnt(old, real_mount(new_path->mnt), mp,
+ &parent_path);
if (err)
goto out1;
@@ -2513,6 +2508,22 @@ static int do_move_mount(struct path *path, const char *old_name)
out:
if (!err)
path_put(&parent_path);
+ return err;
+}
+
+static int do_move_mount_old(struct path *path, const char *old_name)
+{
+ struct path old_path;
+ int err;
+
+ if (!old_name || !*old_name)
+ return -EINVAL;
+
+ err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
+ if (err)
+ return err;
+
+ err = do_move_mount(&old_path, path);
path_put(&old_path);
return err;
}
@@ -2934,7 +2945,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
else if (flags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
retval = do_change_type(&path, flags);
else if (flags & MS_MOVE)
- retval = do_move_mount(&path, dev_name);
+ retval = do_move_mount_old(&path, dev_name);
else
retval = do_new_mount(&path, type_page, sb_flags, mnt_flags,
dev_name, data_page, data_size);
@@ -3169,6 +3180,61 @@ SYSCALL_DEFINE5(mount, char __user *, dev_name, char __user *, dir_name,
return ksys_mount(dev_name, dir_name, type, flags, data);
}
+/*
+ * Move a mount from one place to another.
+ *
+ * Note the flags value is a combination of MOVE_MOUNT_* flags.
+ */
+SYSCALL_DEFINE5(move_mount,
+ int, from_dfd, const char *, from_pathname,
+ int, to_dfd, const char *, to_pathname,
+ unsigned int, flags)
+{
+ struct path from_path, to_path;
+ unsigned int lflags;
+ int ret = 0;
+
+ if (!may_mount())
+ return -EPERM;
+
+ if (flags & ~MOVE_MOUNT__MASK)
+ return -EINVAL;
+
+ /* If someone gives a pathname, they aren't permitted to move
+ * from an fd that requires unmount as we can't get at the flag
+ * to clear it afterwards.
+ */
+ lflags = 0;
+ if (flags & MOVE_MOUNT_F_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_F_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_F_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(from_dfd, from_pathname, lflags, &from_path);
+ if (ret < 0)
+ return ret;
+
+ lflags = 0;
+ if (flags & MOVE_MOUNT_T_SYMLINKS) lflags |= LOOKUP_FOLLOW;
+ if (flags & MOVE_MOUNT_T_AUTOMOUNTS) lflags |= LOOKUP_AUTOMOUNT;
+ if (flags & MOVE_MOUNT_T_EMPTY_PATH) lflags |= LOOKUP_EMPTY;
+
+ ret = user_path_at(to_dfd, to_pathname, lflags, &to_path);
+ if (ret < 0)
+ goto out_from;
+
+ ret = security_move_mount(&from_path, &to_path);
+ if (ret < 0)
+ goto out_to;
+
+ ret = do_move_mount(&from_path, &to_path);
+
+out_to:
+ path_put(&to_path);
+out_from:
+ path_put(&from_path);
+ return ret;
+}
+
/*
* Return true if path is reachable from root
*
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index b43bbc893074..924424e7be8f 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -147,6 +147,10 @@
* Parse a string of security data filling in the opts structure
* @options string containing all mount options known by the LSM
* @opts binary data structure usable by the LSM
+ * @move_mount:
+ * Check permission before a mount is moved.
+ * @from_path indicates the mount that is going to be moved.
+ * @to_path indicates the mountpoint that will be mounted upon.
* @dentry_init_security:
* Compute a context for a dentry as the inode is not yet available
* since NFSv4 has no label backed by an EA anyway.
@@ -1480,6 +1484,7 @@ union security_list_options {
unsigned long kern_flags,
unsigned long *set_kern_flags);
int (*sb_parse_opts_str)(char *options, struct security_mnt_opts *opts);
+ int (*move_mount)(const struct path *from_path, const struct path *to_path);
int (*dentry_init_security)(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -1811,6 +1816,7 @@ struct security_hook_heads {
struct hlist_head sb_set_mnt_opts;
struct hlist_head sb_clone_mnt_opts;
struct hlist_head sb_parse_opts_str;
+ struct hlist_head move_mount;
struct hlist_head dentry_init_security;
struct hlist_head dentry_create_files_as;
#ifdef CONFIG_SECURITY_PATH
diff --git a/include/linux/security.h b/include/linux/security.h
index 1498b9e0539b..9bb5bc6d596c 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -245,6 +245,7 @@ int security_sb_clone_mnt_opts(const struct super_block *oldsb,
unsigned long kern_flags,
unsigned long *set_kern_flags);
int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts);
+int security_move_mount(const struct path *from_path, const struct path *to_path);
int security_dentry_init_security(struct dentry *dentry, int mode,
const struct qstr *name, void **ctx,
u32 *ctxlen);
@@ -599,6 +600,12 @@ static inline int security_sb_parse_opts_str(char *options, struct security_mnt_
return 0;
}
+static inline int security_move_mount(const struct path *from_path,
+ const struct path *to_path)
+{
+ return 0;
+}
+
static inline int security_inode_alloc(struct inode *inode)
{
return 0;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3cc6b8f8bd2f..3c0855d9b105 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -901,6 +901,9 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
+asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
+ int to_dfd, const char __user *to_path,
+ unsigned int ms_flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index e8db2911adca..89adf0d731ab 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -7,4 +7,15 @@
#define OPEN_TREE_CLONE 1 /* Clone the target tree and attach the clone */
#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
+/*
+ * move_mount() flags.
+ */
+#define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */
+#define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */
+#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
+#define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */
+#define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */
+#define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */
+#define MOVE_MOUNT__MASK 0x00000077
+
#endif /* _UAPI_LINUX_MOUNT_H */
diff --git a/security/security.c b/security/security.c
index 7cafc1c90d16..5149c2cbe8a7 100644
--- a/security/security.c
+++ b/security/security.c
@@ -439,6 +439,11 @@ int security_sb_parse_opts_str(char *options, struct security_mnt_opts *opts)
}
EXPORT_SYMBOL(security_sb_parse_opts_str);
+int security_move_mount(const struct path *from_path, const struct path *to_path)
+{
+ return call_int_hook(move_mount, 0, from_path, to_path);
+}
+
int security_inode_alloc(struct inode *inode)
{
inode->i_security = NULL;
* [PATCH 25/33] vfs: syscall: Add fsopen() to prepare for superblock creation [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
2018-08-01 15:24 ` [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
2018-08-01 15:24 ` [PATCH 02/33] vfs: syscall: Add move_mount(2) to move mounts around " David Howells
@ 2018-08-01 15:26 ` David Howells
2018-08-01 15:27 ` [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
` (5 subsequent siblings)
8 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:26 UTC (permalink / raw)
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
Provide an fsopen() system call that starts the process of preparing to
create a superblock that will then be mountable, using an fd as a context
handle. fsopen() is given the name of the filesystem that will be used:
int sfd = fsopen(const char *fsname, unsigned int flags);
where flags can be 0 or FSOPEN_CLOEXEC.
For example:
sfd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(sfd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
fsinfo(sfd, NULL, ...); // query new superblock attributes
mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MS_RELATIME);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
sfd = fsopen("afs", 0);
fsconfig(sfd, FSCONFIG_SET_STRING, "source",
"#grand.central.org:root.cell", 0);
fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(sfd, 0, MS_NODEV);
move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
If an error is reported at any step, an error message may be available to be
read() back (ENODATA will be reported if there isn't an error available) in
the form:
"e <subsys>:<problem>"
"e SELinux:Mount on mountpoint not permitted"
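A userspace consumer might split such a message as in the sketch below. This
assumes only the "e <subsys>:<problem>" layout shown above; struct sk_fs_msg
and sk_parse_fs_msg() are hypothetical names, and the 32-byte limit on the
subsystem name is an arbitrary choice for the example, not something the
patch specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One parsed message line (hypothetical type for this sketch). */
struct sk_fs_msg {
	char level;		/* first character, e.g. 'e' */
	char subsys[32];	/* text between the space and the colon */
	const char *problem;	/* points into the caller's buffer */
};

/*
 * Split "<level> <subsys>:<problem>" into its parts.
 * Returns 0 on success or -1 if the line does not have that shape.
 */
static int sk_parse_fs_msg(const char *line, struct sk_fs_msg *out)
{
	const char *colon;
	size_t len;

	assert(line != NULL && out != NULL);

	/* Need a one-character level followed by a single space. */
	if (line[0] == '\0' || line[1] != ' ')
		return -1;

	colon = strchr(line + 2, ':');
	if (!colon)
		return -1;

	len = (size_t)(colon - (line + 2));
	if (len == 0 || len >= sizeof(out->subsys))
		return -1;

	out->level = line[0];
	memcpy(out->subsys, line + 2, len);
	out->subsys[len] = '\0';
	out->problem = colon + 1;
	return 0;
}
```

The problem pointer aliases the input buffer, so the caller has to keep the
buffer it read() into alive for as long as the parsed message is in use.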
Once fsmount() has been called, further fsconfig() calls will incur EBUSY,
even if the fsmount() fails. read() is still possible to retrieve error
information.
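The EBUSY behaviour can be pictured as a small phase machine. The following is
a toy userspace model, not the kernel code: the first three phases correspond
to FS_CONTEXT_CREATE_PARAMS, FS_CONTEXT_AWAITING_MOUNT and FS_CONTEXT_FAILED
from the fsconfig() patch in this series, whereas SK_MOUNT_ATTEMPTED and all
the sk_* helpers are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sk_phase {
	SK_CREATE_PARAMS,	/* cf. FS_CONTEXT_CREATE_PARAMS */
	SK_AWAITING_MOUNT,	/* cf. FS_CONTEXT_AWAITING_MOUNT */
	SK_FAILED,		/* cf. FS_CONTEXT_FAILED */
	SK_MOUNT_ATTEMPTED,	/* hypothetical: fsmount() has been called */
};

struct sk_ctx {
	enum sk_phase phase;
};

/* Parameter setting is only accepted while parameters are being loaded. */
static int sk_set_param(const struct sk_ctx *c)
{
	assert(c != NULL);
	return c->phase == SK_CREATE_PARAMS ? 0 : -EBUSY;
}

/*
 * Model of the create command: get_tree_result stands in for the outcome
 * of superblock creation (0 or a negative errno).
 */
static int sk_cmd_create(struct sk_ctx *c, int get_tree_result)
{
	assert(c != NULL);
	if (c->phase != SK_CREATE_PARAMS)
		return -EBUSY;
	c->phase = get_tree_result == 0 ? SK_AWAITING_MOUNT : SK_FAILED;
	return get_tree_result;
}

/* Record that fsmount() was called, whether or not it succeeded. */
static void sk_note_fsmount(struct sk_ctx *c)
{
	assert(c != NULL);
	c->phase = SK_MOUNT_ATTEMPTED;
}
```

In this model a context never returns to SK_CREATE_PARAMS, which is why both
a failed create and an attempted fsmount() leave later configuration calls
with EBUSY.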
The fsopen() syscall creates a mount context and hangs it off the fd that it
returns.
Netlink is not used because it is optional and would make the core VFS
dependent on the networking layer and also potentially add network
namespace issues.
Note that, for the moment, the caller must have CAP_SYS_ADMIN to use
fsopen().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/Makefile | 2 -
fs/fs_context.c | 4 +
fs/fsopen.c | 87 ++++++++++++++++++++++++++++++++
include/linux/fs_context.h | 4 +
include/linux/syscalls.h | 1
include/uapi/linux/fs.h | 5 ++
8 files changed, 104 insertions(+), 1 deletion(-)
create mode 100644 fs/fsopen.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 76d092b7d1b0..1647fefd2969 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
386 i386 rseq sys_rseq __ia32_sys_rseq
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
+389 i386 fsopen sys_fsopen __ia32_sys_fsopen
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37ba4e65eee6..235d33dbccb2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
334 common rseq __x64_sys_rseq
335 common open_tree __x64_sys_open_tree
336 common move_mount __x64_sys_move_mount
+337 common fsopen __x64_sys_fsopen
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index ae681523b4b1..e3ea8093b178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -13,7 +13,7 @@ obj-y := open.o read_write.o file_table.o super.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o splice.o sync.o utimes.o d_path.o \
stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
- fs_context.o fs_parser.o
+ fs_context.o fs_parser.o fsopen.o
ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o block_dev.o direct-io.o mpage.o
diff --git a/fs/fs_context.c b/fs/fs_context.c
index 8f040a20b320..90db81d7008c 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -263,6 +263,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
fc->fs_type = get_filesystem(fs_type);
fc->cred = get_current_cred();
+ mutex_init(&fc->uapi_mutex);
+
switch (purpose) {
case FS_CONTEXT_FOR_KERNEL_MOUNT:
fc->sb_flags |= SB_KERNMOUNT;
@@ -348,6 +350,8 @@ struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc)
if (!fc)
return ERR_PTR(-ENOMEM);
+ mutex_init(&fc->uapi_mutex);
+
fc->fs_private = NULL;
fc->s_fs_info = NULL;
fc->source = NULL;
diff --git a/fs/fsopen.c b/fs/fsopen.c
new file mode 100644
index 000000000000..f30080e1ebc4
--- /dev/null
+++ b/fs/fsopen.c
@@ -0,0 +1,87 @@
+/* Filesystem access-by-fd.
+ *
+ * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs_context.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/anon_inodes.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include "mount.h"
+
+static int fscontext_release(struct inode *inode, struct file *file)
+{
+ struct fs_context *fc = file->private_data;
+
+ if (fc) {
+ file->private_data = NULL;
+ put_fs_context(fc);
+ }
+ return 0;
+}
+
+const struct file_operations fscontext_fops = {
+ .release = fscontext_release,
+ .llseek = no_llseek,
+};
+
+/*
+ * Attach a filesystem context to a file and an fd.
+ */
+static int fscontext_create_fd(struct fs_context *fc, unsigned int o_flags)
+{
+ int fd;
+
+ fd = anon_inode_getfd("fscontext", &fscontext_fops, fc,
+ O_RDWR | o_flags);
+ if (fd < 0)
+ put_fs_context(fc);
+ return fd;
+}
+
+/*
+ * Open a filesystem by name so that it can be configured for mounting.
+ *
+ * We are allowed to specify a container in which the filesystem will be
+ * opened, thereby indicating which namespaces will be used (notably, which
+ * network namespace will be used for network filesystems).
+ */
+SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
+{
+ struct file_system_type *fs_type;
+ struct fs_context *fc;
+ const char *fs_name;
+
+ if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (flags & ~FSOPEN_CLOEXEC)
+ return -EINVAL;
+
+ fs_name = strndup_user(_fs_name, PAGE_SIZE);
+ if (IS_ERR(fs_name))
+ return PTR_ERR(fs_name);
+
+ fs_type = get_fs_type(fs_name);
+ kfree(fs_name);
+ if (!fs_type)
+ return -ENODEV;
+
+ fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ put_filesystem(fs_type);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ fc->phase = FS_CONTEXT_CREATE_PARAMS;
+ return fscontext_create_fd(fc, flags & FSOPEN_CLOEXEC ? O_CLOEXEC : 0);
+}
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index bbb8114f2fdc..5445889f705b 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -14,6 +14,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
+#include <linux/mutex.h>
struct cred;
struct dentry;
@@ -87,6 +88,7 @@ struct fs_parameter {
*/
struct fs_context {
const struct fs_context_operations *ops;
+ struct mutex uapi_mutex; /* Userspace access mutex */
struct file_system_type *fs_type;
void *fs_private; /* The filesystem's context */
struct dentry *root; /* The root and superblock */
@@ -143,6 +145,8 @@ extern int vfs_get_super(struct fs_context *fc,
int (*fill_super)(struct super_block *sb,
struct fs_context *fc));
+extern const struct file_operations fscontext_fops;
+
#define logfc(FC, FMT, ...) pr_notice(FMT, ## __VA_ARGS__)
/**
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3c0855d9b105..ad6c7ff33c01 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -904,6 +904,7 @@ asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
+asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 1c982eb44ff4..f8818e6cddd6 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -344,4 +344,9 @@ typedef int __bitwise __kernel_rwf_t;
#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
RWF_APPEND)
+/*
+ * Flags for fsopen() and co.
+ */
+#define FSOPEN_CLOEXEC 0x00000001
+
#endif /* _UAPI_LINUX_FS_H */
* [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (2 preceding siblings ...)
2018-08-01 15:26 ` [PATCH 25/33] vfs: syscall: Add fsopen() to prepare for superblock creation " David Howells
@ 2018-08-01 15:27 ` David Howells
2018-08-06 17:28 ` Eric W. Biederman
` (2 more replies)
2018-08-01 15:27 ` [PATCH 29/33] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
` (4 subsequent siblings)
8 siblings, 3 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:27 UTC (permalink / raw)
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
Add a syscall for configuring a filesystem creation context and triggering
actions upon it, to be used in conjunction with fsopen, fspick and fsmount.
long fsconfig(int fs_fd, unsigned int cmd, const char *key,
const void *value, int aux);
Where fs_fd indicates the context, cmd indicates the action to take, key
indicates the parameter name for parameter-setting actions and, if needed,
value points to a buffer containing the value and aux can give more
information for the value.
The following command IDs are proposed:
(*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be
boolean in nature. The key may be prefixed with "no" to invert the
setting. value must be NULL and aux must be 0.
(*) FSCONFIG_SET_STRING: A string value is specified. The parameter can
be expecting boolean, integer, string or take a path. A conversion to
an appropriate type will be attempted (which may include looking up as
a path). value points to a NUL-terminated string and aux must be 0.
(*) FSCONFIG_SET_BINARY: A binary blob is specified. value points to
the blob and aux indicates its size. The parameter must be expecting
a blob.
(*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must
be expecting a path object. value points to a NUL-terminated string
that is the path and aux is a file descriptor at which to start a
relative lookup or AT_FDCWD.
(*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
implied.
(*) FSCONFIG_SET_FD: An open file descriptor is specified. value must
be NULL and aux indicates the file descriptor.
(*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
(*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
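The per-command argument rules above can be condensed into a single check. The
sketch below restates the validation switch from the syscall body in this
patch as a standalone userspace function; the SK_-prefixed enum merely mirrors
the proposed fsconfig_command values, sk_check_fsconfig_args() is a
hypothetical name, and the 1MiB blob limit is taken from the patch's code
rather than from the description.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifndef AT_FDCWD
#define AT_FDCWD -100	/* fallback; normally provided by <fcntl.h> */
#endif

/* Mirrors the command IDs proposed in this patch. */
enum sk_fsconfig_command {
	SK_FSCONFIG_SET_FLAG		= 0,
	SK_FSCONFIG_SET_STRING		= 1,
	SK_FSCONFIG_SET_BINARY		= 2,
	SK_FSCONFIG_SET_PATH		= 3,
	SK_FSCONFIG_SET_PATH_EMPTY	= 4,
	SK_FSCONFIG_SET_FD		= 5,
	SK_FSCONFIG_CMD_CREATE		= 6,
	SK_FSCONFIG_CMD_RECONFIGURE	= 7,
};

/* Return 0 if (key, value, aux) are acceptable for cmd, else an errno. */
static int sk_check_fsconfig_args(unsigned int cmd, const char *key,
				  const void *value, int aux)
{
	assert(aux >= 0 || aux < 0);	/* aux is interpreted per command */

	switch (cmd) {
	case SK_FSCONFIG_SET_FLAG:
		return (key && !value && aux == 0) ? 0 : -EINVAL;
	case SK_FSCONFIG_SET_STRING:
		return (key && value && aux == 0) ? 0 : -EINVAL;
	case SK_FSCONFIG_SET_BINARY:
		/* aux is the blob size: 1 byte up to 1MiB */
		return (key && value && aux > 0 && aux <= 1024 * 1024) ?
			0 : -EINVAL;
	case SK_FSCONFIG_SET_PATH:
	case SK_FSCONFIG_SET_PATH_EMPTY:
		/* aux is a directory fd or AT_FDCWD */
		return (key && value && (aux == AT_FDCWD || aux >= 0)) ?
			0 : -EINVAL;
	case SK_FSCONFIG_SET_FD:
		/* aux is the fd being passed; value must be NULL */
		return (key && !value && aux >= 0) ? 0 : -EINVAL;
	case SK_FSCONFIG_CMD_CREATE:
	case SK_FSCONFIG_CMD_RECONFIGURE:
		return (!key && !value && aux == 0) ? 0 : -EINVAL;
	default:
		return -EOPNOTSUPP;
	}
}
```

For instance, passing an empty string rather than NULL as the value of
FSCONFIG_SET_FD would be rejected with EINVAL under these rules.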
For the "set" command IDs, the idea is that the file_system_type will point
to a list of parameters and the types of value that those parameters expect
to take. The core code can then do the parse and argument conversion and
then give the LSM and FS a cooked option or array of options to use.
Source specification is also done the same way, using special keys
"source", "source1", "source2", etc.
[!] Note that, for the moment, the key and value are just glued back
together and handed to the filesystem. Every filesystem that uses options
uses match_token() and co. to do this, and this will need to be changed -
but not all at once.
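To make the interim "gluing" concrete, here is a minimal userspace sketch. It
assumes the glued form is the traditional mount-option spelling, "key" for a
flag and "key=value" for a string; that exact format is an assumption on my
part, and sk_glue_option() is a hypothetical helper, not a function from this
series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Rebuild a legacy-style option from a key and an optional value.
 * Returns the length written (excluding the NUL), or -1 if the result
 * would not fit in buf.
 */
static int sk_glue_option(char *buf, size_t size,
			  const char *key, const char *value)
{
	int n;

	assert(buf != NULL && key != NULL);

	if (value)
		n = snprintf(buf, size, "%s=%s", key, value);
	else
		n = snprintf(buf, size, "%s", key);

	/* snprintf() reports the untruncated length; treat overflow as error */
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

A filesystem still using match_token() would then see the same text it would
have found in a mount(2) option string.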
Example usage:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_PATH, "source", "/dev/sda1", AT_FDCWD);
fsconfig(fd, FSCONFIG_SET_PATH_EMPTY, "journal_path", "", journal_fd);
fsconfig(fd, FSCONFIG_SET_FD, "journal_fd", NULL, journal_fd);
fsconfig(fd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noacl", NULL, 0);
fsconfig(fd, FSCONFIG_SET_STRING, "sb", "1", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "errors", "continue", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "data", "journal", 0);
fsconfig(fd, FSCONFIG_SET_STRING, "context", "unconfined_u:...", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("ext4", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("afs", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "#grand.central.org:root.cell", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
or:
fd = fsopen("jffs2", FSOPEN_CLOEXEC);
fsconfig(fd, FSCONFIG_SET_STRING, "source", "mtd0", 0);
fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
mfd = fsmount(fd, FSMOUNT_CLOEXEC, MS_NOEXEC);
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/fsopen.c | 279 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2
include/uapi/linux/fs.h | 14 ++
5 files changed, 297 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 1647fefd2969..f9970310c126 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
387 i386 open_tree sys_open_tree __ia32_sys_open_tree
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
+390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 235d33dbccb2..4185d36e03bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
335 common open_tree __x64_sys_open_tree
336 common move_mount __x64_sys_move_mount
337 common fsopen __x64_sys_fsopen
+338 common fsconfig __x64_sys_fsconfig
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 7a25b4c3bc18..5d8560e78ce1 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -10,6 +10,7 @@
*/
#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -17,6 +18,7 @@
#include <linux/anon_inodes.h>
#include <linux/namei.h>
#include <linux/file.h>
+#include "internal.h"
#include "mount.h"
/*
@@ -152,3 +154,280 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
put_fs_context(fc);
return ret;
}
+
+/*
+ * Check the state and apply the configuration. Note that this function is
+ * allowed to 'steal' the value by setting param->xxx to NULL before returning.
+ */
+static int vfs_fsconfig(struct fs_context *fc, struct fs_parameter *param)
+{
+ int ret;
+
+ /* We need to reinitialise the context if we have reconfiguration
+ * pending after creation or a previous reconfiguration.
+ */
+ if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+ fc->need_free = true;
+ } else {
+ /* Leave legacy context ops in place */
+ }
+
+ /* Do the security check last because ->init_fs_context may
+ * change the namespace subscriptions.
+ */
+ ret = security_fs_context_alloc(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+ }
+
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
+ fc->phase != FS_CONTEXT_RECONF_PARAMS)
+ return -EBUSY;
+
+ return vfs_parse_fs_param(fc, param);
+}
+
+/*
+ * Perform an action on a context.
+ */
+static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
+{
+ int ret = -EINVAL;
+
+ switch (cmd) {
+ case FSCONFIG_CMD_CREATE:
+ if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
+ return -EBUSY;
+ fc->phase = FS_CONTEXT_CREATING;
+ ret = vfs_get_tree(fc);
+ if (ret == 0)
+ fc->phase = FS_CONTEXT_AWAITING_MOUNT;
+ else
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+/**
+ * sys_fsconfig - Set parameters and trigger actions on a context
+ * @fd: The filesystem context to act upon
+ * @cmd: The action to take
+ * @_key: Where appropriate, the parameter key to set
+ * @_value: Where appropriate, the parameter value to set
+ * @aux: Additional information for the value
+ *
+ * This system call is used to set parameters on a context, including
+ * superblock settings, data source and security labelling.
+ *
+ * Actions include triggering the creation of a superblock and the
+ * reconfiguration of the superblock attached to the specified context.
+ *
+ * When setting a parameter, @cmd indicates the type of value being proposed
+ * and @_key indicates the parameter to be altered.
+ *
+ * @_value and @aux are used to specify the value, should a value be required:
+ *
+ * (*) FSCONFIG_SET_FLAG: No value is specified. The parameter must be boolean
+ * in nature. The key may be prefixed with "no" to invert the
+ * setting. @_value must be NULL and @aux must be 0.
+ *
+ * (*) FSCONFIG_SET_STRING: A string value is specified. The parameter can be
+ * expecting boolean, integer, string or take a path. A conversion to an
+ * appropriate type will be attempted (which may include looking up as a
+ * path). @_value points to a NUL-terminated string and @aux must be 0.
+ *
+ * (*) FSCONFIG_SET_BINARY: A binary blob is specified. @_value points to the
+ * blob and @aux indicates its size. The parameter must be expecting a
+ * blob.
+ *
+ * (*) FSCONFIG_SET_PATH: A non-empty path is specified. The parameter must be
+ * expecting a path object. @_value points to a NUL-terminated string that
+ * is the path and @aux is a file descriptor at which to start a relative
+ * lookup or AT_FDCWD.
+ *
+ * (*) FSCONFIG_SET_PATH_EMPTY: As FSCONFIG_SET_PATH, but with AT_EMPTY_PATH
+ * implied.
+ *
+ * (*) FSCONFIG_SET_FD: An open file descriptor is specified. @_value must be
+ * NULL and @aux indicates the file descriptor.
+ */
+SYSCALL_DEFINE5(fsconfig,
+ int, fd,
+ unsigned int, cmd,
+ const char __user *, _key,
+ const void __user *, _value,
+ int, aux)
+{
+ struct fs_context *fc;
+ struct fd f;
+ int ret;
+
+ struct fs_parameter param = {
+ .type = fs_value_is_undefined,
+ };
+
+ if (fd < 0)
+ return -EINVAL;
+
+ switch (cmd) {
+ case FSCONFIG_SET_FLAG:
+ if (!_key || _value || aux)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_STRING:
+ if (!_key || !_value || aux)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_BINARY:
+ if (!_key || !_value || aux <= 0 || aux > 1024 * 1024)
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ if (!_key || !_value || (aux != AT_FDCWD && aux < 0))
+ return -EINVAL;
+ break;
+ case FSCONFIG_SET_FD:
+ if (!_key || _value || aux < 0)
+ return -EINVAL;
+ break;
+ case FSCONFIG_CMD_CREATE:
+ case FSCONFIG_CMD_RECONFIGURE:
+ if (_key || _value || aux)
+ return -EINVAL;
+ break;
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fops)
+ goto out_f;
+
+ fc = f.file->private_data;
+ if (fc->ops == &legacy_fs_context_ops) {
+ switch (cmd) {
+ case FSCONFIG_SET_BINARY:
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ case FSCONFIG_SET_FD:
+ ret = -EOPNOTSUPP;
+ goto out_f;
+ }
+ }
+
+ if (_key) {
+ param.key = strndup_user(_key, 256);
+ if (IS_ERR(param.key)) {
+ ret = PTR_ERR(param.key);
+ goto out_f;
+ }
+ }
+
+ switch (cmd) {
+ case FSCONFIG_SET_STRING:
+ param.type = fs_value_is_string;
+ param.string = strndup_user(_value, 256);
+ if (IS_ERR(param.string)) {
+ ret = PTR_ERR(param.string);
+ goto out_key;
+ }
+ param.size = strlen(param.string);
+ break;
+ case FSCONFIG_SET_BINARY:
+ param.type = fs_value_is_blob;
+ param.size = aux;
+ param.blob = memdup_user_nul(_value, aux);
+ if (IS_ERR(param.blob)) {
+ ret = PTR_ERR(param.blob);
+ goto out_key;
+ }
+ break;
+ case FSCONFIG_SET_PATH:
+ param.type = fs_value_is_filename;
+ param.name = getname_flags(_value, 0, NULL);
+ if (IS_ERR(param.name)) {
+ ret = PTR_ERR(param.name);
+ goto out_key;
+ }
+ param.dirfd = aux;
+ param.size = strlen(param.name->name);
+ break;
+ case FSCONFIG_SET_PATH_EMPTY:
+ param.type = fs_value_is_filename_empty;
+ param.name = getname_flags(_value, LOOKUP_EMPTY, NULL);
+ if (IS_ERR(param.name)) {
+ ret = PTR_ERR(param.name);
+ goto out_key;
+ }
+ param.dirfd = aux;
+ param.size = strlen(param.name->name);
+ break;
+ case FSCONFIG_SET_FD:
+ param.type = fs_value_is_file;
+ ret = -EBADF;
+ param.file = fget(aux);
+ if (!param.file)
+ goto out_key;
+ break;
+ default:
+ break;
+ }
+
+ ret = mutex_lock_interruptible(&fc->uapi_mutex);
+ if (ret == 0) {
+ switch (cmd) {
+ case FSCONFIG_CMD_CREATE:
+ case FSCONFIG_CMD_RECONFIGURE:
+ ret = vfs_fsconfig_action(fc, cmd);
+ break;
+ default:
+ ret = vfs_fsconfig(fc, &param);
+ break;
+ }
+ mutex_unlock(&fc->uapi_mutex);
+ }
+
+ /* Clean up our record of any value that we obtained from
+ * userspace. Note that the value may have been stolen by the LSM or
+ * filesystem, in which case the value pointer will have been cleared.
+ */
+ switch (cmd) {
+ case FSCONFIG_SET_STRING:
+ case FSCONFIG_SET_BINARY:
+ kfree(param.string);
+ break;
+ case FSCONFIG_SET_PATH:
+ case FSCONFIG_SET_PATH_EMPTY:
+ if (param.name)
+ putname(param.name);
+ break;
+ case FSCONFIG_SET_FD:
+ if (param.file)
+ fput(param.file);
+ break;
+ default:
+ break;
+ }
+out_key:
+ kfree(param.key);
+out_f:
+ fdput(f);
+ return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ad6c7ff33c01..9628d14a7ede 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -905,6 +905,8 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
int to_dfd, const char __user *to_path,
unsigned int ms_flags);
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
+asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
+ const void __user *value, int aux);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f8818e6cddd6..fecbae30a30d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,4 +349,18 @@ typedef int __bitwise __kernel_rwf_t;
*/
#define FSOPEN_CLOEXEC 0x00000001
+/*
+ * The type of fsconfig() call made.
+ */
+enum fsconfig_command {
+ FSCONFIG_SET_FLAG = 0, /* Set parameter, supplying no value */
+ FSCONFIG_SET_STRING = 1, /* Set parameter, supplying a string value */
+ FSCONFIG_SET_BINARY = 2, /* Set parameter, supplying a binary blob value */
+ FSCONFIG_SET_PATH = 3, /* Set parameter, supplying an object by path */
+ FSCONFIG_SET_PATH_EMPTY = 4, /* Set parameter, supplying an object by (empty) path */
+ FSCONFIG_SET_FD = 5, /* Set parameter, supplying an object by fd */
+ FSCONFIG_CMD_CREATE = 6, /* Invoke superblock creation */
+ FSCONFIG_CMD_RECONFIGURE = 7, /* Invoke superblock reconfiguration */
+};
+
#endif /* _UAPI_LINUX_FS_H */
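The cleanup switch at the end of sys_fsconfig() in the patch above frees the
value unconditionally and relies on the LSM or filesystem clearing the pointer
if it took ownership. A minimal userspace sketch of that convention follows.
The names (struct sketch_param, sketch_consume(), sketch_set_string()) are
hypothetical and free() stands in for kfree(); this is not the kernel code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the parameter record; not the kernel struct. */
struct sketch_param {
	char *string;		/* value copied in from the caller */
};

/* Where the consumer keeps a value that it decided to take over. */
static char *sketch_kept;

/*
 * Consume a parameter.  If 'steal' is non-zero the consumer takes ownership
 * of the string and clears the pointer so that the caller's cleanup becomes
 * a no-op; otherwise the caller still owns the string.
 */
static void sketch_consume(struct sketch_param *p, int steal)
{
	if (steal) {
		free(sketch_kept);
		sketch_kept = p->string;
		p->string = NULL;
	}
}

/* Caller side: copy the value, hand it over, then clean up unconditionally. */
static int sketch_set_string(const char *value, int steal)
{
	struct sketch_param p;
	size_t len = strlen(value) + 1;

	p.string = malloc(len);
	if (!p.string)
		return -1;
	memcpy(p.string, value, len);

	sketch_consume(&p, steal);

	free(p.string);		/* free(NULL) is defined to do nothing */
	return 0;
}
```

The point of the convention is that the caller needs no out-of-band signal to
know whether the value was kept: a cleared pointer says so.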
* [PATCH 29/33] vfs: syscall: Add fsmount() to create a mount for a superblock [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (3 preceding siblings ...)
2018-08-01 15:27 ` [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
@ 2018-08-01 15:27 ` David Howells
2018-08-01 15:27 ` [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
` (3 subsequent siblings)
8 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-01 15:27 UTC (permalink / raw)
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
Provide a system call by which a filesystem opened with fsopen() and
configured by a series of fsconfig() calls can have a detached mount object
created for it. This mount object can then be attached to the VFS mount
hierarchy using move_mount() by passing the returned file descriptor as the
from directory fd.
The system call looks like:
int mfd = fsmount(int fsfd, unsigned int flags,
unsigned int ms_flags);
where fsfd is the file descriptor returned by fsopen(). flags can be 0 or
FSMOUNT_CLOEXEC. ms_flags is a bitwise-OR of the following flags:
MS_RDONLY
MS_NOSUID
MS_NODEV
MS_NOEXEC
MS_NOATIME
MS_NODIRATIME
MS_RELATIME
MS_STRICTATIME
MS_UNBINDABLE
MS_PRIVATE
MS_SLAVE
MS_SHARED
In the event that fsmount() fails, it may be possible to get an error
message by calling read() on fsfd. If no message is available, ENODATA
will be reported.
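As a reading aid, the ms_flags handling in the fsmount() code later in this
patch can be modelled in userspace as sketched here. The SK_MS_* values are
meant to match the MS_* values in the Linux UAPI headers; the SK_MNT_* values
are illustrative stand-ins for the kernel-internal MNT_* bits, and
sketch_ms_to_mnt() is a hypothetical name. Note that the flag check as posted
accepts only the first eight flags in the list above, so in this model the
propagation flags (MS_UNBINDABLE, MS_PRIVATE, MS_SLAVE, MS_SHARED) are
rejected with EINVAL.

```c
#include <errno.h>

/* MS_* values as found in the Linux UAPI headers. */
#define SK_MS_RDONLY		(1u << 0)
#define SK_MS_NOSUID		(1u << 1)
#define SK_MS_NODEV		(1u << 2)
#define SK_MS_NOEXEC		(1u << 3)
#define SK_MS_NOATIME		(1u << 10)
#define SK_MS_NODIRATIME	(1u << 11)
#define SK_MS_SHARED		(1u << 20)
#define SK_MS_RELATIME		(1u << 21)
#define SK_MS_STRICTATIME	(1u << 24)

/* Illustrative stand-ins for the kernel-internal MNT_* bits. */
#define SK_MNT_NOSUID		0x01u
#define SK_MNT_NODEV		0x02u
#define SK_MNT_NOEXEC		0x04u
#define SK_MNT_NOATIME		0x08u
#define SK_MNT_NODIRATIME	0x10u
#define SK_MNT_RELATIME		0x20u
#define SK_MNT_READONLY		0x40u

/*
 * Validate ms_flags and translate them to mount flags, following the order
 * of checks in the patch.  Returns 0 and fills in *mnt_flags on success or
 * -EINVAL on an unknown flag or a contradictory atime request.
 */
static int sketch_ms_to_mnt(unsigned int ms_flags, unsigned int *mnt_flags)
{
	unsigned int mnt = 0;

	if (ms_flags & ~(SK_MS_RDONLY | SK_MS_NOSUID | SK_MS_NODEV |
			 SK_MS_NOEXEC | SK_MS_NOATIME | SK_MS_NODIRATIME |
			 SK_MS_RELATIME | SK_MS_STRICTATIME))
		return -EINVAL;

	if (ms_flags & SK_MS_RDONLY)
		mnt |= SK_MNT_READONLY;
	if (ms_flags & SK_MS_NOSUID)
		mnt |= SK_MNT_NOSUID;
	if (ms_flags & SK_MS_NODEV)
		mnt |= SK_MNT_NODEV;
	if (ms_flags & SK_MS_NOEXEC)
		mnt |= SK_MNT_NOEXEC;
	if (ms_flags & SK_MS_NODIRATIME)
		mnt |= SK_MNT_NODIRATIME;

	/* relatime is the default unless strictatime or noatime is given. */
	if (ms_flags & SK_MS_STRICTATIME) {
		if (ms_flags & SK_MS_NOATIME)
			return -EINVAL;
	} else if (ms_flags & SK_MS_NOATIME) {
		mnt |= SK_MNT_NOATIME;
	} else {
		mnt |= SK_MNT_RELATIME;
	}

	*mnt_flags = mnt;
	return 0;
}
```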
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
fs/namespace.c | 141 +++++++++++++++++++++++++++++++-
include/linux/syscalls.h | 1
include/uapi/linux/fs.h | 2
5 files changed, 142 insertions(+), 4 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f9970310c126..c78b68256f8a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
388 i386 move_mount sys_move_mount __ia32_sys_move_mount
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
+391 i386 fsmount sys_fsmount __ia32_sys_fsmount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 4185d36e03bb..d44ead5d4368 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
336 common move_mount __x64_sys_move_mount
337 common fsopen __x64_sys_fsopen
338 common fsconfig __x64_sys_fsconfig
+339 common fsmount __x64_sys_fsmount
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index ea07066a2731..7e131b7fc098 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2503,7 +2503,7 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
attached = mnt_has_parent(old);
/*
- * We need to allow open_tree(OPEN_TREE_CLONE) followed by
+ * We need to allow open_tree(OPEN_TREE_CLONE) or fsmount() followed by
* move_mount(), but mustn't allow "/" to be moved.
*/
if (old->mnt_ns && !attached)
@@ -3348,9 +3348,142 @@ struct vfsmount *kern_mount(struct file_system_type *type)
EXPORT_SYMBOL_GPL(kern_mount);
/*
- * Move a mount from one place to another.
- * In combination with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be
- * used to copy a mount subtree.
+ * Create a kernel mount representation for a new, prepared superblock
+ * (specified by fs_fd) and attach to an open_tree-like file descriptor.
+ */
+SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags)
+{
+ struct fs_context *fc;
+ struct file *file;
+ struct path newmount;
+ struct fd f;
+ unsigned int mnt_flags = 0;
+ long ret;
+
+ if (!may_mount())
+ return -EPERM;
+
+ if ((flags & ~(FSMOUNT_CLOEXEC)) != 0)
+ return -EINVAL;
+
+ if (ms_flags & ~(MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
+ MS_STRICTATIME))
+ return -EINVAL;
+
+ if (ms_flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+ if (ms_flags & MS_NOSUID)
+ mnt_flags |= MNT_NOSUID;
+ if (ms_flags & MS_NODEV)
+ mnt_flags |= MNT_NODEV;
+ if (ms_flags & MS_NOEXEC)
+ mnt_flags |= MNT_NOEXEC;
+ if (ms_flags & MS_NODIRATIME)
+ mnt_flags |= MNT_NODIRATIME;
+
+ if (ms_flags & MS_STRICTATIME) {
+ if (ms_flags & MS_NOATIME)
+ return -EINVAL;
+ } else if (ms_flags & MS_NOATIME) {
+ mnt_flags |= MNT_NOATIME;
+ } else {
+ mnt_flags |= MNT_RELATIME;
+ }
+
+ f = fdget(fs_fd);
+ if (!f.file)
+ return -EBADF;
+
+ ret = -EINVAL;
+ if (f.file->f_op != &fscontext_fops)
+ goto err_fsfd;
+
+ fc = f.file->private_data;
+
+ /* There must be a valid superblock or we can't mount it */
+ ret = -EINVAL;
+ if (!fc->root)
+ goto err_fsfd;
+
+ ret = -EPERM;
+ if (mount_too_revealing(fc->root->d_sb, &mnt_flags)) {
+ pr_warn("VFS: Mount too revealing\n");
+ goto err_fsfd;
+ }
+
+ ret = mutex_lock_interruptible(&fc->uapi_mutex);
+ if (ret < 0)
+ goto err_fsfd;
+
+ ret = -EBUSY;
+ if (fc->phase != FS_CONTEXT_AWAITING_MOUNT)
+ goto err_unlock;
+
+ ret = -EPERM;
+ if ((fc->sb_flags & SB_MANDLOCK) && !may_mandlock())
+ goto err_unlock;
+
+ newmount.mnt = vfs_create_mount(fc, mnt_flags);
+ if (IS_ERR(newmount.mnt)) {
+ ret = PTR_ERR(newmount.mnt);
+ goto err_unlock;
+ }
+ newmount.dentry = dget(fc->root);
+
+ /* We've done the mount bit - now move the file context into more or
+ * less the same state as if we'd done an fspick(). We don't want to
+ * do any memory allocation or anything like that at this point as we
+ * don't want to have to handle any errors incurred.
+ */
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+ fc->need_free = false;
+ fc->fs_private = NULL;
+ fc->s_fs_info = NULL;
+ fc->sb_flags = 0;
+ fc->sloppy = false;
+ fc->silent = false;
+ security_fs_context_free(fc);
+ fc->security = NULL;
+ kfree(fc->subtype);
+ fc->subtype = NULL;
+ kfree(fc->source);
+ fc->source = NULL;
+
+ fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+ fc->phase = FS_CONTEXT_AWAITING_RECONF;
+
+ /* Attach to an apparent O_PATH fd with a note that we need to unmount
+ * it, not just simply put it.
+ */
+ file = dentry_open(&newmount, O_PATH, fc->cred);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err_path;
+ }
+ file->f_mode |= FMODE_NEED_UNMOUNT;
+
+ ret = get_unused_fd_flags((flags & FSMOUNT_CLOEXEC) ? O_CLOEXEC : 0);
+ if (ret >= 0)
+ fd_install(ret, file);
+ else
+ fput(file);
+
+err_path:
+ path_put(&newmount);
+err_unlock:
+ mutex_unlock(&fc->uapi_mutex);
+err_fsfd:
+ fdput(f);
+ return ret;
+}
+
+/*
+ * Move a mount from one place to another. In combination with
+ * fsopen()/fsmount() this is used to install a new mount and in combination
+ * with open_tree(OPEN_TREE_CLONE [| AT_RECURSIVE]) it can be used to copy
+ * a mount subtree.
*
* Note the flags value is a combination of MOVE_MOUNT_* flags.
*/
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 9628d14a7ede..65db661cc2da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -907,6 +907,7 @@ asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);
+asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index fecbae30a30d..10281d582e28 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -349,6 +349,8 @@ typedef int __bitwise __kernel_rwf_t;
*/
#define FSOPEN_CLOEXEC 0x00000001
+#define FSMOUNT_CLOEXEC 0x00000001
+
/*
* The type of fsconfig() call made.
*/
* [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #11]
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (4 preceding siblings ...)
2018-08-01 15:27 ` [PATCH 29/33] vfs: syscall: Add fsmount() to create a mount for a superblock " David Howells
@ 2018-08-01 15:27 ` David Howells
2018-08-24 14:51 ` Miklos Szeredi
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
` (2 subsequent siblings)
8 siblings, 1 reply; 70+ messages in thread
From: David Howells @ 2018-08-01 15:27 UTC (permalink / raw)
To: viro; +Cc: linux-api, torvalds, dhowells, linux-fsdevel, linux-kernel
Provide an fspick() system call that can be used to pick an existing
mountpoint into an fs_context which can thereafter be used to reconfigure a
superblock (equivalent of the superblock side of -o remount).
This looks like:
int fd = fspick(AT_FDCWD, "/mnt",
FSPICK_CLOEXEC | FSPICK_NO_AUTOMOUNT);
fsconfig(fd, FSCONFIG_SET_FLAG, "intr", NULL, 0);
fsconfig(fd, FSCONFIG_SET_FLAG, "noac", NULL, 0);
fsconfig(fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
At the point of fspick being called, the file descriptor referring to the
filesystem context is in exactly the same state as the one that was created
by fsopen() after fsmount() has been successfully called.
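The way fspick() turns its flags into path-lookup behaviour (see the code
later in this patch) can be modelled in userspace roughly as follows. The
SK_FSPICK_* values are the ones this patch adds to the UAPI header; the
SK_LOOKUP_* values are illustrative stand-ins for kernel-internal bits, and
sketch_fspick_lookup_flags() is a hypothetical name.

```c
#include <errno.h>

/* Flag values as added to include/uapi/linux/fs.h by this patch. */
#define SK_FSPICK_CLOEXEC		0x00000001u
#define SK_FSPICK_SYMLINK_NOFOLLOW	0x00000002u
#define SK_FSPICK_NO_AUTOMOUNT		0x00000004u
#define SK_FSPICK_EMPTY_PATH		0x00000008u

/* Illustrative stand-ins for the kernel-internal LOOKUP_* bits. */
#define SK_LOOKUP_FOLLOW	0x0001u
#define SK_LOOKUP_AUTOMOUNT	0x0004u
#define SK_LOOKUP_EMPTY		0x4000u

/*
 * Start from "follow symlinks and trigger automounts", then let each flag
 * switch one behaviour off (or, for EMPTY_PATH, on).  CLOEXEC only affects
 * the returned file descriptor, so it does not change the lookup flags.
 */
static int sketch_fspick_lookup_flags(unsigned int flags, unsigned int *lookup)
{
	unsigned int l = SK_LOOKUP_FOLLOW | SK_LOOKUP_AUTOMOUNT;

	if (flags & ~(SK_FSPICK_CLOEXEC | SK_FSPICK_SYMLINK_NOFOLLOW |
		      SK_FSPICK_NO_AUTOMOUNT | SK_FSPICK_EMPTY_PATH))
		return -EINVAL;

	if (flags & SK_FSPICK_SYMLINK_NOFOLLOW)
		l &= ~SK_LOOKUP_FOLLOW;
	if (flags & SK_FSPICK_NO_AUTOMOUNT)
		l &= ~SK_LOOKUP_AUTOMOUNT;
	if (flags & SK_FSPICK_EMPTY_PATH)
		l |= SK_LOOKUP_EMPTY;

	*lookup = l;
	return 0;
}
```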
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-api@vger.kernel.org
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/fsopen.c | 58 ++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 1 +
include/uapi/linux/fs.h | 5 +++
5 files changed, 66 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c78b68256f8a..d1eb6c815790 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -403,3 +403,4 @@
389 i386 fsopen sys_fsopen __ia32_sys_fsopen
390 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
391 i386 fsmount sys_fsmount __ia32_sys_fsmount
+392 i386 fspick sys_fspick __ia32_sys_fspick
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index d44ead5d4368..d3ab703c02bb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -348,6 +348,7 @@
337 common fsopen __x64_sys_fsopen
338 common fsconfig __x64_sys_fsconfig
339 common fsmount __x64_sys_fsmount
+340 common fspick __x64_sys_fspick
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/fsopen.c b/fs/fsopen.c
index 5d8560e78ce1..e79bb5b085d6 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -155,6 +155,64 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
return ret;
}
+/*
+ * Pick a superblock into a context for reconfiguration.
+ */
+SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags)
+{
+ struct fs_context *fc;
+ struct path target;
+ unsigned int lookup_flags;
+ int ret;
+
+ if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if ((flags & ~(FSPICK_CLOEXEC |
+ FSPICK_SYMLINK_NOFOLLOW |
+ FSPICK_NO_AUTOMOUNT |
+ FSPICK_EMPTY_PATH)) != 0)
+ return -EINVAL;
+
+ lookup_flags = LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_SYMLINK_NOFOLLOW)
+ lookup_flags &= ~LOOKUP_FOLLOW;
+ if (flags & FSPICK_NO_AUTOMOUNT)
+ lookup_flags &= ~LOOKUP_AUTOMOUNT;
+ if (flags & FSPICK_EMPTY_PATH)
+ lookup_flags |= LOOKUP_EMPTY;
+ ret = user_path_at(dfd, path, lookup_flags, &target);
+ if (ret < 0)
+ goto err;
+
+ ret = -EOPNOTSUPP;
+ if (!target.dentry->d_sb->s_op->reconfigure)
+ goto err_path;
+
+ fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
+ 0, FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc)) {
+ ret = PTR_ERR(fc);
+ goto err_path;
+ }
+
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+
+ ret = fscontext_alloc_log(fc);
+ if (ret < 0)
+ goto err_fc;
+
+ path_put(&target);
+ return fscontext_create_fd(fc, flags & FSPICK_CLOEXEC ? O_CLOEXEC : 0);
+
+err_fc:
+ put_fs_context(fc);
+err_path:
+ path_put(&target);
+err:
+ return ret;
+}
+
/*
* Check the state and apply the configuration. Note that this function is
* allowed to 'steal' the value by setting param->xxx to NULL before returning.
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 65db661cc2da..701522957a12 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -908,6 +908,7 @@ asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
const void __user *value, int aux);
asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
+asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 10281d582e28..7f01503a9e9b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;
#define FSMOUNT_CLOEXEC 0x00000001
+#define FSPICK_CLOEXEC 0x00000001
+#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
+#define FSPICK_NO_AUTOMOUNT 0x00000004
+#define FSPICK_EMPTY_PATH 0x00000008
+
/*
* The type of fsconfig() call made.
*/
* Re: [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]
2018-08-01 15:24 ` [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
@ 2018-08-02 17:31 ` Alan Jenkins
2018-08-02 21:29 ` Al Viro
2018-08-02 21:51 ` David Howells
1 sibling, 1 reply; 70+ messages in thread
From: Alan Jenkins @ 2018-08-02 17:31 UTC (permalink / raw)
To: David Howells, viro; +Cc: linux-api, torvalds, linux-fsdevel, linux-kernel
Hi
I found this interesting, though I don't entirely follow the kernel
mount/unmount code. I had one puzzle about the code, and two questions
which I was largely able to answer.
On 01/08/18 16:24, David Howells wrote:
> +void dissolve_on_fput(struct vfsmount *mnt)
> +{
> + namespace_lock();
> + lock_mount_hash();
> + mntget(mnt);
> + umount_tree(real_mount(mnt), UMOUNT_SYNC);
> + unlock_mount_hash();
> + namespace_unlock();
> +}
Can I ask why UMOUNT_SYNC is used here? I feel like I must have missed
something, but doesn't it skip the IS_MNT_LOCKED() check in
disconnect_mount() ?
UMOUNT_SYNC seems used for non-lazy unmounts, and in internal cleanups
where userspace wouldn't be able to see. But I think userspace can keep
watching in this case, e.g. by `fd2 = openat(fd, ".", O_PATH)` (or `fd2
= open_tree(fd, ".", 0)` ?). I would think this function should avoid
using UMOUNT_SYNC, like lazy unmount avoids it.
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> open_tree(dfd, pathname, flags)
>
> Returns an O_PATH-opened file descriptor or an error.
> dfd and pathname specify the location to open, in usual
> fashion (see e.g. fstatat(2)). flags should be an OR of
> some of the following:
> * AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
> same meanings as usual
> * OPEN_TREE_CLOEXEC - make the resulting descriptor
> close-on-exec
> * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
> instead of opening the location in question, create a detached
> mount tree matching the subtree rooted at location specified by
> dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
> without it - only the part within the mount containing the
> location in question. In other words, the same as mount --rbind
> or mount --bind would've taken.
One of the limitations documented for `mount --bind`, is that `mount -o
bind,ro` is not atomic. There's a workaround if you need it, it's just
a bit clunky. I wondered if it was possible to improve `mount` by
changing the mount flags between OPEN_TREE_CLONE and move_mount().
fd = open_tree(..., OPEN_TREE_CLONE);
fchdir(fd);
mount(NULL, ".", NULL, MS_REMOUNT | MS_BIND | newbindflags, NULL);
move_mount(fd, ...);
Another closely-related limitation of `mount`, is that it can't
atomically set the propagation type at mount time.
My conclusion was the above doesn't quite work yet. do_remount() still
uses check_mnt(), so it doesn't accept detached mounts. I wonder if it
can be changed in future.
> The detached tree will be
> dissolved on the final close of obtained file.
My last question turned out very dull, feel free to ignore.
It seems the only way to use MNT_FORCE[1], is to first attach the
filesystem somewhere (and close the file descriptor). I wondered if
there was a way to make things more regular. close_and_umount() feels
too obscure to live though...
[1] "Ask the filesystem to abort pending requests before attempting
the unmount. This may allow the unmount to complete without waiting for an
inaccessible server. If, after aborting requests, some processes still
have active references to the filesystem, the unmount will still fail."
...and I suppose it's much less useful than I thought. The point of
MNT_FORCE is to kick out processes that were blocked _trying to access a
file by name_, e.g. open() or stat(). But if we're considering a
detached mount, then it's impossible to access it by name alone. You
need an fd (or cwd or root), which would stop the filesystem being
unmounted anyway. close_and_umount(fd, MNT_FORCE) is pointless unless
your process has other threads accessing the filesystem through the same
fd, but that's a really bad idea anyway.
It could prevent someone else getting stuck indefinitely on
/proc/$PID/fd, but that's still very obscure.
Regards
Alan
* Re: [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]
2018-08-02 17:31 ` Alan Jenkins
@ 2018-08-02 21:29 ` Al Viro
0 siblings, 0 replies; 70+ messages in thread
From: Al Viro @ 2018-08-02 21:29 UTC (permalink / raw)
To: Alan Jenkins
Cc: David Howells, linux-api, torvalds, linux-fsdevel, linux-kernel
On Thu, Aug 02, 2018 at 06:31:06PM +0100, Alan Jenkins wrote:
> Hi
>
> I found this interesting, though I don't entirely follow the kernel
> mount/unmount code. I had one puzzle about the code, and two questions
> which I was largely able to answer.
>
> On 01/08/18 16:24, David Howells wrote:
> > +void dissolve_on_fput(struct vfsmount *mnt)
> > +{
> > + namespace_lock();
> > + lock_mount_hash();
> > + mntget(mnt);
> > + umount_tree(real_mount(mnt), UMOUNT_SYNC);
> > + unlock_mount_hash();
> > + namespace_unlock();
> > +}
>
> Can I ask why UMOUNT_SYNC is used here? I feel like I must have missed
> something, but doesn't it skip the IS_MNT_LOCKED() check in
> disconnect_mount() ?
>
> UMOUNT_SYNC seems used for non-lazy unmounts, and in internal cleanups where
> userspace wouldn't be able to see. But I think userspace can keep watching
> in this case, e.g. by `fd2 = openat(fd, ".", O_PATH)` (or `fd2 =
> open_tree(fd, ".", 0)` ?). I would think this function should avoid using
> UMOUNT_SYNC, like lazy unmount avoids it.
FWIW, I suspect that UMOUNT_CONNECTED might be the right thing here...
* Re: [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]
2018-08-01 15:24 ` [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount " David Howells
2018-08-02 17:31 ` Alan Jenkins
@ 2018-08-02 21:51 ` David Howells
2018-08-02 23:46 ` Alan Jenkins
1 sibling, 1 reply; 70+ messages in thread
From: David Howells @ 2018-08-02 21:51 UTC (permalink / raw)
To: Alan Jenkins
Cc: dhowells, viro, linux-api, torvalds, linux-fsdevel, linux-kernel
Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
> Another closely-related limitation of `mount`, is that it can't atomically set
> the propagation type at mount time.
I want to add a mount_setattr() too at some point:
fd = open_tree(..., OPEN_TREE_CLONE);
mount_setattr(fd, ...);
move_mount(fd, ...);
I'm not sure whether you should be able to fchdir into the cloned tree since
it isn't attached to any mount namespace.
David
* Re: [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]
2018-08-02 21:51 ` David Howells
@ 2018-08-02 23:46 ` Alan Jenkins
0 siblings, 0 replies; 70+ messages in thread
From: Alan Jenkins @ 2018-08-02 23:46 UTC (permalink / raw)
To: David Howells; +Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel
On 02/08/18 22:51, David Howells wrote:
> Alan Jenkins <alan.christopher.jenkins@gmail.com> wrote:
>
>> Another closely-related limitation of `mount`, is that it can't atomically set
>> the propagation type at mount time.
> I want to add a mount_setattr() too at some point:
>
> fd = open_tree(..., OPEN_TREE_CLONE);
> mount_setattr(fd, ...);
> move_mount(fd, ...);
Cool. Not having to mess with fchdir() sounds nice. (And as a bonus,
being able to avoid the existing multiplexed mount() call, which looks
ugly from all the NULL arguments if nothing else).
> I'm not sure whether you should be able to fchdir into the cloned tree since
> it isn't attached to any mount namespace.
>
> David
I don't see a check prohibiting it :-). I don't think it's a problem.
You can already chdir/chroot into a different mount namespace, you just
can't do any mount operations on it. (You said you want to be able to,
but so far move_mount() still prohibits it, I guess that's for the future).
And you can already do the same into a mount that has been detached,
which will have `mount->mnt_ns = NULL` if I'm reading correctly.
Hmm, there is something that's been nagging at me though. I'm
suspicious about what happens in this series, when you move_mount() from
a victim of MNT_DETACH. I think umount2(MNT_DETACH) sets a flag
MNT_UMOUNT. It's a flag that was added to help correctly handle
MNT_LOCKED in the face of umount2(MNT_DETACH). It's also the point
where my understanding of the kernel mount/unmount code breaks down
:-). But it seems to override both IS_MNT_LOCKED() and UMOUNT_CONNECTED
in disconnect_mount(). That would give another chance to defeat locked
mounts.
Regards
Alan
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-01 15:27 ` [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
@ 2018-08-06 17:28 ` Eric W. Biederman
2018-08-09 14:14 ` David Howells
2018-08-09 14:24 ` David Howells
2 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-06 17:28 UTC (permalink / raw)
To: David Howells; +Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel
David Howells <dhowells@redhat.com> writes:
>
> (*) FSCONFIG_CMD_CREATE: Trigger superblock creation.
>
> (*) FSCONFIG_CMD_RECONFIGURE: Trigger superblock reconfiguration.
>
First let me thank you for adding both FSCONFIG_CMD_CREATE and
FSCONFIG_CMD_RECONFIGURE. Unfortunately the implementation is currently
broken. So this patch gets my:
This is broken in two specific ways.
1) FSCONFIG_CMD_RECONFIGURE always returns -EOPNOTSUPP.
So it is useless.
2) FSCONFIG_CMD_CREATE will succeed even if the superblock already
exists and it can not use all of the superblock parameters.
This happens because vfs_get_super will only call fill_super
if the super block is created. Which is reasonable on the face
of it. But in practice this introduces security problems.
a) Either through reconfiguring a shared super block you did not
realize was shared (as we saw with devpts).
b) Mounting a super block and not honoring its mount options
because something has already mounted it. As we see today
with proc. Leaving userspace to think the filesystem will behave
one way when in fact it behaves another.
I have already explained this several times, and apparently I have been
ignored. This is a fundamental usability issue that leads to security
problems.
The only feedback I have had from previous times is that it is ``racy''
to fix the code. But it is only racy in the way that O_EXCL is racy.
You might have to retry in userspace if the mount you want isn't in the
state you expect.
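For reference, the O_EXCL behaviour being compared against looks like this
from userspace. This is a minimal sketch using open(2) on an ordinary file;
sketch_create_exclusive() is a hypothetical helper name and nothing here
touches the mount API.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create 'path' only if it does not already exist.  Returns 0 on success or
 * a negative errno value.  -EEXIST means that something else got there
 * first; the caller then has to decide whether to retry, reuse what is
 * there or give up.
 */
static int sketch_create_exclusive(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```

The caller never silently gets an object that somebody else set up; it gets
an error and can look at what is there before deciding what to do.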
Until this security issue is fixed this entire patchset has my:
Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> +/*
> + * Perform an action on a context.
> + */
> +static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
> +{
> + int ret = -EINVAL;
> +
> + switch (cmd) {
> + case FSCONFIG_CMD_CREATE:
> + if (fc->phase != FS_CONTEXT_CREATE_PARAMS)
> + return -EBUSY;
> + fc->phase = FS_CONTEXT_CREATING;
> + ret = vfs_get_tree(fc);
> + if (ret == 0)
> + fc->phase = FS_CONTEXT_AWAITING_MOUNT;
> + else
> + fc->phase = FS_CONTEXT_FAILED;
> + return ret;
> +
> + default:
> + return -EOPNOTSUPP;
> + }
> +}
See no support for FSCONFIG_CMD_RECONFIGURE, and no checks to see if
the superblock has already been mounted.
> + ret = mutex_lock_interruptible(&fc->uapi_mutex);
> + if (ret == 0) {
> + switch (cmd) {
> + case FSCONFIG_CMD_CREATE:
> + case FSCONFIG_CMD_RECONFIGURE:
> + ret = vfs_fsconfig_action(fc, cmd);
> + break;
> + default:
> + ret = vfs_fsconfig(fc, &param);
> + break;
> + }
> + mutex_unlock(&fc->uapi_mutex);
> + }
> +
Eric
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-01 15:27 ` [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
2018-08-06 17:28 ` Eric W. Biederman
@ 2018-08-09 14:14 ` David Howells
2018-08-09 14:24 ` David Howells
2 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-09 14:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: dhowells, viro, linux-api, torvalds, linux-fsdevel, linux-kernel
Eric W. Biederman <ebiederm@xmission.com> wrote:
> First let me thank you for adding both FSCONFIG_CMD_CREATE and
> FSCONFIG_CMD_RECONFIGURE. Unfortunately the implementation is currently
> broken. So this patch gets my:
>
> This is broken in two specific ways.
> 1) FSCONFIG_CMD_RECONFIGURE always returns -EOPNOTSUPP.
> So it is useless.
This isn't broken, just not completely implemented. I would like to get the
core VFS framework upstream so that filesystems can start being converted.
However, since you asked nicely, here's a patch that adds the reconfiguration
bit.
David
---
vfs: Use fs_context for reconfig and implement FSCONFIG_CMD_RECONFIGURE
Implement fs_context-based reconfiguration by the following means:
(1) Provide two more internal fs_context purposes: umount that actually
reconfigures to R/O and emergency reconfiguration to R/O. This tells
the filesystem if the context hasn't been fully initialised.
(2) Track which bits in sb_mask have been changed in addition to what
they've been changed to.
(3) Move ->reconfigure() from the superblock ops to the fs_context ops.
This makes (4) possible. The ->remount_fs() superblock op is
obsolete.
(4) Provide a legacy wrapper for ->reconfigure().
(5) Make a do_umount() that's unmounting a root allocate an fs_context on
the stack and use that to reconfigure to R/O.
(6) Make do_emergency_remount_callback() allocate an fs_context on the
stack and use that to reconfigure to R/O.
(7) Make do_remount() unconditionally use an fs_context to invoke
do_remount_sb().
(8) Only pass in a filesystem context to do_remount_sb(). This, along
with (4), allows the function to be simplified.
(9) Pass errors back from mount_single() if reconfiguration fails. We
might want this behaviour to be conditional, depending on which mount
API was used.
(10) Delete security_sb_remount().
(11) Rename do_remount_sb() to reconfigure_super().
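To illustrate why the mask in (2) is needed: a flags word on its own cannot
distinguish "clear this bit" from "leave this bit alone". A minimal sketch
of applying a flags/mask pair follows, assuming the obvious merge rule; the
flag values and the name sketch_apply_sb_flags() are illustrative, not taken
from the patch.

```c
/* Illustrative superblock flag bits. */
#define SK_SB_RDONLY		0x01u
#define SK_SB_SYNCHRONOUS	0x10u

/*
 * Merge a requested change into the current superblock flags.  Only bits
 * set in sb_flags_mask are touched; for those, the new value comes from
 * sb_flags.  Every other bit keeps its current value.
 */
static unsigned int sketch_apply_sb_flags(unsigned int s_flags,
					  unsigned int sb_flags,
					  unsigned int sb_flags_mask)
{
	return (s_flags & ~sb_flags_mask) | (sb_flags & sb_flags_mask);
}
```

With this rule, a request of sb_flags = 0 and sb_flags_mask = SK_SB_RDONLY
means "make it read-write and change nothing else".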
Notes:
(1) do_remount() can't make use of vfs_reconfigure_sb() if the former
changes the mount attributes atomically or if the latter doesn't do so
at all.
However, since I think Al wants us to move towards separating
superblock reconfiguration from mountpoint reconfiguration, there may
not be a need to do this atomically.
(2) mount_single() probably shouldn't reconfigure an already existing
superblock if it's supposed to be creating a new one, but rather it
(or, rather, the filesystem) should compare the parameters and either
return the live superblock if the params are the same or return an
error if not.
However, this probably needs to be contingent on the mount API or
something so as not to break userspace.
(3) I should add something like an FSOPEN_EXCL flag to tell sget_fc() to
fail if the superblock already exists at all and an
FSOPEN_FAIL_ON_MISMATCH flag to explicitly say that we *don't* want a
superblock with different parameters. The implication of providing
neither flag is that we are happy to accept a superblock from the same
source but with different parameters.
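The three behaviours described in note (3) can be written down as a small
decision function. This is only a sketch of the proposal: the flag values,
the name sketch_sb_share_policy() and the choice of -EBUSY as the error are
all assumptions, since the note gives names for the flags but nothing else.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values; the note proposes the names only. */
#define SK_FSOPEN_EXCL			0x02u
#define SK_FSOPEN_FAIL_ON_MISMATCH	0x04u

/*
 * Decide what happens once superblock lookup has finished.  Returns 0 if
 * the caller may proceed, either by creating a new superblock or by sharing
 * the existing one, or a negative errno value (assumed here to be -EBUSY).
 */
static int sketch_sb_share_policy(unsigned int flags, bool exists,
				  bool params_match)
{
	if (!exists)
		return 0;	/* nothing to share: create a new one */
	if (flags & SK_FSOPEN_EXCL)
		return -EBUSY;	/* any existing superblock is a failure */
	if ((flags & SK_FSOPEN_FAIL_ON_MISMATCH) && !params_match)
		return -EBUSY;	/* exists, but configured differently */
	return 0;		/* share it, matching parameters or not */
}
```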
Signed-off-by: David Howells <dhowells@redhat.com>
---
Documentation/filesystems/mount_api.txt | 56 +++++++-------
fs/afs/mntpt.c | 2
fs/afs/super.c | 2
fs/fs_context.c | 46 ++++++-----
fs/fsopen.c | 84 +++++++++++++++++++--
fs/hugetlbfs/inode.c | 2
fs/internal.h | 9 +-
fs/kernfs/mount.c | 5 -
fs/libfs.c | 3
fs/namespace.c | 87 +++++++++-------------
fs/nfs/super.c | 2
fs/proc/inode.c | 1
fs/proc/internal.h | 1
fs/proc/root.c | 6 +
fs/super.c | 125 ++++++++++++++++++++------------
include/linux/fs.h | 1
include/linux/fs_context.h | 8 +-
include/linux/kernfs.h | 1
include/linux/lsm_hooks.h | 9 --
include/linux/security.h | 6 -
ipc/mqueue.c | 2
kernel/cgroup/cgroup.c | 2
kernel/cgroup/cpuset.c | 3
security/security.c | 5 -
security/selinux/hooks.c | 1
25 files changed, 283 insertions(+), 186 deletions(-)
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
index 5fec78eed4f4..35cc5c7a5008 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.txt
@@ -55,16 +55,6 @@ purposes - otherwise it will be NULL.
Note that security initialisation is done *after* the filesystem is called so
that the namespaces may be adjusted first.
-And the super_operations struct gains one field:
-
- int (*reconfigure)(struct super_block *, struct fs_context *);
-
-This shadows the ->reconfigure() operation and takes a prepared filesystem
-context instead of the mount flags and data page. It may modify the sb_flags
-in the context for the caller to pick up.
-
-[NOTE] reconfigure is intended as a replacement for remount_fs.
-
======================
THE FILESYSTEM CONTEXT
@@ -86,6 +76,7 @@ context. This is represented by the fs_context structure:
void *security;
void *s_fs_info;
unsigned int sb_flags;
+ unsigned int sb_flags_mask;
enum fs_context_purpose purpose:8;
bool sloppy:1;
bool silent:1;
@@ -150,8 +141,9 @@ The fs_context fields are as follows:
sget_fc(). This can be used to distinguish superblocks.
(*) unsigned int sb_flags
+ (*) unsigned int sb_flags_mask
- This holds the SB_* flags to be set in super_block::s_flags.
+ Which SB_* flag bits are to be set/cleared in super_block::s_flags.
(*) enum fs_context_purpose
@@ -162,6 +154,10 @@ The fs_context fields are as follows:
FS_CONTEXT_FOR_KERNEL_MOUNT, -- New superblock for kernel-internal mount
FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
+ FS_CONTEXT_FOR_UMOUNT -- Reconfigure to R/O for umount()
+ FS_CONTEXT_FOR_EMERGENCY_RO -- Emergency reconfigure to R/O
+
+ In the last two cases, ->init_fs_context() will not have been called.
(*) bool sloppy
(*) bool silent
@@ -174,8 +170,8 @@ The fs_context fields are as follows:
[NOTE] silent is probably redundant with sb_flags & SB_SILENT.
-The mount context is created by calling vfs_new_fs_context(), vfs_sb_reconfig()
-or vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
+The mount context is created by calling vfs_new_fs_context() or
+vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
structure is not refcounted.
VFS, security and filesystem mount options are set individually with
@@ -206,6 +202,7 @@ The filesystem context points to a table of operations:
size_t data_size);
int (*validate)(struct fs_context *fc);
int (*get_tree)(struct fs_context *fc);
+ int (*reconfigure)(struct fs_context *fc);
};
These operations are invoked by the various stages of the mount procedure to
@@ -278,6 +275,18 @@ manage the filesystem context. They are as follows:
The phase on a userspace-driven context will be set to only allow this to
be called once on any particular context.
+ (*) int (*reconfigure)(struct fs_context *fc);
+
+ Called to effect reconfiguration of a superblock using information stored
+ in the filesystem context. It may detach any resources it desires from
+ the filesystem context and transfer them to the superblock. The
+ superblock can be found from fc->root->d_sb.
+
+ On success it should return 0. In the case of an error, it should return
+ a negative error code.
+
+ [NOTE] reconfigure is intended as a replacement for remount_fs.
+
===========================
FILESYSTEM CONTEXT SECURITY
@@ -358,24 +367,19 @@ one for destroying a context:
(*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
struct dentry *reference,
unsigned int sb_flags,
+ unsigned int sb_flags_mask,
enum fs_context_purpose purpose);
Create a filesystem context for a given filesystem type and purpose. This
- allocates the filesystem context, sets the flags, initialises the security
- and calls fs_type->init_fs_context() to initialise the filesystem private
- data.
+ allocates the filesystem context, sets the superblock flags, initialises
+ the security and calls fs_type->init_fs_context() to initialise the
+ filesystem private data.
reference can be NULL or it may indicate the root dentry of a superblock
- that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or the
- automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT). This
- is provided as a source of namespace information.
-
- (*) struct fs_context *vfs_sb_reconfig(struct vfsmount *mnt,
- unsigned int sb_flags);
-
- Create a filesystem context from the same filesystem as an extant mount
- and initialise the mount parameters from the superblock underlying that
- mount. This is for use by superblock parameter reconfiguration.
+ that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE,
+ FS_CONTEXT_FOR_UMOUNT or FS_CONTEXT_FOR_EMERGENCY_RO) or the automount
+ point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT). This is
+ provided as a source of namespace information.
(*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index c8a7f05b9f12..16ee515b51c9 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -147,7 +147,7 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
BUG_ON(!d_inode(mntpt));
- fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0,
+ fc = vfs_new_fs_context(&afs_fs_type, mntpt, 0, 0,
FS_CONTEXT_FOR_SUBMOUNT);
if (IS_ERR(fc))
return ERR_CAST(fc);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 7c97836e7937..43cf1a6a4bf7 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -634,7 +634,7 @@ static int afs_init_fs_context(struct fs_context *fc, struct dentry *reference)
}
break;
- case FS_CONTEXT_FOR_RECONFIGURE:
+ default:
break;
}
diff --git a/fs/fs_context.c b/fs/fs_context.c
index a6597a2fbf2b..2e9a88f41071 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -106,12 +106,14 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
token = lookup_constant(common_set_sb_flag, key, 0);
if (token) {
fc->sb_flags |= token;
+ fc->sb_flags_mask |= token;
return 1;
}
token = lookup_constant(common_clear_sb_flag, key, 0);
if (token) {
fc->sb_flags &= ~token;
+ fc->sb_flags_mask |= token;
return 1;
}
@@ -240,6 +242,7 @@ EXPORT_SYMBOL(generic_parse_monolithic);
* @fs_type: The filesystem type.
* @reference: The dentry from which this one derives (or NULL)
* @sb_flags: Filesystem/superblock flags (SB_*)
+ * @sb_flags_mask: Applicable members of @sb_flags
* @purpose: The purpose that this configuration shall be used for.
*
* Open a filesystem and create a mount context. The mount context is
@@ -250,6 +253,7 @@ EXPORT_SYMBOL(generic_parse_monolithic);
struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
struct dentry *reference,
unsigned int sb_flags,
+ unsigned int sb_flags_mask,
enum fs_context_purpose purpose)
{
int (*init_fs_context)(struct fs_context *, struct dentry *);
@@ -262,6 +266,7 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
fc->purpose = purpose;
fc->sb_flags = sb_flags;
+ fc->sb_flags_mask = sb_flags_mask;
fc->fs_type = get_filesystem(fs_type);
fc->cred = get_current_cred();
@@ -280,6 +285,8 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
fc->net_ns = get_net(current->nsproxy->net_ns);
break;
case FS_CONTEXT_FOR_RECONFIGURE:
+ case FS_CONTEXT_FOR_UMOUNT:
+ case FS_CONTEXT_FOR_EMERGENCY_RO:
/* We don't pin any namespaces as the superblock's
* subscriptions cannot be changed at this point.
*/
@@ -314,28 +321,6 @@ struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(vfs_new_fs_context);
-/**
- * vfs_sb_reconfig - Create a filesystem context for remount/reconfiguration
- * @mountpoint: The mountpoint to open
- * @sb_flags: Filesystem/superblock flags (SB_*)
- *
- * Open a mounted filesystem and create a filesystem context such that a
- * remount can be effected.
- */
-struct fs_context *vfs_sb_reconfig(struct path *mountpoint,
- unsigned int sb_flags)
-{
- struct fs_context *fc;
-
- fc = vfs_new_fs_context(mountpoint->dentry->d_sb->s_type,
- mountpoint->dentry,
- sb_flags, FS_CONTEXT_FOR_RECONFIGURE);
- if (IS_ERR(fc))
- return fc;
-
- return fc;
-}
-
/**
* vfs_dup_fc_config: Duplicate a filesytem context.
* @src_fc: The context to copy.
@@ -754,6 +739,22 @@ static int legacy_get_tree(struct fs_context *fc)
return ret;
}
+/*
+ * Handle remount.
+ */
+static int legacy_reconfigure(struct fs_context *fc)
+{
+ struct legacy_fs_context *ctx = fc->fs_private;
+ struct super_block *sb = fc->root->d_sb;
+
+ if (!sb->s_op->remount_fs)
+ return 0;
+
+ return sb->s_op->remount_fs(sb, &fc->sb_flags,
+ ctx ? ctx->legacy_data : NULL,
+ ctx ? ctx->data_size : 0);
+}
+
const struct fs_context_operations legacy_fs_context_ops = {
.free = legacy_fs_context_free,
.dup = legacy_fs_context_dup,
@@ -761,6 +762,7 @@ const struct fs_context_operations legacy_fs_context_ops = {
.parse_monolithic = legacy_parse_monolithic,
.validate = legacy_validate,
.get_tree = legacy_get_tree,
+ .reconfigure = legacy_reconfigure,
};
/*
diff --git a/fs/fsopen.c b/fs/fsopen.c
index e79bb5b085d6..9ead9220e2cb 100644
--- a/fs/fsopen.c
+++ b/fs/fsopen.c
@@ -137,7 +137,7 @@ SYSCALL_DEFINE2(fsopen, const char __user *, _fs_name, unsigned int, flags)
if (!fs_type)
return -ENODEV;
- fc = vfs_new_fs_context(fs_type, NULL, 0, FS_CONTEXT_FOR_USER_MOUNT);
+ fc = vfs_new_fs_context(fs_type, NULL, 0, 0, FS_CONTEXT_FOR_USER_MOUNT);
put_filesystem(fs_type);
if (IS_ERR(fc))
return PTR_ERR(fc);
@@ -185,12 +185,8 @@ SYSCALL_DEFINE3(fspick, int, dfd, const char __user *, path, unsigned int, flags
if (ret < 0)
goto err;
- ret = -EOPNOTSUPP;
- if (!target.dentry->d_sb->s_op->reconfigure)
- goto err_path;
-
fc = vfs_new_fs_context(target.dentry->d_sb->s_type, target.dentry,
- 0, FS_CONTEXT_FOR_RECONFIGURE);
+ 0, 0, FS_CONTEXT_FOR_RECONFIGURE);
if (IS_ERR(fc)) {
ret = PTR_ERR(fc);
goto err_path;
@@ -255,6 +251,58 @@ static int vfs_fsconfig(struct fs_context *fc, struct fs_parameter *param)
return vfs_parse_fs_param(fc, param);
}
+/*
+ * Reconfigure a superblock.
+ */
+int vfs_reconfigure_sb(struct fs_context *fc)
+{
+ struct super_block *sb = fc->root->d_sb;
+ int ret;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret)
+ return ret;
+
+ if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ down_write(&sb->s_umount);
+ ret = reconfigure_super(fc);
+ up_write(&sb->s_umount);
+ return ret;
+}
+
+/*
+ * Clean up a context after performing an action on it and put it into a state
+ * from where it can be used to reconfigure a superblock.
+ */
+void vfs_clean_context(struct fs_context *fc)
+{
+ if (fc->ops && fc->ops->free)
+ fc->ops->free(fc);
+ fc->need_free = false;
+ fc->fs_private = NULL;
+ fc->s_fs_info = NULL;
+ fc->sb_flags = 0;
+ fc->sloppy = false;
+ fc->silent = false;
+ security_fs_context_free(fc);
+ fc->security = NULL;
+ kfree(fc->subtype);
+ fc->subtype = NULL;
+ kfree(fc->source);
+ fc->source = NULL;
+
+ fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
+ fc->phase = FS_CONTEXT_AWAITING_RECONF;
+}
+
/*
* Perform an action on a context.
*/
@@ -274,6 +322,30 @@ static int vfs_fsconfig_action(struct fs_context *fc, enum fsconfig_command cmd)
fc->phase = FS_CONTEXT_FAILED;
return ret;
+ case FSCONFIG_CMD_RECONFIGURE:
+ if (fc->phase == FS_CONTEXT_AWAITING_RECONF) {
+ /* This is probably pointless, since no changes have
+ * been proposed.
+ */
+ if (fc->fs_type->init_fs_context) {
+ ret = fc->fs_type->init_fs_context(fc, fc->root);
+ if (ret < 0) {
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+ }
+ fc->need_free = true;
+ }
+ fc->phase = FS_CONTEXT_RECONF_PARAMS;
+ }
+
+ fc->phase = FS_CONTEXT_RECONFIGURING;
+ ret = vfs_reconfigure_sb(fc);
+ if (ret == 0)
+ vfs_clean_context(fc);
+ else
+ fc->phase = FS_CONTEXT_FAILED;
+ return ret;
+
default:
return -EOPNOTSUPP;
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e2378c8ce7e0..c09a1cd4fa5a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1510,7 +1510,7 @@ static struct vfsmount *__init mount_one_hugetlbfs(struct hstate *h)
struct vfsmount *mnt;
int ret;
- fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0,
+ fc = vfs_new_fs_context(&hugetlbfs_fs_type, NULL, 0, 0,
FS_CONTEXT_FOR_KERNEL_MOUNT);
if (IS_ERR(fc)) {
ret = PTR_ERR(fc);
diff --git a/fs/internal.h b/fs/internal.h
index e5bdfc52b9a1..9c7dd6f12f35 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -54,6 +54,11 @@ extern void __init chrdev_init(void);
*/
extern const struct fs_context_operations legacy_fs_context_ops;
+/*
+ * fsopen.c
+ */
+extern void vfs_clean_context(struct fs_context *fc);
+
/*
* namei.c
*/
@@ -77,6 +82,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
*/
extern void *copy_mount_options(const void __user *);
extern char *copy_mount_string(const void __user *);
+extern int parse_monolithic_mount_data(struct fs_context *, void *, size_t);
extern struct vfsmount *lookup_mnt(const struct path *);
extern int finish_automount(struct vfsmount *, struct path *);
@@ -106,8 +112,7 @@ extern struct file *get_empty_filp(void);
/*
* super.c
*/
-extern int do_remount_sb(struct super_block *, int, void *, size_t, int,
- struct fs_context *);
+extern int reconfigure_super(struct fs_context *);
extern bool trylock_super(struct super_block *sb);
extern struct super_block *user_get_super(dev_t);
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index b568e6c5e063..ec14dc76fe89 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -23,9 +23,9 @@
struct kmem_cache *kernfs_node_cache;
-static int kernfs_sop_reconfigure(struct super_block *sb, struct fs_context *fc)
+int kernfs_reconfigure(struct fs_context *fc)
{
- struct kernfs_root *root = kernfs_info(sb)->root;
+ struct kernfs_root *root = kernfs_info(fc->root->d_sb)->root;
struct kernfs_syscall_ops *scops = root->syscall_ops;
if (scops && scops->reconfigure)
@@ -75,7 +75,6 @@ const struct super_operations kernfs_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = kernfs_evict_inode,
- .reconfigure = kernfs_sop_reconfigure,
.show_options = kernfs_sop_show_options,
.show_path = kernfs_sop_show_path,
.get_fsinfo = kernfs_sop_get_fsinfo,
diff --git a/fs/libfs.c b/fs/libfs.c
index d9a5d883dc3f..b1744c071ab0 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -583,7 +583,8 @@ int simple_pin_fs(struct file_system_type *type, struct vfsmount **mount, int *c
if (unlikely(!*mount)) {
spin_unlock(&pin_fs_lock);
- fc = vfs_new_fs_context(type, NULL, 0, FS_CONTEXT_FOR_KERNEL_MOUNT);
+ fc = vfs_new_fs_context(type, NULL, 0, 0,
+ FS_CONTEXT_FOR_KERNEL_MOUNT);
if (IS_ERR(fc))
return PTR_ERR(fc);
diff --git a/fs/namespace.c b/fs/namespace.c
index e34e3fd064b0..47aea9542bf1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1479,6 +1479,25 @@ static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
static void shrink_submounts(struct mount *mnt);
+static int do_umount_root(struct super_block *sb)
+{
+ int ret = 0;
+ struct fs_context fc = {
+ .purpose = FS_CONTEXT_FOR_UMOUNT,
+ .fs_type = sb->s_type,
+ .root = sb->s_root,
+ .sb_flags = SB_RDONLY,
+ .sb_flags_mask = SB_RDONLY,
+ };
+
+ down_write(&sb->s_umount);
+ if (!sb_rdonly(sb))
+ /* Might want to call ->init_fs_context(). */
+ ret = reconfigure_super(&fc);
+ up_write(&sb->s_umount);
+ return ret;
+}
+
static int do_umount(struct mount *mnt, int flags)
{
struct super_block *sb = mnt->mnt.mnt_sb;
@@ -1544,11 +1563,7 @@ static int do_umount(struct mount *mnt, int flags)
*/
if (!ns_capable(sb->s_user_ns, CAP_SYS_ADMIN))
return -EPERM;
- down_write(&sb->s_umount);
- if (!sb_rdonly(sb))
- retval = do_remount_sb(sb, SB_RDONLY, NULL, 0, 0, NULL);
- up_write(&sb->s_umount);
- return retval;
+ return do_umount_root(sb);
}
namespace_lock();
@@ -2394,7 +2409,7 @@ static int do_reconfigure_mnt(struct path *path, unsigned int mnt_flags)
/*
* Parse the monolithic page of mount data given to sys_mount().
*/
-static int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
+int parse_monolithic_mount_data(struct fs_context *fc, void *data, size_t data_size)
{
int (*monolithic_mount_data)(struct fs_context *, void *, size_t);
@@ -2417,7 +2432,6 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
int err;
struct super_block *sb = path->mnt->mnt_sb;
struct mount *mnt = real_mount(path->mnt);
- struct file_system_type *type = sb->s_type;
if (!check_mnt(mnt))
return -EINVAL;
@@ -2428,41 +2442,34 @@ static int do_remount(struct path *path, int ms_flags, int sb_flags,
if (!can_change_locked_flags(mnt, mnt_flags))
return -EPERM;
- if (type->init_fs_context) {
- fc = vfs_sb_reconfig(path, sb_flags);
- if (IS_ERR(fc))
- return PTR_ERR(fc);
+ fc = vfs_new_fs_context(path->dentry->d_sb->s_type,
+ path->dentry, sb_flags, MS_RMT_MASK,
+ FS_CONTEXT_FOR_RECONFIGURE);
- err = parse_monolithic_mount_data(fc, data, data_size);
+ err = parse_monolithic_mount_data(fc, data, data_size);
+ if (err < 0)
+ goto err_fc;
+
+ if (fc->ops->validate) {
+ err = fc->ops->validate(fc);
if (err < 0)
goto err_fc;
-
- if (fc->ops->validate) {
- err = fc->ops->validate(fc);
- if (err < 0)
- goto err_fc;
- }
-
- err = security_fs_context_validate(fc);
- if (err)
- return err;
- } else {
- err = security_sb_remount(sb, data, data_size);
- if (err)
- return err;
}
+ err = security_fs_context_validate(fc);
+ if (err)
+ goto err_fc;
+
down_write(&sb->s_umount);
err = -EPERM;
if (ns_capable(sb->s_user_ns, CAP_SYS_ADMIN)) {
- err = do_remount_sb(sb, sb_flags, data, data_size, 0, fc);
+ err = reconfigure_super(fc);
if (!err)
set_mount_attributes(mnt, mnt_flags);
}
up_write(&sb->s_umount);
err_fc:
- if (fc)
- put_fs_context(fc);
+ put_fs_context(fc);
return err;
}
@@ -2667,7 +2674,7 @@ static int do_new_mount(struct path *mountpoint, const char *fstype,
if (!fs_type)
goto out;
- fc = vfs_new_fs_context(fs_type, NULL, sb_flags,
+ fc = vfs_new_fs_context(fs_type, NULL, sb_flags, sb_flags,
FS_CONTEXT_FOR_USER_MOUNT);
put_filesystem(fs_type);
if (IS_ERR(fc)) {
@@ -3294,7 +3301,7 @@ struct vfsmount *vfs_kern_mount(struct file_system_type *type,
if (!type)
return ERR_PTR(-EINVAL);
- fc = vfs_new_fs_context(type, NULL, sb_flags,
+ fc = vfs_new_fs_context(type, NULL, sb_flags, sb_flags,
sb_flags & SB_KERNMOUNT ?
FS_CONTEXT_FOR_KERNEL_MOUNT :
FS_CONTEXT_FOR_USER_MOUNT);
@@ -3436,23 +3443,7 @@ SYSCALL_DEFINE3(fsmount, int, fs_fd, unsigned int, flags, unsigned int, ms_flags
* do any memory allocation or anything like that at this point as we
* don't want to have to handle any errors incurred.
*/
- if (fc->ops && fc->ops->free)
- fc->ops->free(fc);
- fc->need_free = false;
- fc->fs_private = NULL;
- fc->s_fs_info = NULL;
- fc->sb_flags = 0;
- fc->sloppy = false;
- fc->silent = false;
- security_fs_context_free(fc);
- fc->security = NULL;
- kfree(fc->subtype);
- fc->subtype = NULL;
- kfree(fc->source);
- fc->source = NULL;
-
- fc->purpose = FS_CONTEXT_FOR_RECONFIGURE;
- fc->phase = FS_CONTEXT_AWAITING_RECONF;
+ vfs_clean_context(fc);
/* Attach to an apparent O_PATH fd with a note that we need to unmount
* it, not just simply put it.
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b5f27d6999e5..9a4eec0ef20a 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2296,7 +2296,7 @@ nfs_remount(struct super_block *sb, int *flags, char *raw_data, size_t data_size
/*
* noac is a special case. It implies -o sync, but that's not
- * necessarily reflected in the mtab options. do_remount_sb
+ * necessarily reflected in the mtab options. reconfigure_super
* will clear SB_SYNCHRONOUS if -o sync wasn't specified in the
* remount options, so we have to explicitly reset it.
*/
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 38155bec4a54..8d6f46558fa4 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -127,7 +127,6 @@ const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode = proc_evict_inode,
.statfs = simple_statfs,
- .reconfigure = proc_reconfigure,
.show_options = proc_show_options,
};
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ea8c5468eafc..75a225688a4c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -273,7 +273,6 @@ static inline void proc_tty_init(void) {}
extern struct proc_dir_entry proc_root;
extern void proc_self_init(void);
-extern int proc_reconfigure(struct super_block *, struct fs_context *);
/*
* task_[no]mmu.c
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 1d6e5bfa30cc..64aa32155432 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -148,8 +148,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return proc_setup_thread_self(s);
}
-int proc_reconfigure(struct super_block *sb, struct fs_context *fc)
+static int proc_reconfigure(struct fs_context *fc)
{
+ struct super_block *sb = fc->root->d_sb;
struct pid_namespace *pid = sb->s_fs_info;
sync_filesystem(sb);
@@ -180,6 +181,7 @@ static const struct fs_context_operations proc_fs_context_ops = {
.free = proc_fs_context_free,
.parse_param = proc_parse_param,
.get_tree = proc_get_tree,
+ .reconfigure = proc_reconfigure,
};
static int proc_init_fs_context(struct fs_context *fc, struct dentry *reference)
@@ -310,7 +312,7 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
struct vfsmount *mnt;
int ret;
- fc = vfs_new_fs_context(&proc_fs_type, NULL, 0,
+ fc = vfs_new_fs_context(&proc_fs_type, NULL, 0, 0,
FS_CONTEXT_FOR_KERNEL_MOUNT);
if (IS_ERR(fc))
return PTR_ERR(fc);
diff --git a/fs/super.c b/fs/super.c
index 321fbc244570..3f3389e94344 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -920,32 +920,30 @@ struct super_block *user_get_super(dev_t dev)
}
/**
- * do_remount_sb - asks filesystem to change mount options.
- * @sb: superblock in question
- * @sb_flags: revised superblock flags
- * @data: the rest of options
- * @data_size: The size of the data
- * @force: whether or not to force the change
- * @fc: the superblock config for filesystems that support it
- * (NULL if called from emergency or umount)
+ * reconfigure_super - asks filesystem to change superblock parameters
+ * @fc: the superblock and configuration
*
- * Alters the mount options of a mounted file system.
+ * Alters the configuration parameters of a live superblock.
*/
-int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
- size_t data_size, int force, struct fs_context *fc)
+int reconfigure_super(struct fs_context *fc)
{
+ struct super_block *sb = fc->root->d_sb;
int retval;
- int remount_ro;
+ int remount_ro = false;
+ if (fc->sb_flags_mask & ~MS_RMT_MASK)
+ return -EINVAL;
if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
+ if (fc->sb_flags_mask & SB_RDONLY) {
#ifdef CONFIG_BLOCK
- if (!(sb_flags & SB_RDONLY) && bdev_read_only(sb->s_bdev))
- return -EACCES;
+ if (!(fc->sb_flags & SB_RDONLY) && bdev_read_only(sb->s_bdev))
+ return -EACCES;
#endif
- remount_ro = (sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ }
if (remount_ro) {
if (!hlist_empty(&sb->s_pins)) {
@@ -956,15 +954,16 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
return 0;
if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
- remount_ro = (sb_flags & SB_RDONLY) && !sb_rdonly(sb);
+ remount_ro = !sb_rdonly(sb);
}
}
shrink_dcache_sb(sb);
- /* If we are remounting RDONLY and current sb is read/write,
- make sure there are no rw files opened */
+ /* If we are reconfiguring to RDONLY and current sb is read/write,
+ * make sure there are no files open for writing.
+ */
if (remount_ro) {
- if (force) {
+ if (fc->purpose == FS_CONTEXT_FOR_EMERGENCY_RO) {
sb->s_readonly_remount = 1;
smp_wmb();
} else {
@@ -974,29 +973,21 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
}
}
- if (sb->s_op->reconfigure ||
- sb->s_op->remount_fs) {
- if (sb->s_op->reconfigure) {
- retval = sb->s_op->reconfigure(sb, fc);
- if (fc)
- sb_flags = fc->sb_flags;
- else
- sb_flags = sb->s_flags;
- if (retval == 0)
- security_sb_reconfigure(fc);
+ if (fc->ops->reconfigure) {
+ retval = fc->ops->reconfigure(fc);
+ if (retval == 0) {
+ security_sb_reconfigure(fc);
} else {
- retval = sb->s_op->remount_fs(sb, &sb_flags,
- data, data_size);
- }
- if (retval) {
- if (!force)
+ if (fc->purpose != FS_CONTEXT_FOR_EMERGENCY_RO)
goto cancel_readonly;
/* If forced remount, go ahead despite any errors */
WARN(1, "forced remount of a %s fs returned %i\n",
sb->s_type->name, retval);
}
}
- sb->s_flags = (sb->s_flags & ~MS_RMT_MASK) | (sb_flags & MS_RMT_MASK);
+
+ WRITE_ONCE(sb->s_flags, ((sb->s_flags & ~fc->sb_flags_mask) |
+ (fc->sb_flags & fc->sb_flags_mask)));
/* Needs to be ordered wrt mnt_is_readonly() */
smp_wmb();
sb->s_readonly_remount = 0;
@@ -1020,14 +1011,22 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data,
static void do_emergency_remount_callback(struct super_block *sb)
{
+ struct fs_context fc = {
+ .purpose = FS_CONTEXT_FOR_EMERGENCY_RO,
+ .fs_type = sb->s_type,
+ .root = sb->s_root,
+ .sb_flags = SB_RDONLY,
+ .sb_flags_mask = SB_RDONLY,
+ };
+
down_write(&sb->s_umount);
if (sb->s_root && sb->s_bdev && (sb->s_flags & SB_BORN) &&
- !sb_rdonly(sb)) {
+ !sb_rdonly(sb))
+ /* Might want to call ->init_fs_context(). */
/*
* What lock protects sb->s_flags??
*/
- do_remount_sb(sb, SB_RDONLY, NULL, 0, 1, NULL);
- }
+ reconfigure_super(&fc);
up_write(&sb->s_umount);
}
@@ -1416,6 +1415,42 @@ struct dentry *mount_nodev(struct file_system_type *fs_type,
}
EXPORT_SYMBOL(mount_nodev);
+static int reconfigure_single(struct super_block *s,
+ int flags, void *data, size_t data_size)
+{
+ struct fs_context *fc;
+ int ret;
+
+ /* The caller really needs to be passing fc down into mount_single(),
+ * then a chunk of this can be removed. Better yet, reconfiguration
+ * shouldn't happen, but rather the second mount should be rejected if
+ * the parameters are not compatible.
+ */
+ fc = vfs_new_fs_context(s->s_type, s->s_root, flags, MS_RMT_MASK,
+ FS_CONTEXT_FOR_RECONFIGURE);
+ if (IS_ERR(fc))
+ return PTR_ERR(fc);
+
+ ret = parse_monolithic_mount_data(fc, data, data_size);
+ if (ret < 0)
+ goto out;
+
+ if (fc->ops->validate) {
+ ret = fc->ops->validate(fc);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = security_fs_context_validate(fc);
+ if (ret)
+ goto out;
+
+ ret = reconfigure_super(fc);
+out:
+ put_fs_context(fc);
+ return ret;
+}
+
static int compare_single(struct super_block *s, void *p)
{
return 1;
@@ -1433,15 +1468,19 @@ struct dentry *mount_single(struct file_system_type *fs_type,
return ERR_CAST(s);
if (!s->s_root) {
error = fill_super(s, data, data_size, flags & SB_SILENT ? 1 : 0);
- if (error) {
- deactivate_locked_super(s);
- return ERR_PTR(error);
- }
+ if (error)
+ goto error;
s->s_flags |= SB_ACTIVE;
} else {
- do_remount_sb(s, flags, data, data_size, 0, NULL);
+ error = reconfigure_single(s, flags, data, data_size);
+ if (error)
+ goto error;
}
return dget(s->s_root);
+
+error:
+ deactivate_locked_super(s);
+ return ERR_PTR(error);
}
EXPORT_SYMBOL(mount_single);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 053d53861995..1300d77efe96 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1853,7 +1853,6 @@ struct super_operations {
int (*statfs) (struct dentry *, struct kstatfs *);
int (*get_fsinfo) (struct dentry *, struct fsinfo_kparams *);
int (*remount_fs) (struct super_block *, int *, char *, size_t);
- int (*reconfigure) (struct super_block *, struct fs_context *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 9a6aa6bcf745..5e79c33ade7d 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -34,6 +34,8 @@ enum fs_context_purpose {
FS_CONTEXT_FOR_KERNEL_MOUNT, /* New superblock for kernel-internal mount */
FS_CONTEXT_FOR_SUBMOUNT, /* New superblock for automatic submount */
FS_CONTEXT_FOR_RECONFIGURE, /* Superblock reconfiguration (remount) */
+ FS_CONTEXT_FOR_UMOUNT, /* Reconfiguration to R/O for unmount */
+ FS_CONTEXT_FOR_EMERGENCY_RO, /* Emergency reconfiguration to R/O */
};
/*
@@ -102,6 +104,7 @@ struct fs_context {
void *security; /* The LSM context */
void *s_fs_info; /* Proposed s_fs_info */
unsigned int sb_flags; /* Proposed superblock flags (SB_*) */
+ unsigned int sb_flags_mask; /* Superblock flags that were changed */
enum fs_context_purpose purpose:8;
enum fs_context_phase phase:8; /* The phase the context is in */
bool sloppy:1; /* T if unrecognised options are okay */
@@ -116,6 +119,7 @@ struct fs_context_operations {
int (*parse_monolithic)(struct fs_context *fc, void *data, size_t data_size);
int (*validate)(struct fs_context *fc);
int (*get_tree)(struct fs_context *fc);
+ int (*reconfigure)(struct fs_context *fc);
};
/*
@@ -123,9 +127,9 @@ struct fs_context_operations {
*/
extern struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
struct dentry *reference,
- unsigned int ms_flags,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask,
enum fs_context_purpose purpose);
-extern struct fs_context *vfs_sb_reconfig(struct path *path, unsigned int ms_flags);
extern struct fs_context *vfs_dup_fs_context(struct fs_context *src);
extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9fdcdbbb67e9..a6da692792a3 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -370,6 +370,7 @@ int kernfs_get_tree(struct fs_context *fc);
void kernfs_free_fs_context(struct fs_context *fc);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
+int kernfs_reconfigure(struct fs_context *fc);
void kernfs_init(void);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index b1a62dc0b8d9..3cfa89f41bad 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -160,13 +160,6 @@
* @orig_data is the size of the original data
* @copy copied data which will be passed to the security module.
* Returns 0 if the copy was successful.
- * @sb_remount:
- * Extracts security system specific mount options and verifies no changes
- * are being made to those options.
- * @sb superblock being remounted
- * @data contains the filesystem-specific data.
- * @data_size contains the size of the data.
- * Return 0 if permission is granted.
* @sb_umount:
* Check permission before the @mnt file system is unmounted.
* @mnt contains the mounted file system.
@@ -1518,7 +1511,6 @@ union security_list_options {
int (*sb_alloc_security)(struct super_block *sb);
void (*sb_free_security)(struct super_block *sb);
int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
- int (*sb_remount)(struct super_block *sb, void *data, size_t data_size);
int (*sb_show_options)(struct seq_file *m, struct super_block *sb);
int (*sb_statfs)(struct dentry *dentry);
int (*sb_mount)(const char *dev_name, const struct path *path,
@@ -1865,7 +1857,6 @@ struct security_hook_heads {
struct hlist_head sb_alloc_security;
struct hlist_head sb_free_security;
struct hlist_head sb_copy_data;
- struct hlist_head sb_remount;
struct hlist_head sb_show_options;
struct hlist_head sb_statfs;
struct hlist_head sb_mount;
diff --git a/include/linux/security.h b/include/linux/security.h
index c73d9ba863bc..047b2cff1209 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -240,7 +240,6 @@ int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
int security_sb_alloc(struct super_block *sb);
void security_sb_free(struct super_block *sb);
int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
-int security_sb_remount(struct super_block *sb, void *data, size_t data_size);
int security_sb_show_options(struct seq_file *m, struct super_block *sb);
int security_sb_statfs(struct dentry *dentry);
int security_sb_mount(const char *dev_name, const struct path *path,
@@ -585,11 +584,6 @@ static inline int security_sb_copy_data(char *orig, size_t orig_size, char *copy
return 0;
}
-static inline int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
-{
- return 0;
-}
-
static inline int security_sb_show_options(struct seq_file *m,
struct super_block *sb)
{
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 0f102210f89e..0ac430f48800 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -403,7 +403,7 @@ static struct vfsmount *mq_create_mount(struct ipc_namespace *ns)
struct vfsmount *mnt;
int ret;
- fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0,
+ fc = vfs_new_fs_context(&mqueue_fs_type, NULL, 0, 0,
FS_CONTEXT_FOR_KERNEL_MOUNT);
if (IS_ERR(fc))
return ERR_CAST(fc);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 958b3fd81c56..6542c0c3e32f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2150,6 +2150,7 @@ static const struct fs_context_operations cgroup_fs_context_ops = {
.parse_param = cgroup_parse_param,
.validate = cgroup_validate,
.get_tree = cgroup_get_tree,
+ .reconfigure = kernfs_reconfigure,
};
/*
@@ -5281,7 +5282,6 @@ int cgroup_rmdir(struct kernfs_node *kn)
static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.show_options = cgroup_show_options,
.fsinfo = cgroup_fsinfo,
- .reconfigure = cgroup_reconfigure,
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
.show_path = cgroup_show_path,
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b02161a41d5a..b4ad1a52f006 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -327,7 +327,8 @@ static int cpuset_get_tree(struct fs_context *fc)
if (!cgroup_fs)
goto out;
- cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->purpose);
+ cg_fc = vfs_new_fs_context(cgroup_fs, NULL, fc->sb_flags, fc->sb_flags,
+ fc->purpose);
put_filesystem(cgroup_fs);
if (IS_ERR(cg_fc)) {
ret = PTR_ERR(cg_fc);
diff --git a/security/security.c b/security/security.c
index 2439a5613813..95b348484c5a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -415,11 +415,6 @@ int security_sb_copy_data(char *orig, size_t data_size, char *copy)
}
EXPORT_SYMBOL(security_sb_copy_data);
-int security_sb_remount(struct super_block *sb, void *data, size_t data_size)
-{
- return call_int_hook(sb_remount, 0, sb, data, data_size);
-}
-
int security_sb_show_options(struct seq_file *m, struct super_block *sb)
{
return call_int_hook(sb_show_options, 0, m, sb);
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 3d5b09c256c1..d9cfb8b2fca4 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7180,7 +7180,6 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),
- LSM_HOOK_INIT(sb_remount, selinux_sb_remount),
LSM_HOOK_INIT(sb_show_options, selinux_sb_show_options),
LSM_HOOK_INIT(sb_statfs, selinux_sb_statfs),
LSM_HOOK_INIT(sb_mount, selinux_mount),
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-01 15:27 ` [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context " David Howells
2018-08-06 17:28 ` Eric W. Biederman
2018-08-09 14:14 ` David Howells
@ 2018-08-09 14:24 ` David Howells
2018-08-09 14:35 ` Miklos Szeredi
` (3 more replies)
2 siblings, 4 replies; 70+ messages in thread
From: David Howells @ 2018-08-09 14:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: dhowells, viro, linux-api, torvalds, linux-fsdevel, linux-kernel
Eric W. Biederman <ebiederm@xmission.com> wrote:
> First let me thank you for adding both FSCONFIG_CMD_CREATE and
> FSCONFIG_CMD_RECONFIGURE. Unfortunately the implementation is currently
> broken. So this patch gets my:
>
> This is broken in two specific ways.
> ...
> 2) FSCONFIG_CMD_CREATE will succeed even if the superblock already
> exists and it can not use all of the superblock parameters.
>
> This happens because vfs_get_super will only call fill_super
> if the super block is created. Which is reasonable on the face
> of it. But it in practice this introduces security problems.
>
> a) Either through reconfiguring a shared super block you did not
> realize was shared (as we saw with devpts).
>
> b) Mounting a super block and not honoring it's mount options
> because something has already mounted it. As we see today
> with proc. Leaving userspace to think the filesystem will behave
> one way when in fact it behaves another.
>
> I have already explained this several times, and apparently I have been
> ignored. This fundamental usability issue that leads to security
> problems.
I've also explained why you're wrong or at least only partially right. I *do*
*not* want to implement sget() in userspace with the ability for userspace to
lock out other mount requests - which is what it appears that you've been
asking for.
However, as I have said, I *am* willing to add one or more flags to help with
this, but I can't make any "legacy" fs honour them as this requires the
fs_context to be passed down to sget_fc() and the filesystem - which is why I
was considering leaving it for later.
(1) An FSOPEN_EXCL flag to tell sget_fc() to fail if the superblock already
exists at all.
(2) An FSOPEN_FAIL_ON_MISMATCH flag to explicitly say that we *don't* want a
superblock with different parameters.
The implication of providing neither flag is that we are happy to accept a
superblock from the same source but with different parameters.
But it doesn't seem to be an absolute imperative to roll this out immediately,
since what I have exactly mirrors what the kernel currently does - and forcing
a change in that behaviour risks breaking userspace. If it keeps you happy,
however, I can try and work one up.
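
The three behaviours described above (neither flag, fail on mismatch, exclusive) can be sketched as a small userspace model. This is only an illustration of the proposed semantics, not kernel code: toy_get_sb(), the toy_sb types, the SHARE_* names and the choice of -EBUSY/-EINVAL as error codes are all assumptions made for this sketch and do not appear in the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policies modelled on the flags proposed in this thread. */
enum share_policy {
	SHARE_ANY,              /* neither flag: what mount(2) does today */
	SHARE_FAIL_ON_MISMATCH, /* models the proposed FSOPEN_FAIL_ON_MISMATCH */
	SHARE_EXCL,             /* models the proposed FSOPEN_EXCL */
};

/* Toy superblock: just a source name and parsed option bits. */
struct toy_sb {
	const char *source;
	unsigned int opts;
	int in_use;
};

#define TOY_MAX_SB 8

struct toy_sb_table {
	struct toy_sb sb[TOY_MAX_SB];
};

/*
 * Toy stand-in for an sget_fc()-style lookup.  Returns the slot index of
 * the superblock to use, or a negative errno.  The error codes are an
 * assumption of this sketch.
 */
int toy_get_sb(struct toy_sb_table *t, const char *source,
	       unsigned int opts, enum share_policy policy)
{
	int free_slot = -1;

	assert(t != NULL && source != NULL);

	for (int i = 0; i < TOY_MAX_SB; i++) {
		struct toy_sb *sb = &t->sb[i];

		if (!sb->in_use) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (strcmp(sb->source, source) != 0)
			continue;

		/* A superblock for this source already exists. */
		if (policy == SHARE_EXCL)
			return -EBUSY;
		if (policy == SHARE_FAIL_ON_MISMATCH && sb->opts != opts)
			return -EINVAL;
		/* Shared; under SHARE_ANY the requested opts are dropped. */
		return i;
	}

	if (free_slot < 0)
		return -ENOMEM;

	t->sb[free_slot].source = source;
	t->sb[free_slot].opts = opts;
	t->sb[free_slot].in_use = 1;
	return free_slot;
}
```

In this model, a second request for the same source with different option bits succeeds silently under SHARE_ANY and leaves the first caller's options in place, which is the behaviour being objected to; the other two policies turn that case into an error.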
David
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-09 14:24 ` David Howells
@ 2018-08-09 14:35 ` Miklos Szeredi
2018-08-09 15:32 ` Eric W. Biederman
` (2 subsequent siblings)
3 siblings, 0 replies; 70+ messages in thread
From: Miklos Szeredi @ 2018-08-09 14:35 UTC (permalink / raw)
To: David Howells
Cc: Eric W. Biederman, Al Viro, Linux API, Linus Torvalds,
linux-fsdevel, linux-kernel
On Thu, Aug 9, 2018 at 4:24 PM, David Howells <dhowells@redhat.com> wrote:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> First let me thank you for adding both FSCONFIG_CMD_CREATE and
>> FSCONFIG_CMD_RECONFIGURE. Unfortunately the implementation is currently
>> broken. So this patch gets my:
>>
>> This is broken in two specific ways.
>> ...
>> 2) FSCONFIG_CMD_CREATE will succeed even if the superblock already
>> exists and it can not use all of the superblock parameters.
>>
>> This happens because vfs_get_super will only call fill_super
>> if the super block is created. Which is reasonable on the face
>> of it. But it in practice this introduces security problems.
>>
>> a) Either through reconfiguring a shared super block you did not
>> realize was shared (as we saw with devpts).
>>
>> b) Mounting a super block and not honoring it's mount options
>> because something has already mounted it. As we see today
>> with proc. Leaving userspace to think the filesystem will behave
>> one way when in fact it behaves another.
>>
>> I have already explained this several times, and apparently I have been
>> ignored. This fundamental usability issue that leads to security
>> problems.
>
> I've also explained why you're wrong or at least only partially right. I *do*
> *not* want to implement sget() in userspace with the ability for userspace to
> lock out other mount requests - which is what it appears that you've been
> asking for.
>
> However, as I have said, I *am* willing to add one of more flags to help with
> this, but I can't make any "legacy" fs honour them as this requires the
> fs_context to be passed down to sget_fc() and the filesystem - which is why I
> was considering leaving it for later.
You can determine at fsopen() time whether the filesystem is able to
support the O_EXCL behavior? If so, then it's trivial to enable this
conditionally. I think that's what Eric is asking for, it's obviously
not fair to ask for a change in behavior of the legacy interface.
Thanks,
Miklos
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-09 14:24 ` David Howells
2018-08-09 14:35 ` Miklos Szeredi
@ 2018-08-09 15:32 ` Eric W. Biederman
2018-08-09 16:33 ` David Howells
2018-08-11 20:20 ` David Howells
3 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-09 15:32 UTC (permalink / raw)
To: David Howells; +Cc: viro, linux-api, torvalds, linux-fsdevel, linux-kernel
David Howells <dhowells@redhat.com> writes:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> First let me thank you for adding both FSCONFIG_CMD_CREATE and
>> FSCONFIG_CMD_RECONFIGURE. Unfortunately the implementation is currently
>> broken. So this patch gets my:
>>
>> This is broken in two specific ways.
>> ...
>> 2) FSCONFIG_CMD_CREATE will succeed even if the superblock already
>> exists and it can not use all of the superblock parameters.
>>
>> This happens because vfs_get_super will only call fill_super
>> if the super block is created. Which is reasonable on the face
>> of it. But it in practice this introduces security problems.
>>
>> a) Either through reconfiguring a shared super block you did not
>> realize was shared (as we saw with devpts).
>>
>> b) Mounting a super block and not honoring it's mount options
>> because something has already mounted it. As we see today
>> with proc. Leaving userspace to think the filesystem will behave
>> one way when in fact it behaves another.
>>
>> I have already explained this several times, and apparently I have been
>> ignored. This fundamental usability issue that leads to security
>> problems.
>
> I've also explained why you're wrong or at least only partially right. I *do*
> *not* want to implement sget() in userspace with the ability for userspace to
> lock out other mount requests - which is what it appears that you've been
> asking for.
All I really care about is that when you ask for a set of parameters
that you get a filesystem with that set of parameters. Not the same
filesystem mounted a different way.
That has gone badly wrong twice. There is no common case I know of that
requires returning the same filesystem twice. AKA the pain of the
existing semantics seems much much worse than any benefit. So I am
asking that we not propagate the existing semantics into the new API.
You are cleaning up dealing with mount options and this is one of the
places where they need cleaning up.
> However, as I have said, I *am* willing to add one of more flags to help with
> this, but I can't make any "legacy" fs honour them as this requires the
> fs_context to be passed down to sget_fc() and the filesystem - which is why I
> was considering leaving it for later.
>
> (1) An FSOPEN_EXCL flag to tell sget_fc() to fail if the superblock already
> exists at all.
>
> (2) An FSOPEN_FAIL_ON_MISMATCH flag to explicitly say that we *don't* want a
> superblock with different parameters.
>
> The implication of providing neither flag is that we are happy to accept a
> superblock from the same source but with different parameters.
>
> But it doesn't seem to be an absolute imperative to roll this out immediately,
> since what I have exactly mirrors what the kernel currently does - and forcing
> a change in that behaviour risks breaking userspace. If it keeps you happy,
> however, I can try and work one up.
What I am asking is that the default behavior for the new API when using
FSCONFIG_CMD_CREATE is to call sget_fc with either FSOPEN_EXCL or
FSOPEN_FAIL_ON_MISMATCH. I know FSOPEN_EXCL is trivial to implement. I
don't know if there are any hidden gotchas with
FSOPEN_FAIL_ON_MISMATCH.
This change in default behavior for your patch needs to be implemented
before this hits a released kernel. Returning a filesystem with
parameters other than those requested has resulted in at least two
major issues that are very hard to clean up after the fact. A chroot
system changing the parameters on /dev/pts resulted in some
distributions keeping the suid pt_chown binary long past its best-before
date, and other distributions instead choosing to break userspace. Then
there is the current issue where in practice proc does not honor any of
its mount parameters, which breaks the Android security model.
The fact that these things happen silently and you have to be on your
toes to catch them is fundamentally a bug in the API. If the mount
request had simply failed people would have noticed the issues much
sooner and silently broken configurations would not have propagated. As
such I do not believe we should propagate this misfeature from the old
API into the new API.
Conceptually I like FSOPEN_FAIL_ON_MISMATCH as it looks like it is
sufficient to the needs, and with a little luck we could even change
the old API to those semantics.
Ultimately I want to close a giant mental model mismatch.
User: I am creating the data structures to read filesystem X
with parameters Y.
Kernel: He wants filesystem X. If it is a slow day use parameters Y.
Given that historically the reuse of a superblock did not exist, and
that in practice it almost never happens, it is quite reasonable for
users to not expect the kernel to completely ignore the mount parameters
they pass to the kernel.
So please let's fix that now when it is easy.
Eric
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-09 14:24 ` David Howells
2018-08-09 14:35 ` Miklos Szeredi
2018-08-09 15:32 ` Eric W. Biederman
@ 2018-08-09 16:33 ` David Howells
2018-08-11 20:20 ` David Howells
3 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-09 16:33 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, Eric W. Biederman, Al Viro, Linux API, Linus Torvalds,
linux-fsdevel, linux-kernel
Miklos Szeredi <miklos@szeredi.hu> wrote:
> > However, as I have said, I *am* willing to add one of more flags to help
> > with this, but I can't make any "legacy" fs honour them as this requires
> > the fs_context to be passed down to sget_fc() and the filesystem - which
> > is why I was considering leaving it for later.
>
> You can determine at fsopen() time whether the filesystem is able to
> support the O_EXCL behavior? If so, then it's trivial to enable this
> conditionally. I think that's what Eric is asking for, it's obviously
> not fair to ask for a change in behavior of the legacy interface.
What do you mean by "enable it conditionally"? It cannot be enabled for
filesystems that don't pass fs_context down to sget().
mount(2) mustn't enable it lest it break userspace.
fsopen(2) can let userspace set a flag to enable it.
David
* BUG: Mount ignores mount options
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (5 preceding siblings ...)
2018-08-01 15:27 ` [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-08-10 14:05 ` Eric W. Biederman
2018-08-10 14:36 ` Andy Lutomirski
` (3 more replies)
2018-08-10 15:11 ` David Howells
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
8 siblings, 4 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-10 14:05 UTC (permalink / raw)
To: David Howells
Cc: viro, John Johansen, Tejun Heo, selinux, Paul Moore, Li Zefan,
linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos Szeredi
There is a serious problem with mount options today that fsopen does not
address. The problem is that mount options are ignored for block based
filesystems, and any other type of filesystem that follows the same
pattern.
The script below demonstrates this bug, showing that it can cause the
ext4 "acl", "quota" and "user_xattr" options to be silently ignored.
fsopen has my nack until it addresses this issue.
I don't know if we can fix this in the context of sys_mount. But if
we are redoing the option parsing of how we mount filesystems, this needs
to be fixed before we start worrying about bug compatibility.
Hopefully this report is simple and clear enough that we can at least
agree on the problem.
Eric
# cat ~/bin/bdev-loop0.sh
#!/bin/sh
set -x
set -e
LOOP=loop0
dd if=/dev/zero bs=1024 count=1048576 of=$LOOP-file
losetup /dev/$LOOP $LOOP-file
mkfs.ext4 /dev/$LOOP
mkdir $LOOP-noacl-noquota-nouser_xattr
mount -t ext4 /dev/$LOOP -o "noacl,noquota,nouser_xattr" $LOOP-noacl-noquota-nouser_xattr
mkdir $LOOP-acl-quota-user_xattr
mount -t ext4 /dev/$LOOP -o "acl,quota,user_xattr" $LOOP-acl-quota-user_xattr
cat /proc/mounts | grep loop0
root@finagle:~# ~/bin/bdev-loop0.sh
+ set -e
+ LOOP=loop0
+ dd if=/dev/zero bs=1024 count=1048576 of=loop0-file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 4.37645 s, 245 MB/s
+ losetup /dev/loop0 loop0-file
+ mkfs.ext4 /dev/loop0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
+ mkdir loop0-noacl-noquota-nouser_xattr
+ mount -t ext4 /dev/loop0 -o noacl,noquota,nouser_xattr loop0-noacl-noquota-nouser_xattr
+ mkdir loop0-acl-quota-user_xattr
+ mount -t ext4 /dev/loop0 -o acl,quota,user_xattr loop0-acl-quota-user_xattr
+ + grep loop0
cat /proc/mounts
/dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
/dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
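
The mismatch in the last two lines can also be checked mechanically by looking for a requested option as a whole token in the option field of a /proc/mounts line. The helper below is a minimal sketch; toy_opt_present() is a hypothetical name, not an existing API. Matching whole tokens matters because "acl" is a substring of "noacl". Note too that filesystems often omit options that are at their default value from /proc/mounts, so a missing token alone is only a hint, not proof that the option was dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `opt` appears as a complete comma-separated token in
 * `optlist` (the fourth field of a /proc/mounts line).
 */
bool toy_opt_present(const char *optlist, const char *opt)
{
	size_t want;
	const char *p;

	assert(optlist != NULL && opt != NULL);

	want = strlen(opt);
	p = optlist;
	while (*p) {
		/* Length of the current token, up to the next comma. */
		size_t len = strcspn(p, ",");

		if (len == want && strncmp(p, opt, len) == 0)
			return true;
		p += len;
		if (*p == ',')
			p++;
	}
	return false;
}
```

Applied to the second mount above, the option field "rw,relatime,nouser_xattr,noacl" contains "noacl" but not the requested "acl" or "user_xattr".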
* Re: BUG: Mount ignores mount options
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
@ 2018-08-10 14:36 ` Andy Lutomirski
2018-08-10 15:17 ` Eric W. Biederman
2018-08-10 15:24 ` Al Viro
2018-08-10 15:11 ` Tetsuo Handa
` (2 subsequent siblings)
3 siblings, 2 replies; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-10 14:36 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Theodore Y. Ts'o
> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>
> There is a serious problem with mount options today that fsopen does not
> address. The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?
For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.
* Re: BUG: Mount ignores mount options
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (6 preceding siblings ...)
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
@ 2018-08-10 15:11 ` David Howells
2018-08-10 15:39 ` Theodore Y. Ts'o
` (3 more replies)
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
8 siblings, 4 replies; 70+ messages in thread
From: David Howells @ 2018-08-10 15:11 UTC (permalink / raw)
To: Eric W. Biederman
Cc: dhowells, viro, John Johansen, Tejun Heo, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklo
Eric W. Biederman <ebiederm@xmission.com> wrote:
> There is a serious problem with mount options today that fsopen does not
> address. The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or*
*else*, I'm working up a set of additional patches to give userspace the
option of whether they want no sharing; sharing, but only with exactly the
same parameters; or to ignore the parameter differences and just accept
sharing of what's already mounted (i.e. the current behaviour).
The second option, however, is not trivial as it needs to compare the fs
contexts, including the LSM parameters. To make that work, I really need to
remove the old security_mnt_opts stuff - which means I need to port btrfs to
the new context stuff.
We discussed this yesterday, and I proposed a solution, and I'm working on it.
Yes, I agree it would be nice to have, but it *doesn't* really need supporting
right this minute, since what I have now oughtn't to break the current
behaviour.
David
* Re: BUG: Mount ignores mount options
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
2018-08-10 14:36 ` Andy Lutomirski
@ 2018-08-10 15:11 ` Tetsuo Handa
2018-08-10 15:13 ` David Howells
2018-08-10 15:16 ` Al Viro
3 siblings, 0 replies; 70+ messages in thread
From: Tetsuo Handa @ 2018-08-10 15:11 UTC (permalink / raw)
To: Eric W. Biederman, David Howells
Cc: viro, John Johansen, Tejun Heo, selinux, Paul Moore, Li Zefan,
linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Johannes Weiner, Stephen Smalley, tomoyo-dev-en, cgroups,
torvalds, linux-fsdevel, linux-kernel, Theodore Y. Ts'o,
Miklos Szeredi
On 2018/08/10 23:05, Eric W. Biederman wrote:
>
> There is a serious problem with mount options today that fsopen does not
> address. The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> The script below demonstrates this bug. Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
>
> fsopen has my nack until it addresses this issue.
>
> I don't know if we can fix this in the context of sys_mount. But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
>
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.
>
> Eric
This might be related to syzbot failing to reproduce a problem:
https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ
syzbot found a reproducer, and the reproducer was working until next-20180803.
But the reproducer fails to reproduce this problem in next-20180806, despite
there being no mm-related change between next-20180803 and next-20180806.
Therefore, I suspect that the reproducer is no longer working as intended. And
there was a parser change (David Howells' patch) between next-20180803 and next-20180806.
I'm waiting for a response from David Howells...
* Re: BUG: Mount ignores mount options
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
2018-08-10 14:36 ` Andy Lutomirski
2018-08-10 15:11 ` Tetsuo Handa
@ 2018-08-10 15:13 ` David Howells
2018-08-10 15:16 ` Al Viro
3 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-10 15:13 UTC (permalink / raw)
To: Andy Lutomirski
Cc: dhowells, Eric W. Biederman, viro, John Johansen, Tejun Heo,
selinux, Paul Moore, Li Zefan, linux-api, apparmor,
Casey Schaufler, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel
Andy Lutomirski <luto@amacapital.net> wrote:
> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second mount
> ignored the options because the device was already mounted, right?
>
> For the new API, I think the only remotely sane approach is to refuse to
> mount or init or whatever you call it an already mounted bdev. If user code
> genuinely needs to bind-mount an existing mount that is known only by its
> bdev, we can add a specific API just for that.
I'm adding some flags to fsopen() to allow userspace to say whether it wants
no sharing, same parameters-only sharing or anything-goes sharing (as now).
I'm also adding a flag whereby userspace can forbid anyone else from sharing a
new superblock it has just set up.
David
* Re: BUG: Mount ignores mount options
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
` (2 preceding siblings ...)
2018-08-10 15:13 ` David Howells
@ 2018-08-10 15:16 ` Al Viro
[not found] ` <20180810151606.GA6515-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
3 siblings, 1 reply; 70+ messages in thread
From: Al Viro @ 2018-08-10 15:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, John Johansen, Tejun Heo, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos
On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
>
> There is a serious problem with mount options today that fsopen does not
> address. The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> The script below demonstrates this bug. Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
>
> fsopen has my nack until it addresses this issue.
>
> I don't know if we can fix this in the context of sys_mount. But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
>
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.
Sure, it is simple. So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
would get passed to filesystems, so that Eric would be able to implement
his mount(2)-incompatible behaviour at leisure, without worrying about
compatibility issues.
Does that address your complaint? Because one thing we are not going
to do is changing mount(2) behaviour. Reason: userland-visible
behaviour of hell knows how many local scripts. Another thing that
is flat-out not feasible is some kind of blanket "compare options"
stuff; it *can* be done as helpers to be used by filesystem when
it sees that new flag, but it's simply not going to work at the
fs-independent level. Trivial example with the same ext4:
mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
ext4 can tell that these are the same. syscall itself has no
clue. What's more, it's not just explicitly spelled default
options - it's the stuff that has more than one form. And while
we are at it, the things like two NFS mounts of different trees
from the same server; they might or might not get the same superblock.
Depending upon the options.
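
The bsddf example can be made concrete with a toy comparison. This is a sketch under stated assumptions: the parser below knows only the bsddf/minixdf pair, TOY_MINIXDF is a made-up bit, and toy_parse_opts()/toy_same_config() are hypothetical names; real ext4 option parsing is far larger. The point is only that a raw string comparison sees "bsddf" and "" as different, while a filesystem-aware parse sees the same configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag bit: clear means bsddf (the default), set means minixdf. */
#define TOY_MINIXDF 0x1u

/*
 * Parse a comma-separated option string into flag bits.  Later options
 * override earlier ones.  Returns false on an option this toy parser
 * does not know.
 */
bool toy_parse_opts(const char *opts, unsigned int *flags)
{
	const char *p;

	assert(opts != NULL && flags != NULL);

	*flags = 0; /* start from the defaults */
	p = opts;
	while (*p) {
		size_t len = strcspn(p, ",");

		if (len == 5 && strncmp(p, "bsddf", len) == 0)
			*flags &= ~TOY_MINIXDF;
		else if (len == 7 && strncmp(p, "minixdf", len) == 0)
			*flags |= TOY_MINIXDF;
		else if (len != 0)
			return false;
		p += len;
		if (*p == ',')
			p++;
	}
	return true;
}

/* Two option strings match if they parse to the same flag bits. */
bool toy_same_config(const char *a, const char *b)
{
	unsigned int fa, fb;

	if (!toy_parse_opts(a, &fa) || !toy_parse_opts(b, &fb))
		return false;
	return fa == fb;
}
```

Here strcmp("bsddf", "") reports a difference, while toy_same_config("bsddf", "") reports a match. Only code that knows the filesystem's defaults and aliases can make that call, which is why the comparison has to live in the filesystem (or in a helper the filesystem opts into) rather than in the syscall layer.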
Convenience helper that would allow ext4 to compare options and reject
the incompatible mount? Not sure how much ext4-specific knowledge
would have to go in it, but if you can come up with one - more power
to you. But the decision to use it *must* be ext4-specific. Because
for e.g. NFS such thing as -o fsid=..., while certainly a part of
options, has a very different meaning - it's "use a separate fs instance"
(and let the server deal with coherency issues on its end).
Decision to use sget() (and the way it's used) is up to filesystem.
We *can't* lift that into syscall. Not without breaking the fuck out
of existing behaviour.
Having something like a second callback for mount_bdev() that would
be called when we'd found an existing instance for the same block
device? Sure, no problem. Having a helper for doing such comparison
that would work in enough cases to bother, so that different fs
could avoid boilerplate in that callback? Again, more power to you.
But I don't see what the hell that has to do with the syscall interface.
* Re: BUG: Mount ignores mount options
2018-08-10 14:36 ` Andy Lutomirski
@ 2018-08-10 15:17 ` Eric W. Biederman
2018-08-10 15:24 ` Al Viro
1 sibling, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-10 15:17 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Howells, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Theodore Y. Ts'o
Andy Lutomirski <luto@amacapital.net> writes:
>> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>>
>> There is a serious problem with mount options today that fsopen does not
>> address. The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>>
>
>> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second
> mount ignored the options because the device was already mounted,
> right?
Yes.
> For the new API, I think the only remotely sane approach is to refuse
> to mount or init or whatever you call it an already mounted bdev. If
> user code genuinely needs to bind-mount an existing mount that is
> known only by its bdev, we can add a specific API just for that.
Eric
* Re: BUG: Mount ignores mount options
2018-08-10 14:36 ` Andy Lutomirski
2018-08-10 15:17 ` Eric W. Biederman
@ 2018-08-10 15:24 ` Al Viro
1 sibling, 0 replies; 70+ messages in thread
From: Al Viro @ 2018-08-10 15:24 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Eric W. Biederman, David Howells, John Johansen, Tejun Heo,
selinux, Paul Moore, Li Zefan, linux-api, apparmor,
Casey Schaufler, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Theo
On Fri, Aug 10, 2018 at 07:36:17AM -0700, Andy Lutomirski wrote:
>
>
> > On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> >
> > There is a serious problem with mount options today that fsopen does not
> > address. The problem is that mount options are ignored for block based
> > filesystems, and any other type of filesystem that follows the same
> > pattern.
> >
>
> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?
>
> For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.
First of all, that does NOT belong anywhere other than fs itself.
Example: NFS. Not every attempt to mount something leads to creation
of new fs instance; moreover, whether it will or not can't be predicted
in general.
PS: for pity's sake, fix your MUA; 270-character lines are way over the
top.
* Re: BUG: Mount ignores mount options
2018-08-10 15:11 ` David Howells
@ 2018-08-10 15:39 ` Theodore Y. Ts'o
2018-08-10 15:55 ` Casey Schaufler
` (2 more replies)
2018-08-10 15:53 ` David Howells
` (2 subsequent siblings)
3 siblings, 3 replies; 70+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 15:39 UTC (permalink / raw)
To: David Howells
Cc: Eric W. Biederman, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel
On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>
> Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already mounted (ie. the current behaviour).
But there's no way to support "no sharing", at least not in the
general case. A file system can only be mounted once, and without
file system support, there's no way for a file system to be mounted
with the bsddf or minixdf mount simultaneously.
Even *with* file system support, there's no way today for the VFS to
keep track of whether a pathname resolution came through one
mountpoint or another, so I can't do something like this:
mount /dev/sdXX -o casefold /android-data
mount /dev/sdXX -o nocasefold /android-data-2
Which is a pity, since if we could we could much more easily get rid
of the horror which is Android's wrapfs...
So if the file system has been mounted with one set of mount options,
and you want to try to mount it with a conflicting set of mount
options and you don't want it to silently ignore the mount options,
the *only* thing we can do today is to refuse the mount and return an
error.
I'm not sure Eric would really consider that an improvement for the
container use case....
- Ted
P.S. And as Al has pointed out, this would require special, per-file
system support to determine whether the mount options are conflicting
or not....
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 15:11 ` David Howells
2018-08-10 15:39 ` Theodore Y. Ts'o
@ 2018-08-10 15:53 ` David Howells
2018-08-10 16:14 ` Theodore Y. Ts'o
2018-08-11 1:19 ` Eric W. Biederman
[not found] ` <87pnyphf8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
3 siblings, 1 reply; 70+ messages in thread
From: David Howells @ 2018-08-10 15:53 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: dhowells, Eric W. Biederman, viro, John Johansen, Tejun Heo,
selinux, Paul Moore, Li Zefan, linux-api, apparmor,
Casey Schaufler, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel
Theodore Y. Ts'o <tytso@mit.edu> wrote:
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
Ummm... Isn't that encoded in the vfsmount pointer in struct path?
However, the case folding stuff - is that a superblockism or a mountpointism?
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.
With fsopen() there is the option to have the filesystem and the LSM attempt
to compare the non-key[*] mount options and reject the attempt to share if
they differ in any way.
David
[*] sget lookup keys, that is.
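
As an illustration of the comparison described above, here is a minimal
user-space sketch in C. It is not kernel code: struct fs_instance, the OPT_*
bits and check_share() are hypothetical names, the device name merely stands
in for the sget lookup key, and a real filesystem (and LSM) would need its own
notion of which options count as differing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a mounted superblock: the device name plays
 * the role of the lookup key, the option bits are the non-key parameters. */
struct fs_instance {
    const char *dev;      /* lookup key */
    unsigned int opts;    /* non-key options */
};

#define OPT_ACL        (1u << 0)   /* hypothetical option bits */
#define OPT_USER_XATTR (1u << 1)

/* Return the existing instance for dev, or NULL if there is none. */
static const struct fs_instance *
find_instance(const struct fs_instance *tbl, size_t n, const char *dev)
{
    assert(dev != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tbl[i].dev, dev) == 0)
            return &tbl[i];
    return NULL;
}

/*
 * Decide what a mount request should do:
 *   0       no existing instance, a new one may be created
 *   1       existing instance with identical options, safe to share
 *   -EBUSY  existing instance with different options, reject
 */
int check_share(const struct fs_instance *tbl, size_t n,
                const char *dev, unsigned int wanted_opts)
{
    const struct fs_instance *sb = find_instance(tbl, n, dev);

    if (!sb)
        return 0;
    return sb->opts == wanted_opts ? 1 : -EBUSY;
}
```

With a table that holds /dev/loop0 mounted without acl or user_xattr, a second
request asking for OPT_ACL | OPT_USER_XATTR gets -EBUSY rather than silently
sharing the existing instance.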
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 15:39 ` Theodore Y. Ts'o
@ 2018-08-10 15:55 ` Casey Schaufler
2018-08-10 16:11 ` David Howells
2018-08-10 18:00 ` Eric W. Biederman
2 siblings, 0 replies; 70+ messages in thread
From: Casey Schaufler @ 2018-08-10 15:55 UTC (permalink / raw)
To: Theodore Y. Ts'o, David Howells, Eric W. Biederman, viro,
John Johansen, Tejun Heo, selinux, Paul Moore, Li Zefan,
linux-api, apparmor, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-ke
Cc: Casey Schaufler
On 8/10/2018 8:39 AM, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already mounted (ie. the current behaviour).
> But there's no way to support "no sharing", at least not in the
> general case. A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> mount /dev/sdXX -o casefold /android-data
> mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....
>
> - Ted
>
> P.S. And as Al has pointed out, this would require special, per-file
> system support to determine whether the mount options are conflicting
> or not....
This extends to LSMs that support mount options (SELinux and Smack)
as well.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 15:39 ` Theodore Y. Ts'o
2018-08-10 15:55 ` Casey Schaufler
@ 2018-08-10 16:11 ` David Howells
2018-08-10 18:00 ` Eric W. Biederman
2 siblings, 0 replies; 70+ messages in thread
From: David Howells @ 2018-08-10 16:11 UTC (permalink / raw)
To: Casey Schaufler
Cc: dhowells, Theodore Y. Ts'o, Eric W. Biederman, viro,
John Johansen, Tejun Heo, selinux, Paul Moore, Li Zefan,
linux-api, apparmor, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Mi
Casey Schaufler <casey@schaufler-ca.com> wrote:
> > P.S. And as Al has pointed out, this would require special, per-file
> > system support to determine whether the mount options are conflicting
> > or not....
>
> This extends to LSMs that support mount options (SELinux and Smack)
> as well.
Yes. I'm doing that.
David
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 15:53 ` David Howells
@ 2018-08-10 16:14 ` Theodore Y. Ts'o
2018-08-10 20:06 ` Andy Lutomirski
[not found] ` <20180810161400.GA627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
0 siblings, 2 replies; 70+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 16:14 UTC (permalink / raw)
To: David Howells
Cc: Eric W. Biederman, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel
On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> > Even *with* file system support, there's no way today for the VFS to
> > keep track of whether a pathname resolution came through one
> > mountpoint or another, so I can't do something like this:
>
> Ummm... Isn't that encoded in the vfsmount pointer in struct path?
Well, yes, and we do use this as a hack to make read-only bind mounts
work. But that's done as a special case, and it's for permissions
checking only.
The big problem is that there is a single dentry cache object regardless
of which mount point was used to access it. So that makes it
impossible to support case folding as a mount-pointism.
>
> However, the case folding stuff - is that a superblockism or a mountpointism?
It's a superblock-ism. As far as I know the *only* thing that we can
support as a mount-pointism is the ro flag, and that's handled as a
special case, and only if the original superblock was mounted
read/write. That was my point; aside from the ro flag, we can't
support any other mount options as a per-mount point thing, so the
only thing we can do is to fail the mount if there are conflicting
mount options. And I'm not really sure it helps the container use
case, since the whole point is they want their "guest" to be able to
blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
fact that in some other container, someone had run "mount /dev/sda1 -o
xattr /mnt". But having the second mount fail because of conflicting
mount options breaks the illusion that containers are functionally as
rich as VM's.
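
The superblock-ism versus mount-pointism distinction can be pictured with a
small toy model in C. This is only a sketch under simplifying assumptions: the
toy_super and toy_mount types and both helpers are hypothetical, and the real
VFS structures carry far more state. It shows a per-mount read-only flag that
can only restrict a read/write superblock, while a property such as case
folding is answered by the shared superblock alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: one superblock shared by several mounts. */
struct toy_super {
    bool read_only;           /* superblock-wide: applies to every mount */
    bool casefold;            /* superblock-wide: cannot differ per mount */
};

struct toy_mount {
    const struct toy_super *sb;
    bool read_only;           /* per-mount flag, the one special case */
};

/* Writes are allowed only if neither the mount nor the superblock is ro. */
bool toy_mount_writable(const struct toy_mount *m)
{
    assert(m != NULL && m->sb != NULL);
    return !m->read_only && !m->sb->read_only;
}

/* Case folding is decided by the superblock alone; the mount has no say. */
bool toy_mount_casefold(const struct toy_mount *m)
{
    assert(m != NULL && m->sb != NULL);
    return m->sb->casefold;
}
```

Two mounts of the same toy superblock can disagree about writability, but
toy_mount_casefold() necessarily returns the same answer for both.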
So before you put in lots of work to support rejecting the attempted
mount if the mount options conflict, are we sure people will actually
find this to be useful? Because it's not only fsopen() work for you,
but each file system is going to have to implement new functions to
answer the question "are these mount options conflicting or not?".
Are we sure it's worth the effort?
- Ted
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 15:39 ` Theodore Y. Ts'o
2018-08-10 15:55 ` Casey Schaufler
2018-08-10 16:11 ` David Howells
@ 2018-08-10 18:00 ` Eric W. Biederman
2 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-10 18:00 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: David Howells, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Miklos Szeredi
"Theodore Y. Ts'o" <tytso@mit.edu> writes:
> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>>
>> Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already mounted (ie. the current behaviour).
>
> But there's no way to support "no sharing", at least not in the
> general case. A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> mount /dev/sdXX -o casefold /android-data
> mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....
I think I would consider it an improvement. I keep running into cases
where the mount options differed and something was done silently and
that causes problems.
Eric
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 16:14 ` Theodore Y. Ts'o
@ 2018-08-10 20:06 ` Andy Lutomirski
2018-08-10 20:46 ` Theodore Y. Ts'o
2018-08-13 16:35 ` Alan Cox
[not found] ` <20180810161400.GA627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
1 sibling, 2 replies; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-10 20:06 UTC (permalink / raw)
To: Theodore Y. Ts'o, David Howells, Eric W. Biederman, Al Viro,
John Johansen, Tejun Heo, SELinux-NSA, Paul Moore, Li Zefan,
Linux API, apparmor, Casey Schaufler, Fenghua Yu,
Greg Kroah-Hartman, Eric Biggers, LSM List, Tetsuo Handa,
Johannes Weiner, Stephen Smalley, tomoyo-dev-en
On Fri, Aug 10, 2018 at 9:14 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt". But having the second mount fail because of conflicting
> mount options breaks the illusion that containers are functionally as
> rich as VM's.
If the same block device is visible, with rw access, in two different
containers, I don't see how anything good can happen. Sure, with the
current somewhat erratic semantics of mount(2), something kind of sort
of reasonable happens if they both mount it. But if one or both of
them try to use, say, tune2fs or fsck, it's not going to go well. And
a situation where they mount with different options and the result
depends on the order of the mounts is just plain bad.
I see four sane ways to deal with this:
1. Don't put the block device in the container at all. The container
manager mounts it.
2. Use seccomp or a similar mechanism to intercept and emulate the
mount request.
3. Teach the filesystem driver to do something sensible. This will
inherently be per-fs, and probably involves some serious magic or
allowing filesystem-specific vfsmount options.
4. Introduce a concept of a special kind of fake block device that
refers to an existing superblock, doesn't allow direct read or write,
and does the right thing when mounted. Not obviously worth the
effort.
It seems to me that the current approach mostly involves crossing our fingers.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 20:06 ` Andy Lutomirski
@ 2018-08-10 20:46 ` Theodore Y. Ts'o
2018-08-10 22:12 ` Darrick J. Wong
2018-08-13 16:35 ` Alan Cox
1 sibling, 1 reply; 70+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 20:46 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Howells, Eric W. Biederman, Al Viro, John Johansen,
Tejun Heo, SELinux-NSA, Paul Moore, Li Zefan, Linux API, apparmor,
Casey Schaufler, Fenghua Yu, Greg Kroah-Hartman, Eric Biggers,
LSM List, Tetsuo Handa, Johannes Weiner, Stephen Smalley,
tomoyo-dev-en, CONTROL
On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> If the same block device is visible, with rw access, in two different
> containers, I don't see how anything good can happen.
It's worse than that. I've fixed a lot of bugs which cause the kernel
to crash, and a few that might be levered into a privilege escalation
attack, when you mount a maliciously corrupted file system using ext4.
I'm told the security researcher filed similar reports with the
XFS community, and he was told, "that's what metadata checksums are
for; go away". Given how much time it takes to work with these
security researchers, I don't blame them.
But in light of that, I'd make a somewhat stronger statement. If you
let an untrusted container mount arbitrary block devices where they
have rw access to the underlying block device, nothing good can
happen. Period. :-)
Which is why I don't think the lack of being able to reject
"conflicting mount options" is really all that important. It
certainly shouldn't block the fsopen patch series. #1, it's a problem
we have today, and #2, I'm really not at all sure supporting bind mounts
via specifying block device was ever a good idea to begin with. And
#3, while I've been fixing ext4 against security issues caused by
maliciously corrupted file system images, I'm still sure that allowing
untrusted containers access to mount *any* file system via a block
device for which they have r/w access is a Really Bad Idea.
> It seems to me that the current approach mostly involves crossing our fingers.
Agreed!
- Ted
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 20:46 ` Theodore Y. Ts'o
@ 2018-08-10 22:12 ` Darrick J. Wong
2018-08-10 23:54 ` Theodore Y. Ts'o
0 siblings, 1 reply; 70+ messages in thread
From: Darrick J. Wong @ 2018-08-10 22:12 UTC (permalink / raw)
To: Theodore Y. Ts'o, Andy Lutomirski, David Howells,
Eric W. Biederman, Al Viro, John Johansen, Tejun Heo, SELinux-NSA,
Paul Moore, Li Zefan, Linux API, apparmor, Casey Schaufler,
Fenghua Yu, Greg Kroah-Hartman, Eric Biggers, LSM List,
Tetsuo Handa, Johannes Weiner, Stephen Smalley
On Fri, Aug 10, 2018 at 04:46:39PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> > If the same block device is visible, with rw access, in two different
> > containers, I don't see how anything good can happen.
>
> It's worse than that. I've fixed a lot of bugs which cause the kernel
> to crash, and a few that might be levered into a privilege escalation
> attack, when you mount a maliciously corrupted file system using ext4.
> I'm told the security researcher filed similar reports with the
> XFS community, and he was told, "that's what metadata checksums are
> for; go away".
Hey now, there was a little more nuance to it than that[1][2]. The
complaint in the first instance had much more to do with breaking
existing V4 filesystems by adding format requirements that mkfs didn't
know about when the filesystem was created. Yes, you can create V4
filesystems that will hang the system if the log was totally unformatted
and metadata updates are made, but OTOH it's fairly obvious when that
happens, you have to be root to mount a disk filesystem, and we try to
avoid breaking existing users.
XFS developers have been and will continue to examine security problems
when they are brought to our attention and strengthen validation as
needed to minimize the risk of incorrect behaviors, but filesystems are
complex machines, complex machinery is risky, and we arbitrate some of
that risk by requiring administrators to elect to mount an XFS.
> Given how much time it takes to work with these security researchers,
> I don't blame them.
>
> But in light of that, I'd make a somewhat stronger statement. If you
> let an untrusted container mount arbitrary block devices where they
> have rw access to the underlying block device, nothing good can
> happen. Period. :-)
>
> Which is why I don't think the lack of being able to reject
> "conflicting mount options" is really all that important. It
> certainly shouldn't block the fsopen patch series. #1, it's a problem
> we have today, and #2, I'm really not at all sure supporting bind mounts
> via specifying block device was ever a good idea to begin with. And
> #3, while I've been fixing ext4 against security issues caused by
> maliciously corrupted file system images, I'm still sure that allowing
> untrusted containers access to mount *any* file system via a block
> device for which they have r/w access is a Really Bad Idea.
>
> > It seems to me that the current approach mostly involves crossing our fingers.
>
> Agreed!
Crossing our fingers and demanding administrator intentionality when
mounting filesystems off some piece of storage.
--D
[1] https://lkml.org/lkml/2018/5/21/649
[2] https://lkml.org/lkml/2018/4/2/572
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 22:12 ` Darrick J. Wong
@ 2018-08-10 23:54 ` Theodore Y. Ts'o
[not found] ` <20180810235447.GK627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 23:54 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Eric Biggers, Tetsuo Handa, LKML, David Howells, SELinux-NSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley, Fenghua Yu,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Tejun Heo, Al Viro,
Andy Lutomirski, open list:CONTROL GROUP (CGROUP), Linux API,
Greg Kroah-Hartman, LSM List, Li Zefan, Eric W. Biederman,
Johannes Weiner, Linux FS Devel
On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> Hey now, there was a little more nuance to it than that[1][2]. The
> complaint in the first instance had much more to do with breaking
> existing V4 filesystems by adding format requirements that mkfs didn't
> know about when the filesystem was created. Yes, you can create V4
> filesystems that will hang the system if the log was totally unformatted
> and metadata updates are made, but OTOH it's fairly obvious when that
> happens, you have to be root to mount a disk filesystem, and we try to
> avoid breaking existing users.
I wasn't thinking about syzbot reports; I've largely written them off
as far as file system testing is concerned, but rather Wen Xu at
Georgia Tech, who is much more reasonable than Dmitry, and has helped
me out a lot; and has complained that the XFS folks haven't been
engaging with him.
In either case, both security researchers are fuzzing file system
images, and then fixing the checksums, and discovering that this can
lead to kernel crashes, and in a few cases, buffer overruns that can
lead to potential privilege escalations. Wen can generate reports
faster than syzbot, but at least he gives me file system images (as
opposed to having to dig them out of syzbot repro C files) and he
actually does some analysis and explains what he thinks is going on.
I don't think anyone was claiming that format requirements should be
added to ext4 or xfs file systems. But rather, that kernel code
should be made more robust against maliciously corrupted file system
images that have valid checksums. I've been more willing to work with
Wen; Dave has expressed the opinion that these are not realistic bug
reports, and since only root can mount file systems, it's not high
priority.
The reason why I bring this up here is that in container land, there
are those who believe that "container root" should be able to mount
file systems, and if the "container root" isn't trusted, the fact that
the "container root" can crash the host kernel, or worse, corrupt the
host kernel and break out of the container as a result, that would be
sad.
I was pretty sure most file system developers are on the same page
that allowing untrusted "container roots" the ability to mount
arbitrary block device file systems is insanity. Whether or not we
try to fix these sorts of bugs submitted by security researchers. :-)
- Ted
--
AppArmor mailing list
AppArmor@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/apparmor
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
[not found] ` <20180810161400.GA627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2018-08-11 0:28 ` Eric W. Biederman
0 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 0:28 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Eric Biggers, Tetsuo Handa, David Howells,
selinux-+05T5uksL2qpZYMLLGbcSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley,
fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Greg Kroah-Hartman,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
"Theodore Y. Ts'o" <tytso@mit.edu> writes:
> On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
>> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>>
>> > Even *with* file system support, there's no way today for the VFS to
>> > keep track of whether a pathname resolution came through one
>> > mountpoint or another, so I can't do something like this:
>>
>> However, the case folding stuff - is that a superblockism or a mountpointism?
>
> It's a superblock-ism. As far as I know the *only* thing that we can
> support as a mount-pointism is the ro flag, and that's handled as a
> special case, and only if the original superblock was mounted
> read/write. That was my point; aside from the ro flag, we can't
> support any other mount options as a per-mount point thing, so the
> only thing we can do is to fail the mount if there are conflicting
> mount options. And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt". But having the second mount fail because of conflicting
> mount options breaks the illusion that containers are functionally as
> rich as VM's.
Ted, this isn't about some container case.
It is about the fact that practically every filesystem in the kernel has
the behavior I have described and it means that if root is not super
careful root will shoot himself in the foot with the shotgun we have
pointed there.
It really is about losing ACLs or some other filesystem option.
Eric
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
[not found] ` <20180810235447.GK627-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2018-08-11 0:38 ` Darrick J. Wong
2018-08-11 1:32 ` Eric W. Biederman
0 siblings, 1 reply; 70+ messages in thread
From: Darrick J. Wong @ 2018-08-11 0:38 UTC (permalink / raw)
To: Theodore Y. Ts'o, Andy Lutomirski, David Howells,
Eric W. Biederman, Al Viro, John Johansen, Tejun Heo, SELinux-NSA,
Paul Moore, Li Zefan, Linux API, apparmor-nLRlyDuq1AZFpShjVBNYrg,
Casey Schaufler, Fenghua Yu, Greg Kroah-Hartman, Eric Biggers,
LSM List, Tetsuo Handa, Johannes Weiner, Stephen Smalley,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS,
open list:CONTROL GROUP (CGROUP)
On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> > Hey now, there was a little more nuance to it than that[1][2]. The
> > complaint in the first instance had much more to do with breaking
> > existing V4 filesystems by adding format requirements that mkfs didn't
> > know about when the filesystem was created. Yes, you can create V4
> > filesystems that will hang the system if the log was totally unformatted
> > and metadata updates are made, but OTOH it's fairly obvious when that
> > happens, you have to be root to mount a disk filesystem, and we try to
> > avoid breaking existing users.
>
> I wasn't thinking about syzbot reports; I've largely written them off
> as far as file system testing is concerned, but rather Wen Xu at
> Georgia Tech, who is much more reasonable than Dmitry, and has helped
> me out a lot; and has complained that the XFS folks haven't been
> engaging with him.
Ahh, ok. Yes, Wen has been easier to work with, and gives out
filesystem images. Hm, I'll go comb the bugzilla again...
> In either case, both security researchers are fuzzing file system
> images, and then fixing the checksums, and discovering that this can
> lead to kernel crashes, and in a few cases, buffer overruns that can
> lead to potential privilege escalations. Wen can generate reports
> faster than syzbot, but at least he gives me file system images (as
> opposed to having to dig them out of syzbot repro C files) and he
> actually does some analysis and explains what he thinks is going on.
(FWIW I tried to figure out how to add fs image dumping to syzbot and
whoa, that was horrifying.)
> I don't think anyone was claiming that format requirements should be
> added to ext4 or xfs file systems. But rather, that kernel code
> should be made more robust against maliciously corrupted file system
> images that have valid checksums. I've been more willing to work with
> Wen; Dave has expressed the opinion that these are not realistic bug
> reports, and since only root can mount file systems, it's not high
> priority.
I don't think they're high priority either, but they're at least worth
/some/ attention.
> The reason why I bring this up here is that in container land, there
> are those who believe that "container root" should be able to mount
> file systems, and if the "container root" isn't trusted, the fact that
> the "container root" can crash the host kernel, or worse, corrupt the
> host kernel and break out of the container as a result, that would be
> sad.
>
> I was pretty sure most file system developers are on the same page
> that allowing untrusted "container roots" the ability to mount
> arbitrary block device file systems is insanity.
Agreed.
> Whether or not we try to fix these sorts of bugs submitted by security
> researchers. :-)
and agreed. :)
--D
> - Ted
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
[not found] ` <20180810151606.GA6515-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2018-08-11 1:05 ` Eric W. Biederman
[not found] ` <87pnypiufr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2018-08-11 1:58 ` Al Viro
0 siblings, 2 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 1:05 UTC (permalink / raw)
To: Al Viro
Cc: Eric Biggers, Tetsuo Handa, David Howells,
selinux-+05T5uksL2qpZYMLLGbcSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley,
fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Greg Kroah-Hartman,
cgroups-u79uwXL29TY76Z2rM5mHXA, Theodore Y. Ts'o,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Al Viro <viro@ZenIV.linux.org.uk> writes:
> On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
>>
>> There is a serious problem with mount options today that fsopen does not
>> address. The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>>
>> The script below demonstrates this bug. Showing this bug can cause the
>> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
>>
>> fsopen has my nack until it addresses this issue.
>>
>> I don't know if we can fix this in the context of sys_mount. But if
>> we are redoing the option parsing of how we mount filesystems this needs
>> to be fixed before we start worrying about bug compatibility.
>>
>> Hopefully this report is simple and clear enough that we can at least
>> agree on the problem.
>
> Sure, it is simple. So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
> would get passed to filesystems, so that Eric would be able to implement
> his mount(2)-incompatible behaviour at leisure, without worrying about
> compatibility issues.
>
> Does that address your complaint?
Absolutely not.
My complaint is that the current implemented behavior of practically
every filesystem in the kernel, is that it will ignore mount options
when mounted a second time.
It is not some weird special case.
It is not some container thing.
It is that the behavior of mount(2) with practically every filesystem
type when that filesystem is already mounted somewhere else behaves
in ways no one would expect.
With the new fsopen api the easy thing to do is simply have CMD_CREATE
and CMD_BIND_INTERNAL and be done with it. CMD_CREATE would guarantee
that a new superblock is created. CMD_BIND_INTERNAL would only work with
an existing superblock. Then root would at least know that he is
connecting to an already mounted filesystem and could look at the
options etc and fail if he didn't like what he saw. No surprises, no
muss, no fuss; simple.
But I have been told the simple solution above is somehow unacceptable.
And an option to compare the mount options and see if they are the same
was offered. That would work too.
I just care that we define the semantics in such a way that it is not
easy for root to get confused and do something stupid that will bite
later, and that we build the infrastructure so that all filesystems
can implement it easily.
So yes this is 100% a question about how filesystems should behave with
respect to their options when mounted for a second time. That is what
Dave Howells' patchset is addressing.
> Because one thing we are not going to do is changing mount(2)
> behaviour.
I have not asked for that. I have asked that we get it right for
fsopen.
> Reason: userland-visible behaviour of hell knows how many local scripts.
> Another thing that
> is flat-out not feasible is some kind of blanket "compare options"
> stuff; it *can* be done as helpers to be used by filesystem when
> it sees that new flag, but it's simply not going to work at the
> fs-independent level.
>
> Trivial example with the same ext4:
> mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
> ext4 can tell that these are the same. syscall itself has no
> clue. What's more, it's not just explicitly spelled default
> options - it's the stuff that has more than one form. And while
> we are at it, the things like two NFS mounts of different trees
> from the same server; they might or might not get the same superblock.
> Depending upon the options.
>
> Convenience helper that would allow ext4 to compare options and reject
> the incompatible mount? Not sure how much ext4-specific knowledge
> would have to go in it, but if you can come up with one - more power
> to you. But the decision to use it *must* be ext4-specific. Because
> for e.g. NFS such thing as -o fsid=..., while certainly a part of
> options, has a very different meaning - it's "use a separate fs instance"
> (and let the server deal with coherency issues on its end).
>
> Decision to use sget() (and the way it's used) is up to filesystem.
> We *can't* lift that into syscall. Not without breaking the fuck out
> of existing behaviour.
I have never proposed that. See above. I may have talked in terms
of what sget does and muddied the waters. If so I apologize.
All I proposed was that we distinguish between a first mount and an
additional mount so that userspace knows the options will be ignored.
Then the code to replicate the current behavior can look like:
fd = fsopen(...);
fsconfig(fd, ...);
fsconfig(fd, ...);
fsconfig(fd, ...);
fsconfig(fd, ...);
fsconfig(fd, ...);
fsconfig(fd, ...);
fsconfig(fd, ...);
if (fsconfig(fd, CMD_CREATE) == -EBUSY) {
fsconfig(fd, CMD_BIND_INTERNAL);
}
But userspace would then be free to issue a warning or do something
else if CMD_CREATE returns -EBUSY.
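A minimal userspace sketch of the fallback above, written as a model
rather than against the real syscalls. CMD_CREATE and CMD_BIND_INTERNAL
are only proposals in this thread; cmd_create(), cmd_bind_internal(),
mount_model() and the in-memory table are hypothetical stand-ins that
mimic the proposed return values (-EBUSY when a superblock already
exists for the device).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel API. */
#define SB_TABLE_MAX 8

struct sb_model {
	char dev[32];   /* backing device, e.g. "/dev/sda1" */
	char opts[64];  /* options the superblock was created with */
};

static struct sb_model sb_table[SB_TABLE_MAX];
static int sb_count;

static struct sb_model *sb_find(const char *dev)
{
	for (int i = 0; i < sb_count; i++)
		if (strcmp(sb_table[i].dev, dev) == 0)
			return &sb_table[i];
	return NULL;
}

/* Models the proposed CMD_CREATE: succeeds only if a new superblock is made. */
static int cmd_create(const char *dev, const char *opts)
{
	struct sb_model *sb;

	if (sb_find(dev))
		return -EBUSY;
	if (sb_count == SB_TABLE_MAX)
		return -ENOMEM;
	sb = &sb_table[sb_count++];
	snprintf(sb->dev, sizeof(sb->dev), "%s", dev);
	snprintf(sb->opts, sizeof(sb->opts), "%s", opts);
	return 0;
}

/* Models the proposed CMD_BIND_INTERNAL: attaches to an existing superblock only. */
static int cmd_bind_internal(const char *dev, const char **active_opts)
{
	struct sb_model *sb = sb_find(dev);

	if (!sb)
		return -ENOENT;
	*active_opts = sb->opts;
	return 0;
}

/* Returns 0 for a fresh superblock, 1 for a reused one, negative errno on failure. */
static int mount_model(const char *dev, const char *opts)
{
	const char *active;
	int ret = cmd_create(dev, opts);

	if (ret != -EBUSY)
		return ret;
	ret = cmd_bind_internal(dev, &active);
	if (ret < 0)
		return ret;
	/* This is the point where userspace gets to notice and complain. */
	if (strcmp(active, opts) != 0)
		fprintf(stderr, "warning: %s already mounted with \"%s\"; \"%s\" ignored\n",
			dev, active, opts);
	return 1;
}
```

In this model the caller, not the kernel, decides whether a reused
superblock with different options is a warning or a hard failure.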
I don't know how the above wound up being construed as asking that the
code call sget directly but that is what has happened.
> Having something like a second callback for mount_bdev() that would
> be called when we'd found an existing instance for the same block
> device? Sure, no problem. Having a helper for doing such comparison
> that would work in enough cases to bother, so that different fs
> could avoid boilerplate in that callback? Again, more power to you.
Normal forms etc. If we want to do that it just requires a wee bit of
discipline. And if all of the option parsing is being rewritten and
retested anyway I don't see why we can't do something like that as well.
So it does not sound unreasonable to me.
It does sound like more work than what I was proposing.
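As a rough illustration of the "normal forms" idea, the sketch below
parses two option strings into a canonical struct with defaults applied
and compares the structs instead of the strings. It is a toy model, not
ext4's parser: struct opts_model, parse_opts_model() and
opts_equal_model() are hypothetical, only four option names are
recognised, and the "acl" default is an assumption made for the example
(bsddf being the default follows Al's example above).

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Canonical ("normal form") representation of a tiny option set. */
struct opts_model {
	bool bsddf;  /* "bsddf" / "minixdf" */
	bool acl;    /* "acl" / "noacl" */
};

/* Parse a comma-separated option string; unspecified options take defaults. */
static int parse_opts_model(const char *str, struct opts_model *out)
{
	char buf[128];
	char *tok;

	out->bsddf = true;  /* default, so "-o bsddf" equals no option at all */
	out->acl = true;    /* assumed default for this sketch */

	if (strlen(str) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, str);

	for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		if (strcmp(tok, "bsddf") == 0)
			out->bsddf = true;
		else if (strcmp(tok, "minixdf") == 0)
			out->bsddf = false;
		else if (strcmp(tok, "acl") == 0)
			out->acl = true;
		else if (strcmp(tok, "noacl") == 0)
			out->acl = false;
		else
			return -EINVAL;  /* unknown option: refuse to guess */
	}
	return 0;
}

/* Two option strings match if they normalise to the same settings. */
static bool opts_equal_model(const char *a, const char *b)
{
	struct opts_model oa, ob;

	if (parse_opts_model(a, &oa) < 0 || parse_opts_model(b, &ob) < 0)
		return false;
	return oa.bsddf == ob.bsddf && oa.acl == ob.acl;
}
```

Comparing after normalisation handles explicitly spelled defaults and
option ordering; which options belong in the struct at all remains a
per-filesystem decision.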
Eric
* Re: BUG: Mount ignores mount options
2018-08-10 15:11 ` David Howells
2018-08-10 15:39 ` Theodore Y. Ts'o
2018-08-10 15:53 ` David Howells
@ 2018-08-11 1:19 ` Eric W. Biederman
[not found] ` <87pnyphf8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
3 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 1:19 UTC (permalink / raw)
To: David Howells
Cc: viro, John Johansen, Tejun Heo, selinux, Paul Moore, Li Zefan,
linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos Szeredi
David Howells <dhowells@redhat.com> writes:
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> There is a serious problem with mount options today that fsopen does not
>> address. The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>
> Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already mounted (ie. the current behaviour).
>
> The second option, however, is not trivial as it needs to compare the fs
> contexts, including the LSM parameters. To make that work, I really need to
> remove the old security_mnt_opts stuff - which means I need to port btrfs to
> the new context stuff.
>
> We discussed this yesterday, and I proposed a solution, and I'm working on it.
I repeated this because, after some comments from Al on IRC yesterday
and Miklos's email reply, it appeared clear that I had not specified
what my issue was clearly enough for people reading the thread to
understand the problem that I see.
> Yes, I agree it would be nice to have, but it *doesn't* really need supporting
> right this minute, since what I have now oughtn't to break the current
> behaviour.
I am really reluctant to endorse anything that propagates the issues of
the current interface in the new mount interface.
Eric
* Re: BUG: Mount ignores mount options
2018-08-11 0:38 ` Darrick J. Wong
@ 2018-08-11 1:32 ` Eric W. Biederman
0 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 1:32 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Eric Biggers, Tetsuo Handa, LKML, David Howells, SELinux-NSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley, Fenghua Yu,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Tejun Heo, Al Viro,
Andy Lutomirski, open list:CONTROL GROUP (CGROUP),
Theodore Y. Ts'o, Linux API, Greg Kroah-Hartman, LSM List,
Li Zefan, Johannes Weiner, Linux FS Devel, Linus Torvalds
"Darrick J. Wong" <darrick.wong@oracle.com> writes:
> On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:
>> The reason why I bring this up here is that in container land, there
>> are those who believe that "container root" should be able to mount
>> file systems, and if the "container root" isn't trusted, the fact that
>> the "container root" can crash the host kernel, or worse, corrupt the
>> host kernel and break out of the container as a result, that would be
>> sad.
>>
>> I was pretty sure most file system developers are on the same page
>> that allowing untrusted "container roots" the ability to mount
>> arbitrary block device file systems is insanity.
>
> Agreed.
For me I am happy with fuse. That is sufficient to cover any container
use cases people have. If anyone comes bugging you for more I will be
happy to push back.
The only thing that containers have to do with this is I wind up
touching a lot of the kernel/user boundary so I get to see a lot of it
and sometimes see weird things.
Eric
* Re: BUG: Mount ignores mount options
[not found] ` <87pnypiufr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-08-11 1:46 ` Theodore Y. Ts'o
2018-08-11 4:48 ` Eric W. Biederman
0 siblings, 1 reply; 70+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-11 1:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Eric Biggers, Tetsuo Handa, David Howells,
selinux-+05T5uksL2qpZYMLLGbcSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley,
fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Greg Kroah-Hartman, Al Viro,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>
> My complaint is that the current implemented behavior of practically
> every filesystem in the kernel, is that it will ignore mount options
> when mounted a second time.
The file system is ***not*** mounted a second time.
The design bug is that we allow bind mounts to be specified via a
block device. A bind mount is not "a second mount" of the file
system. Bind mounts != mounts.
I had assumed we had allowed bind mounts to be specified via the block
device because of container use cases. If the container folks don't
want it, I would be pushing to simply not allow bind mounts to be
specified via block device at all.
The only reason why we should support it is because we don't want to
break scripts; and if the goal is not to break scripts, then we have
to keep to the current semantics, however broken you think it is.
- Ted
* Re: BUG: Mount ignores mount options
2018-08-11 1:05 ` Eric W. Biederman
[not found] ` <87pnypiufr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-08-11 1:58 ` Al Viro
2018-08-11 2:17 ` Al Viro
2018-08-13 12:54 ` Miklos Szeredi
1 sibling, 2 replies; 70+ messages in thread
From: Al Viro @ 2018-08-11 1:58 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, John Johansen, Tejun Heo, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
> All I proposed was that we distinguish between a first mount and an
> additional mount so that userspace knows the options will be ignored.
For pity sake, just what does it take to explain to you that your
notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
and may depend upon the pieces of state userland (especially in container)
simply does not have?
One more time, slowly:
mount -t nfs4 wank.example.org:/foo/bar /mnt/a
mount -t nfs4 wank.example.org:/baz/barf /mnt/b
yield the same superblock. Is anyone who mounts something over NFS
required to know if anybody else has mounted something from the same
server, and if so how the hell are they supposed to find that out,
so that they could decide whether they are creating the "first" or
"additional" mount, whatever that might mean in this situation?
And how, kernel-side, is that supposed to be handled by generic code
of any description?
While we are at it,
mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
is *NOT* the same superblock as the previous two.
> I don't know how the above wound up being construed as asking that the
> code call sget directly but that is what has happened.
Not by me. What I'm saying is that the entire superblock-creating
machinery - all of it - is nothing but library helpers. With the
decision of when/how/if they are to be used being down to filesystem
driver. Your "first mount"/"additional mount" simply do not map
to anything universally applicable.
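The point that the superblock machinery is a set of helpers whose use is
decided by each filesystem can be sketched with a toy model. Everything
below is hypothetical userspace code: sget_model() only imitates the
shape of sget() (a lookup driven by a caller-supplied test callback),
and the two test functions are simplified stand-ins for what an
NFS-like and a block-device-like filesystem might consider "the same
instance".

```c
#include <stddef.h>
#include <string.h>

#define SB_MAX 8

/* What a mount request carries in this toy model. */
struct req_model {
	const char *source;  /* server name or block device */
	const char *path;    /* exported tree; unused for block devices */
	unsigned int wsize;  /* 0 means "default" */
};

struct super_model {
	char source[64];
	unsigned int wsize;
};

typedef int (*sb_test_fn)(const struct super_model *sb,
			  const struct req_model *req);

static struct super_model supers[SB_MAX];
static int nr_supers;

/* Generic helper: reuse whatever the caller's test accepts, else create. */
static struct super_model *sget_model(sb_test_fn test,
				      const struct req_model *req)
{
	struct super_model *sb;

	for (int i = 0; i < nr_supers; i++)
		if (test(&supers[i], req))
			return &supers[i];
	if (nr_supers == SB_MAX)
		return NULL;
	sb = &supers[nr_supers++];
	strncpy(sb->source, req->source, sizeof(sb->source) - 1);
	sb->source[sizeof(sb->source) - 1] = '\0';
	sb->wsize = req->wsize;
	return sb;
}

/* NFS-like policy: same server and same wsize share; the path is irrelevant. */
static int nfs_like_test(const struct super_model *sb,
			 const struct req_model *req)
{
	return strcmp(sb->source, req->source) == 0 && sb->wsize == req->wsize;
}

/* Block-device-like policy: same device always shares; options are not consulted. */
static int bdev_like_test(const struct super_model *sb,
			  const struct req_model *req)
{
	return strcmp(sb->source, req->source) == 0;
}
```

The generic helper never learns what "first" or "additional" means; the
answer lives entirely in the callback the filesystem passes in.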
> > Having something like a second callback for mount_bdev() that would
> > be called when we'd found an existing instance for the same block
> > device? Sure, no problem. Having a helper for doing such comparison
> > that would work in enough cases to bother, so that different fs
> > could avoid boilerplate in that callback? Again, more power to you.
>
> Normal forms etc. If we want to do that it just requires a wee bit of
> discipline. And if all of the option parsing is being rewritten and
> retested anyway I don't see why we can't do something like that as well.
> So it does not sound unreasonable to me.
See above.
* Re: BUG: Mount ignores mount options
2018-08-11 1:58 ` Al Viro
@ 2018-08-11 2:17 ` Al Viro
2018-08-11 4:43 ` Eric W. Biederman
2018-08-13 12:54 ` Miklos Szeredi
1 sibling, 1 reply; 70+ messages in thread
From: Al Viro @ 2018-08-11 2:17 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, John Johansen, Tejun Heo, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos
On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>
> > All I proposed was that we distinguish between a first mount and an
> > additional mount so that userspace knows the options will be ignored.
>
> For pity sake, just what does it take to explain to you that your
> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
> and may depend upon the pieces of state userland (especially in container)
> simply does not have?
>
> One more time, slowly:
>
> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
>
> yield the same superblock. Is anyone who mounts something over NFS
> required to know if anybody else has mounted something from the same
> server, and if so how the hell are they supposed to find that out,
> so that they could decide whether they are creating the "first" or
> "additional" mount, whatever that might mean in this situation?
>
> And how, kernel-side, is that supposed to be handled by generic code
> of any description?
>
> While we are at it,
> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
> is *NOT* the same superblock as the previous two.
s/as the previous two/as in the previous two cases/, that is - the first two
examples yield one superblock, this one - another.
* Re: BUG: Mount ignores mount options
2018-08-11 2:17 ` Al Viro
@ 2018-08-11 4:43 ` Eric W. Biederman
0 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 4:43 UTC (permalink / raw)
To: Al Viro
Cc: David Howells, John Johansen, Tejun Heo, selinux, Paul Moore,
Li Zefan, linux-api, apparmor, Casey Schaufler, fenghua.yu,
Greg Kroah-Hartman, Eric Biggers, linux-security-module,
Tetsuo Handa, Johannes Weiner, Stephen Smalley, tomoyo-dev-en,
cgroups, torvalds, linux-fsdevel, linux-kernel,
Theodore Y. Ts'o, Miklos
Al Viro <viro@ZenIV.linux.org.uk> writes:
> On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>>
>> > All I proposed was that we distinguish between a first mount and an
>> > additional mount so that userspace knows the options will be ignored.
>>
>> For pity sake, just what does it take to explain to you that your
>> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
>> and may depend upon the pieces of state userland (especially in container)
>> simply does not have?
>>
>> One more time, slowly:
>>
>> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
>> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
>>
>> yield the same superblock. Is anyone who mounts something over NFS
>> required to know if anybody else has mounted something from the same
>> server, and if so how the hell are they supposed to find that out,
>> so that they could decide whether they are creating the "first" or
>> "additional" mount, whatever that might mean in this situation?
>>
>> And how, kernel-side, is that supposed to be handled by generic code
>> of any description?
>>
>> While we are at it,
>> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
>> is *NOT* the same superblock as the previous two.
>
> s/as the previous two/as in the previous two cases/, that is - the first two
> examples yield one superblock, this one - another.
Exactly because the mount options differ.
I don't have a problem if we have something sophisticated like nfs that
handles all of the hairy details and does not reuse a superblock unless the
mount options match.
What I have a problem with is the helper for ordinary filesystems that
are not as sophisticated as nfs that don't handle all of the option
magic and give userspace something different from what userspace asked
for.
It may take a little generalization of the definitions I proposed but it
still remains simple and straightforward.
CMD_THESE_MOUNT_OPTIONS_NO_SURPRISES
CMD_WHATEVER_ALREADY_EXISTS
Or we can make the filesystems more sophisticated when we move
them to the new API and perform the comparisons there. I think
that is what David Howells is working on.
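One way to picture the choices on the table is a small decision
function over the three sharing behaviours David Howells listed earlier
in the thread (no sharing, sharing only with identical parameters,
sharing regardless). This is a sketch under assumptions: the enum
names, resolve_share_model() and the use of -EBUSY for refusals are
all hypothetical and do not come from the patchset.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for the three behaviours discussed in the thread. */
enum share_mode_model {
	SHARE_NEVER,         /* always want a fresh superblock */
	SHARE_IF_IDENTICAL,  /* reuse only when the parameters match */
	SHARE_WHATEVER,      /* reuse whatever exists (current mount(2) behaviour) */
};

/*
 * Returns 0 to create a new superblock, 1 to reuse the existing one,
 * or a negative errno when the request must be refused.
 */
static int resolve_share_model(enum share_mode_model mode,
			       bool sb_exists, bool params_match)
{
	if (!sb_exists)
		return 0;

	switch (mode) {
	case SHARE_NEVER:
		return -EBUSY;
	case SHARE_IF_IDENTICAL:
		return params_match ? 1 : -EBUSY;
	case SHARE_WHATEVER:
		return 1;
	}
	return -EINVAL;
}
```

How params_match is computed is the hard, filesystem-specific part;
this function only shows where that answer would be consumed.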
Eric
* Re: BUG: Mount ignores mount options
2018-08-11 1:46 ` Theodore Y. Ts'o
@ 2018-08-11 4:48 ` Eric W. Biederman
[not found] ` <8736vlo6ef.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-11 4:48 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Al Viro, David Howells, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Miklos
"Theodore Y. Ts'o" <tytso@mit.edu> writes:
> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>>
>> My complaint is that the current implemented behavior of practically
>> every filesystem in the kernel, is that it will ignore mount options
>> when mounted a second time.
>
> The file system is ***not*** mounted a second time.
>
> The design bug is that we allow bind mounts to be specified via a
> block device. A bind mount is not "a second mount" of the file
> system. Bind mounts != mounts.
>
> I had assumed we had allowed bind mounts to be specified via the block
> device because of container use cases. If the container folks don't
> want it, I would be pushing to simply not allow bind mounts to be
> specified via block device at all.
No it is not a container thing.
> The only reason why we should support it is because we don't want to
> break scripts; and if the goal is not to break scripts, then we have
> to keep to the current semantics, however broken you think it is.
But we don't have to support returning filesystems with mismatched mount
options in the new fsopen api. That is my concern. Confusing
userspace this way has been shown to be harmful; let's not keep doing it.
Eric
* Re: BUG: Mount ignores mount options
[not found] ` <87pnyphf8i.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-08-11 7:29 ` David Howells
2018-08-11 16:31 ` Andy Lutomirski
0 siblings, 1 reply; 70+ messages in thread
From: David Howells @ 2018-08-11 7:29 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Eric Biggers, Tetsuo Handa, dhowells-H+wXaHxf7aLQT0dZR+AlfA,
selinux-+05T5uksL2qpZYMLLGbcSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley,
fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Greg Kroah-Hartman,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
cgroups-u79uwXL29TY76Z2rM5mHXA, Theodore Y. Ts'o,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Yes, I agree it would be nice to have, but it *doesn't* really need
> > supporting right this minute, since what I have now oughtn't to break the
> > current behaviour.
>
> I am really reluctant to endorse anything that propagates the issues of
> the current interface in the new mount interface.
Do realise that your problem cannot be solved through fsopen() until every
filesystem is converted to the new fs_context-based sget() since the flag has
to make it from the VFS through the filesystem to sget().
I'm reluctant to add this flag until that time unless we error
out if the flag is set against a legacy filesystem.
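The "error out against a legacy filesystem" rule could look roughly like
the check below. This is only a sketch: struct fs_type_model, the
has_fs_context field, check_exclusive_model() and the choice of
-EOPNOTSUPP are assumptions made for illustration and are not taken
from the patches.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of a filesystem type for this sketch. */
struct fs_type_model {
	const char *name;
	bool has_fs_context;  /* false: still handled by the legacy wrapper */
};

/*
 * A no-sharing request can only be honoured if the flag can travel from
 * the VFS through the filesystem to its superblock lookup, so refuse it
 * for filesystems that have not been converted.
 */
static int check_exclusive_model(const struct fs_type_model *fs,
				 bool want_exclusive)
{
	if (want_exclusive && !fs->has_fs_context)
		return -EOPNOTSUPP;
	return 0;
}
```

Failing early keeps userspace from believing it obtained a guarantee
that an unconverted filesystem silently dropped.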
David
* Re: BUG: Mount ignores mount options
2018-08-11 7:29 ` David Howells
@ 2018-08-11 16:31 ` Andy Lutomirski
[not found] ` <9B6E2781-484B-4C42-95F5-F900EA36CEA5-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-11 16:31 UTC (permalink / raw)
To: David Howells
Cc: Eric W. Biederman, viro, John Johansen, Tejun Heo, selinux,
Paul Moore, Li Zefan, linux-api, apparmor, Casey Schaufler,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Theodore
> On Aug 11, 2018, at 12:29 AM, David Howells <dhowells@redhat.com> wrote:
>
> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>>> Yes, I agree it would be nice to have, but it *doesn't* really need
>>> supporting right this minute, since what I have now oughtn't to break the
>>> current behaviour.
>>
>> I am really reluctant to endorse anything that propagates the issues of
>> the current interface in the new mount interface.
>
> Do realise that your problem cannot be solved through fsopen() until every
> filesystem is converted to the new fs_context-based sget() since the flag has
> to make it from the VFS through the filesystem to sget().
>
> I'm reluctant to add this flag till that point until that time unless we error
> out if the flag is set against a legacy filesystem.
>
>
I don’t see why we need all this fancy “do the options match” stuff. For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen, we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request.
IOW I see no compelling reason to call sget() at all from the new API. The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.
As another way of looking at it: for a network filesystem, mounting the same target ip and path from two different Linux machines works, so mounting it twice from the same machine should also work. But mounting the same underlying ext4 block device from two different Linux machines (using nbd, iscsi, etc) would be a catastrophe, so I see no reason that it needs to be supported if it’s two mounts from one machine.
The case folding example is interesting, and I think it should probably have a slightly different API. A program could open_tree a nocasefold mount and then make a request to create what is functionally a bind mount but with different options.
mount(8) will presumably just keep using mount(2).
* Re: BUG: Mount ignores mount options
[not found] ` <9B6E2781-484B-4C42-95F5-F900EA36CEA5-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
@ 2018-08-11 16:51 ` Al Viro
0 siblings, 0 replies; 70+ messages in thread
From: Al Viro @ 2018-08-11 16:51 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Eric Biggers, Tetsuo Handa, David Howells,
selinux-+05T5uksL2qpZYMLLGbcSA,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS, Paul Moore,
Miklos Szeredi, Stephen Smalley,
fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg, Greg Kroah-Hartman,
cgroups-u79uwXL29TY76Z2rM5mHXA, Theodore Y. Ts'o,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Eric W. Biederman, Johannes Weiner,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On Sat, Aug 11, 2018 at 09:31:29AM -0700, Andy Lutomirski wrote:
> I don’t see why we need all this fancy “do the options match” stuff. For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen, we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request.
> IOW I see no compelling reason to call sget() at all from the new API. The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.
May I politely suggest that the esteemed participants of that conversation
RTFS? Yes, I know that it's less fun than talking about your
rather vague ideas of how the things (surely) work, but it just might
avoid the feats of idiocy like the above.
Andy, I don't know how to put it more plainly: read the fucking source.
Even grep would do. The same NFS you've granted (among the "handful"
of filesystems) an exception, *DOES* *CALL* *THE* *FUCKING* sget().
Yes, really. And in some obscure[1] cases (including the one mentioned
upthread) it does reuse a pre-existing superblock. For a very good
reason.
[1] such as, oh, mounting two filesystems from the same server with
default options - who would've ever thought of doing something so
perverted?
* Re: BUG: Mount ignores mount options
[not found] ` <8736vlo6ef.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2018-08-11 17:47 ` Casey Schaufler
2018-08-15 4:03 ` Eric W. Biederman
0 siblings, 1 reply; 70+ messages in thread
From: Casey Schaufler @ 2018-08-11 17:47 UTC (permalink / raw)
To: Eric W. Biederman, Theodore Y. Ts'o
Cc: Eric Biggers, Tetsuo Handa, David Howells,
selinux-+05T5uksL2qpZYMLLGbcSA, Paul Moore, Miklos Szeredi,
Stephen Smalley, fenghua.yu-ral2JQCrhuEAvxtiuMwx3w,
apparmor-nLRlyDuq1AZFpShjVBNYrg,
tomoyo-dev-en-5NWGOfrQmneRv+LV9MX5uooqe+aC9MnS,
Greg Kroah-Hartman, Al Viro, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-security-module-u79uwXL29TY76Z2rM5mHXA, Li Zefan,
Johannes Weiner, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
On 8/10/2018 9:48 PM, Eric W. Biederman wrote:
> "Theodore Y. Ts'o" <tytso@mit.edu> writes:
>
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>>> My complaint is that the current implemented behavior of practically
>>> every filesystem in the kernel, is that it will ignore mount options
>>> when mounted a second time.
>> The file system is ***not*** mounted a second time.
>>
>> The design bug is that we allow bind mounts to be specified via a
>> block device. A bind mount is not "a second mount" of the file
>> system. Bind mounts != mounts.
>>
>> I had assumed we had allowed bind mounts to be specified via the block
>> device because of container use cases. If the container folks don't
>> want it, I would be pushing to simply not allow bind mounts to be
>> specified via block device at all.
> No it is not a container thing.
Inigo: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
Rugen: "Stop saying that!"
Eric: "It is not a container thing."
Casey: "Stop saying that!"
Yes, Virginia, it *is* a container thing. Your container manager expects all
filesystems to be server-client based. It makes bad assumptions. It is doing
things that we would fire a sysadmin for doing. Don't blame the filesystems
for behaving as documented. Export the filesystem using NFS and mount it
using the NFS mechanism, which is designed to do what you're asking for. The
problem is not in the mount mechanism, it's in the way you want to abuse it.
>> The only reason why we should support it is because we don't want to
>> break scripts; and if the goal is not to break scripts, then we have
>> to keep to the current semantics, however broken you think it is.
> But we don't have to support returning filesystems with mismatched mount
> options in the new fsopen api. That is my concern. Confusing
> userspace this way has been shown to be harmful let's not keep doing it.
It's not "userspace" that's confused. Developers of userspace code
implementing system behavior (e.g. systemd, container managers) need to
understand how the system works. The container manager needs to know
that it can't mount filesystems with different options. That's the kind
of thing "managers" do. If it has to go to the mount table and check
on how the device is already mounted before doing a mount, so be it.
Unless, of course, you want the concept of "container" introduced into
the kernel. There's a whole lot of feldercarb that container managers
have to deal with that would be lots easier to deal with down below.
I'm not advocating that, and I understand the arguments against it.
On the other hand, if you want a platform that is optimized for a
container environment ...
> Eric
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-09 14:24 ` David Howells
` (2 preceding siblings ...)
2018-08-09 16:33 ` David Howells
@ 2018-08-11 20:20 ` David Howells
2018-08-11 23:26 ` Andy Lutomirski
3 siblings, 1 reply; 70+ messages in thread
From: David Howells @ 2018-08-11 20:20 UTC (permalink / raw)
To: Miklos Szeredi
Cc: dhowells, Eric W. Biederman, Al Viro, Linux API, Linus Torvalds,
linux-fsdevel, linux-kernel
Miklos Szeredi <miklos@szeredi.hu> wrote:
> You can determine at fsopen() time whether the filesystem is able to
> support the O_EXCL behavior? If so, then it's trivial to enable this
> conditionally. I think that's what Eric is asking for, it's obviously
> not fair to ask for a change in behavior of the legacy interface.
It's not trivial, see btrfs and nfs :-/
David
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: [PATCH 28/33] vfs: syscall: Add fsconfig() for configuring and managing a context [ver #11]
2018-08-11 20:20 ` David Howells
@ 2018-08-11 23:26 ` Andy Lutomirski
0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-11 23:26 UTC (permalink / raw)
To: David Howells
Cc: Miklos Szeredi, Eric W. Biederman, Al Viro, Linux API,
Linus Torvalds, Linux FS Devel, LKML
On Sat, Aug 11, 2018 at 1:20 PM, David Howells <dhowells@redhat.com> wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> You can determine at fsopen() time whether the filesystem is able to
>> support the O_EXCL behavior? If so, then it's trivial to enable this
>> conditionally. I think that's what Eric is asking for, it's obviously
>> not fair to ask for a change in behavior of the legacy interface.
>
> It's not trivial, see btrfs and nfs :-/
>
I'm not convinced that btrfs and nfs are the same situation. As far
as I can tell, in NFS's case, NFS shares superblocks as an
implementation detail. With Al's example, someone can do:
mount -t nfs4 wank.example.org:/foo/bar /mnt/a
mount -t nfs4 wank.example.org:/baz/barf /mnt/b
mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
or equivalently create three fscontexts and FSCONFIG_CMD_CREATE all of
them, and the kernel creates one superblock for /mnt/a and /mnt/b and
a second one for /mnt/c. That seems like a good optimization, but I
think it really is just an optimization. In any sane implementation,
all three calls should succeed, and it should in general be possible
to create as many totally fresh mounts of the same network file system
as anyone wants.
Given this example, I think that it may be important to give
FSCONFIG_CMD_RECONFIGURE a very clear definition, and possibly a
definition that doesn't use the word superblock. After all, if
someone does FSCONFIG_CMD_RECONFIGURE on /mnt/a, if it really
reconfigures a *superblock*, then it will change /mnt/b as a side
effect but will not change /mnt/c. This seems like a mistake.
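To make the sharing concrete, here is a toy userspace model in C. It is only a sketch under stated assumptions: toy_sb, toy_get_sb() and toy_reconfigure() are made-up names for illustration, not kernel interfaces, and real NFS compares far more than a server name and an option string when deciding whether to share.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: these names are hypothetical, not kernel interfaces. */
#define TOY_MAX_SB 8

struct toy_sb {
	char server[64];	/* e.g. "wank.example.org" */
	char options[64];	/* options that apply to the whole superblock */
};

static struct toy_sb toy_sbs[TOY_MAX_SB];
static int toy_nr_sbs;

/*
 * Find a superblock with the same server and options, or create one.
 * Returns its index, or -1 if the table is full.
 */
static int toy_get_sb(const char *server, const char *options)
{
	int i;

	for (i = 0; i < toy_nr_sbs; i++)
		if (!strcmp(toy_sbs[i].server, server) &&
		    !strcmp(toy_sbs[i].options, options))
			return i;	/* share the existing superblock */

	if (toy_nr_sbs == TOY_MAX_SB)
		return -1;
	i = toy_nr_sbs++;
	snprintf(toy_sbs[i].server, sizeof(toy_sbs[i].server), "%s", server);
	snprintf(toy_sbs[i].options, sizeof(toy_sbs[i].options), "%s", options);
	return i;
}

/* Reconfiguring a superblock is seen by every mount that shares it. */
static void toy_reconfigure(int sb, const char *options)
{
	assert(sb >= 0 && sb < toy_nr_sbs);
	snprintf(toy_sbs[sb].options, sizeof(toy_sbs[sb].options), "%s", options);
}
```

In this model the requests behind /mnt/a and /mnt/b get the same index, the request behind /mnt/c gets a different one, and calling toy_reconfigure() on the index behind /mnt/a changes what /mnt/b sees but leaves /mnt/c alone.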
But I think that btrfs is quite a bit different. With btrfs, I can do:
mount -t btrfs /dev/sda1 -o subvol=a /mnt/a
mount -t btrfs /dev/sda1 -o subvol=b /mnt/b
and I get two mounts, each pointing at a different subvolume, that
(I'm pretty sure) share a superblock. If I then do:
mount -t btrfs /dev/sda1 -o subvol=c,foo=bar /mnt/c
where foo is a per-superblock option, it probably gets ignored. If I
set up /dev/mapper/foo as a linear alias for /dev/sda1 and I do:
mount -t btrfs /dev/mapper/foo -o subvol=d /mnt/d
then I get a fresh superblock. If /dev/sda1 is still mounted and the
various O_EXCL-like checks don't catch it, then I get massive
corruption.
The btrfs case seems quite fragile to me, and it seems like a bit of
an abuse of mount(2). (Of course, basically everything anyone does
with mount(2) is a bit of an abuse.)
I would hope that the new fs mounting API would clean this up. The
NFS case seems just fine, but for btrfs, it seems like maybe the whole
CMD_CREATE operation should be more fine grained. There seem to be
*two* actions going on in a btrfs mount. First there's the act of
instantiating the filesystem driver backed by the device (I think this
is open_ctree()), and *then* there's the act of instantiating a dentry
tree pointing at some subvolume, etc.
ZFS seems to handle this quite nicely. First you fire up a zpool, and
then you start mounting its volumes.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-11 1:58 ` Al Viro
2018-08-11 2:17 ` Al Viro
@ 2018-08-13 12:54 ` Miklos Szeredi
1 sibling, 0 replies; 70+ messages in thread
From: Miklos Szeredi @ 2018-08-13 12:54 UTC (permalink / raw)
To: Al Viro
Cc: Eric W. Biederman, David Howells, John Johansen, Tejun Heo,
selinux, Paul Moore, Li Zefan, Linux API, apparmor,
Casey Schaufler, fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
LSM, Tetsuo Handa, Johannes Weiner, Stephen Smalley,
tomoyo-dev-en, cgroups, Linus Torvalds, linux-fsdevel
On Sat, Aug 11, 2018 at 3:58 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> What I'm saying is that the entire superblock-creating
> machinery - all of it - is nothing but library helpers. With the
> decision of when/how/if they are to be used being down to filesystem
> driver. Your "first mount"/"additional mount" simply do not map
> to anything universally applicable.
Why so? (Note: using the "mount" terminology here is fundamentally
broken to start with, mounts have nothing to do with this...
"Filesystem instance" is a better term.)
You bring up NFS as an example, but creating and/or reusing an nfs
client instance connected to a certain server is certainly a clear and
well defined concept.
The question becomes: does it make sense to generalize this concept
and export it to userspace with the new API?
You know the Plan 9 fs interface much better, but to me it looks like
there's a separate namespace for filesystem instances, and the mount
command just refers to such an instance. So there's no comparing of
options or any such horror, just the need to explicitly instantiate a
new instance when necessary. Doesn't sound very difficult to
implement in the new API.
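A rough userspace sketch of that idea, in C, might look like the following. Everything in it is an assumption made for illustration: toy_instance_create(), toy_instance_lookup() and toy_mount() are hypothetical names, not proposed syscalls, and the instance names are arbitrary strings. The point is only that options are supplied at explicit creation time and that mounting merely refers to an existing instance.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: these names are hypothetical, not kernel interfaces. */
#define TOY_MAX_INST 8

struct toy_instance {
	char name[64];
	char options[64];
};

static struct toy_instance toy_insts[TOY_MAX_INST];
static int toy_nr_insts;

/* Look up an instance by name: its index, or -ENOENT if there is none. */
static int toy_instance_lookup(const char *name)
{
	int i;

	for (i = 0; i < toy_nr_insts; i++)
		if (!strcmp(toy_insts[i].name, name))
			return i;
	return -ENOENT;
}

/*
 * Explicit, exclusive creation.  Options are accepted only here, so an
 * existing instance is never silently reused with different options.
 */
static int toy_instance_create(const char *name, const char *options)
{
	int i;

	assert(name && options);
	if (toy_instance_lookup(name) >= 0)
		return -EEXIST;
	if (toy_nr_insts == TOY_MAX_INST)
		return -ENOSPC;
	i = toy_nr_insts++;
	snprintf(toy_insts[i].name, sizeof(toy_insts[i].name), "%s", name);
	snprintf(toy_insts[i].options, sizeof(toy_insts[i].options), "%s", options);
	return i;
}

/* "Mounting" just names an instance; it takes no options to compare. */
static int toy_mount(const char *name)
{
	return toy_instance_lookup(name);
}
```

A second toy_instance_create() with the same name fails with -EEXIST instead of quietly handing back the first instance, which is the O_EXCL-like behaviour discussed earlier in the thread.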
Thanks,
Miklos
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-10 20:06 ` Andy Lutomirski
2018-08-10 20:46 ` Theodore Y. Ts'o
@ 2018-08-13 16:35 ` Alan Cox
2018-08-13 16:48 ` Andy Lutomirski
1 sibling, 1 reply; 70+ messages in thread
From: Alan Cox @ 2018-08-13 16:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Theodore Y. Ts'o, David Howells, Eric W. Biederman, Al Viro,
John Johansen, Tejun Heo, SELinux-NSA, Paul Moore, Li Zefan,
Linux API, apparmor, Casey Schaufler, Fenghua Yu,
Greg Kroah-Hartman, Eric Biggers, LSM List, Tetsuo Handa,
Johannes Weiner, Stephen Smalley, tomoyo-dev-en
> If the same block device is visible, with rw access, in two different
> containers, I don't see how anything good can happen. Sure, with the
At the raw level there are lots of use cases involving high performance
data capture, media streaming and the like.
At the file system layer you can use GFS2 for example.
So there are cases where it's possible. There are even cases where it's
actually useful at the filesystem level although not many I agree.
Alan
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-13 16:35 ` Alan Cox
@ 2018-08-13 16:48 ` Andy Lutomirski
2018-08-13 17:29 ` Al Viro
0 siblings, 1 reply; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-13 16:48 UTC (permalink / raw)
To: Alan Cox
Cc: Andy Lutomirski, Theodore Y. Ts'o, David Howells,
Eric W. Biederman, Al Viro, John Johansen, Tejun Heo, SELinux-NSA,
Paul Moore, Li Zefan, Linux API, apparmor, Casey Schaufler,
Fenghua Yu, Greg Kroah-Hartman, Eric Biggers, LSM List,
Tetsuo Handa, Johannes Weiner, Stephen Smalley
On Mon, Aug 13, 2018 at 9:35 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>> If the same block device is visible, with rw access, in two different
>> containers, I don't see how anything good can happen. Sure, with the
>
> At the raw level there are lots of use cases involving high performance
> data capture, media streaming and the like.
>
> At the file system layer you can use GFS2 for example.
Ugh. I even thought of this case, and I should have been a bit more precise:
I would consider the GFS2 case to be essentially equivalent to the NFS
case. I think we can probably divide all the filesystems into three
or four types:
pseudo file systems: Multiple instantiations of the same fs driver
pointing at the same backing store give separate filesystems. (Same
backing store includes the case where there isn't any backing store.)
tmpfs is an example. This isn't particularly interesting.
network-like file systems: Multiple instantiations of the same fs
driver pointing at the same backing store are expected. This includes
NFS, GFS2, AFS, CIFS, etc. This is only really interesting to the
extent that, if the fs driver internally wants to share state between
multiple instantiations, it should be smart enough to make sure the
options are compatible or that it can otherwise handle mismatched
options correctly. NFS does this right.
non-network-like filesystems: There are complicated ones like btrfs
and ZFS and simple ones like ext4. In either case, multiple totally
separate instantiations of the driver sharing the backing store will
lead to corruption. In cases like ext4, we seem to support it for
legacy reasons, because we're afraid that there are scripts that try
to mount the same block device more than once, and I think the new API
has no need to support this. In cases like btrfs, we also seem to
support multiple user requests for "mounts" with the same underlying
block devices because we need it for full functionality. But I think
this is because our API is wrong.
Are there cases I'm missing? It sounds like the API could be improved
to fully model the last case, and everything will work nicely.
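As a sketch of how that split could drive behaviour, here is a small C model. It is not an existing interface: the enum names and toy_instantiate() are invented for illustration, and refusing a mismatched request for a local filesystem is an assumption based on Eric's request earlier in the thread, not something the current API does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not kernel interfaces. */
enum toy_fs_class {
	TOY_FS_PSEUDO,		/* tmpfs and friends */
	TOY_FS_NETWORK_LIKE,	/* NFS, GFS2, AFS, CIFS, ... */
	TOY_FS_LOCAL,		/* ext4, btrfs, ZFS, ... */
};

enum toy_action {
	TOY_NEW_INSTANCE,	/* start a fresh driver instance */
	TOY_SHARE_INSTANCE,	/* reuse the instance already running */
	TOY_REFUSE,		/* fail rather than ignore the options */
};

/*
 * Decide what a request to instantiate a filesystem should do, given
 * whether the backing store is already in use by this driver and whether
 * the requested options match the running instance.
 */
static enum toy_action toy_instantiate(enum toy_fs_class cls, bool in_use,
				       bool options_match)
{
	switch (cls) {
	case TOY_FS_PSEUDO:
		/* Every instantiation is a separate filesystem. */
		return TOY_NEW_INSTANCE;
	case TOY_FS_NETWORK_LIKE:
		/* Sharing is only an optimization; a new instance is safe. */
		return (in_use && options_match) ? TOY_SHARE_INSTANCE
						 : TOY_NEW_INSTANCE;
	case TOY_FS_LOCAL:
		/* Two separate instances on one device would corrupt it. */
		if (!in_use)
			return TOY_NEW_INSTANCE;
		return options_match ? TOY_SHARE_INSTANCE : TOY_REFUSE;
	}
	assert(!"unknown filesystem class");
	return TOY_REFUSE;
}
```

The only class where a mismatch has to surface as an error is the last one, which is where the legacy mount(2) behaviour currently ignores the options.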
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-13 16:48 ` Andy Lutomirski
@ 2018-08-13 17:29 ` Al Viro
2018-08-13 19:00 ` James Morris
0 siblings, 1 reply; 70+ messages in thread
From: Al Viro @ 2018-08-13 17:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Alan Cox, Theodore Y. Ts'o, David Howells, Eric W. Biederman,
John Johansen, Tejun Heo, SELinux-NSA, Paul Moore, Li Zefan,
Linux API, apparmor, Casey Schaufler, Fenghua Yu,
Greg Kroah-Hartman, Eric Biggers, LSM List, Tetsuo Handa,
Johannes Weiner, Stephen Smalley, tomoyo-dev-en
On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
> I would consider the GFS2 case to be essentially equivalent to the NFS
> case. I think we can probably divide all the filesystems into three
> or four types:
>
> pseudo file systems: Multiple instantiations of the same fs driver
> pointing at the same backing store give separate filesystems. (Same
> backing store includes the case where there isn't any backing store.)
> tmpfs is an example. This isn't particularly interesting.
>
> network-like file systems: Multiple instantiations of the same fs
> driver pointing at the same backing store are expected. This includes
> NFS, GFS2, AFS, CIFS, etc. This is only really interesting to the
> extent that, if the fs driver internally wants to share state between
> multiple instantiations, it should be smart enough to make sure the
> options are compatible or that it can otherwise handle mismatched
> options correctly. NFS does this right.
>
> non-network-like filesystems: There are complicated ones like btrfs
> and ZFS and simple ones like ext4. In either case, multiple totally
> separate instantiations of the driver sharing the backing store will
> lead to corruption. In cases like ext4, we seem to support it for
> legacy reasons, because we're afraid that there are scripts that try
> to mount the same block device more than once, and I think the new API
> has no need to support this. In cases like btrfs, we also seem to
> support multiple user requests for "mounts" with the same underlying
> block devices because we need it for full functionality. But I think
> this is because our API is wrong.
>
> Are there cases I'm missing? It sounds like the API could be improved
> to fully model the last case, and everything will work nicely.
You know, that's starting to remind me of this little gem of Borges:
http://www.alamut.com/subj/artiface/language/johnWilkins.html
Especially the delightful (fake) quote contained in there:
[...] it is written that the animals are divided into:
(a) belonging to the emperor,
(b) embalmed,
(c) tame,
(d) sucking pigs,
(e) sirens,
(f) fabulous,
(g) stray dogs,
(h) included in the present classification,
(i) frenzied,
(j) innumerable,
(k) drawn with a very fine camelhair brush,
(l) et cetera,
(m) having just broken the water pitcher,
(n) that from a long way off look like flies.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-13 17:29 ` Al Viro
@ 2018-08-13 19:00 ` James Morris
2018-08-13 19:20 ` Casey Schaufler
2018-08-15 23:29 ` Serge E. Hallyn
0 siblings, 2 replies; 70+ messages in thread
From: James Morris @ 2018-08-13 19:00 UTC (permalink / raw)
To: Al Viro
Cc: Andy Lutomirski, Alan Cox, Theodore Y. Ts'o, David Howells,
Eric W. Biederman, John Johansen, Tejun Heo, SELinux-NSA,
Paul Moore, Li Zefan, Linux API, apparmor, Casey Schaufler,
Fenghua Yu, Greg Kroah-Hartman, Eric Biggers, LSM List,
Tetsuo Handa, Johannes Weiner, Stephen Smalley <sd>
On Mon, 13 Aug 2018, Al Viro wrote:
> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
> > Are there cases I'm missing? It sounds like the API could be improved
> > to fully model the last case, and everything will work nicely.
>
> You know, that's starting to remind of this little gem of Borges:
> http://www.alamut.com/subj/artiface/language/johnWilkins.html
> Especially the delightful (fake) quote contained in there:
> [...] it is written that the animals are divided into:
> (a) belonging to the emperor,
> (b) embalmed,
> (c) tame,
> (d) sucking pigs,
> (e) sirens,
> (f) fabulous,
> (g) stray dogs,
> (h) included in the present classification,
> (i) frenzied,
> (j) innumerable,
> (k) drawn with a very fine camelhair brush,
> (l) et cetera,
> (m) having just broken the water pitcher,
> (n) that from a long way off look like flies.
Coincidentally, this was also the model for Linux capabilities.
--
James Morris
<jmorris@namei.org>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-13 19:00 ` James Morris
@ 2018-08-13 19:20 ` Casey Schaufler
2018-08-15 23:29 ` Serge E. Hallyn
1 sibling, 0 replies; 70+ messages in thread
From: Casey Schaufler @ 2018-08-13 19:20 UTC (permalink / raw)
To: James Morris, Al Viro
Cc: Andy Lutomirski, Alan Cox, Theodore Y. Ts'o, David Howells,
Eric W. Biederman, John Johansen, Tejun Heo, SELinux-NSA,
Paul Moore, Li Zefan, Linux API, apparmor, Fenghua Yu,
Greg Kroah-Hartman, Eric Biggers, LSM List, Tetsuo Handa,
Johannes Weiner, Stephen Smalley, tomoyo-dev-en
On 8/13/2018 12:00 PM, James Morris wrote:
> On Mon, 13 Aug 2018, Al Viro wrote:
>
>> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
>>> Are there cases I'm missing? It sounds like the API could be improved
>>> to fully model the last case, and everything will work nicely.
>> You know, that's starting to remind of this little gem of Borges:
>> http://www.alamut.com/subj/artiface/language/johnWilkins.html
>> Especially the delightful (fake) quote contained in there:
>> [...] it is written that the animals are divided into:
>> (a) belonging to the emperor,
>> (b) embalmed,
>> (c) tame,
>> (d) sucking pigs,
>> (e) sirens,
>> (f) fabulous,
>> (g) stray dogs,
>> (h) included in the present classification,
>> (i) frenzied,
>> (j) innumerable,
>> (k) drawn with a very fine camelhair brush,
>> (l) et cetera,
>> (m) having just broken the water pitcher,
>> (n) that from a long way off look like flies.
>
> Coincidentally, this was also the model for Linux capabilities.
Linux capabilities are POSIX capabilities which are modeled closely
to accommodate the historical behavior manifest in the P1003.1 specification.
So except for (c), (f) and (k) you can use this characterization.
On a slightly more serious note, there's a lot of Linux, mount semantics
included, that has grown organically and that isn't quite up to the
usage models they are being applied to. I applaud David's work in part
because it may make it possible to accommodate more of those cases going
forward.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-11 17:47 ` Casey Schaufler
@ 2018-08-15 4:03 ` Eric W. Biederman
0 siblings, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-15 4:03 UTC (permalink / raw)
To: Casey Schaufler
Cc: Theodore Y. Ts'o, Al Viro, David Howells, John Johansen,
Tejun Heo, selinux, Paul Moore, Li Zefan, linux-api, apparmor,
fenghua.yu, Greg Kroah-Hartman, Eric Biggers,
linux-security-module, Tetsuo Handa, Johannes Weiner,
Stephen Smalley, tomoyo-dev-en, cgroups, torvalds, linux-fsdevel,
linux-kernel, Miklos Szeredi <miklo>
Casey Schaufler <casey@schaufler-ca.com> writes:
> Don't blame the filesystems for behaving as documented.
No. This behavior is not documented. At least I certainly don't see a
word about this in any of the man pages. Where does it say mounting a
filesystem will not honor its mount options?
It is also rare enough in practice that it is reasonable to expect
people to be surprised by it.
> The problem is not in the mount mechanism, it's in the way you want to
> abuse it.
I am not asking for this behavior. I am pointing out this behavior
exists. I am pointing out this behavior is harmful. I am asking we
stop doing this harmful thing in the new API where we don't have a
chance of breaking anything.
The place where this has bitten the hardest is someone wrote a script to
do something for Xen in a chroot. That script involved a chroot that
mounted devpts and in doing so happened to change the options of the main
/dev/pts. Which resulted in ptys created with /dev/ptmx outside the
chroot with the wrong permissions. That in turn caused several distros
to retain the ancient suid pt_chown binary from libc that the devpts
filesystem was built to make obsolete. As the world turned that
pt_chown binary could be confused into chowning the wrong pty if a pty
from a container was used.
The fix was to mount a new instance of devpts every time mount of devpts
is called. That simplified the code, and allowed pt_chown to be removed
permanently. The tricky bit was figuring out how to keep /dev/ptmx
working. I wound up testing on every distribution I could think of to
ensure no one would notice the slightly changed behavior of the devpts
filesystem.
The behavior in other filesystems of ignoring the options instead of
changing them on the filesystem isn't quite as bad. But it still has
the potential for a lot of mischief.
Eric
^ permalink raw reply [flat|nested] 70+ messages in thread
* Should we split the network filesystem setup into two phases?
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
` (7 preceding siblings ...)
2018-08-10 15:11 ` David Howells
@ 2018-08-15 16:31 ` David Howells
2018-08-15 16:51 ` Andy Lutomirski
` (2 more replies)
8 siblings, 3 replies; 70+ messages in thread
From: David Howells @ 2018-08-15 16:31 UTC (permalink / raw)
To: trond.myklebust, anna.schumaker, sfrench, steved, viro
Cc: dhowells, torvalds, Eric W. Biederman, linux-api,
linux-security-module, linux-fsdevel, linux-kernel, linux-nfs,
linux-cifs, linux-afs, ceph-devel, v9fs-developer
Having just re-ported NFS on top of the new mount API stuff, I find that I
don't really like the idea of superblocks being separated by communication
parameters - especially when it might seem reasonable to be able to adjust
those parameters.
Does it make sense to abstract out the remote peer and allow (a) that to be
configured separately from any superblocks using it and (b) that to be used to
create superblocks?
Note that what a 'remote peer' is would be different for different
filesystems:
(*) For NFS, it would probably be a named server, with address(es) attached
to the name. In lieu of actually having a name, the initial IP address
could be used.
(*) For CIFS, it would probably be a named server. I'm not sure if CIFS
allows an abstraction for a share that can move about inside a domain.
(*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
are a matter of direction from the Volume Location server.
(*) For 9P and Ceph, I don't really know.
What could be configured? Well, addresses, ports, timeouts. Maybe protocol
level negotiation - though not being able to explicitly specify, say, the
particular version and minorversion on an NFS share would be problematic for
backward compatibility.
One advantage it could give us is that, if someone asks for server X, it might
be easier to query userspace in some way for what the default parameters for X
are.
What might this look like in terms of userspace? Well, we could overload the
new mount API:
peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2", 0);
fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122", 0);
fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3", 0);
fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001", 0);
fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
fs = fsopen("nfs", 0);
fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
m = fsmount(fs, 0, 0);
[Note that Eric's oft-repeated point about the 'creation' operation altering
established parameters still stands here.]
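The reason for wanting the peer as a separate object can be shown with a small userspace model in C. This is only a sketch: toy_peer, toy_sb and the helper functions are hypothetical names, "/scratch" is an invented second export, and nothing here reflects how the NFS client is actually structured. It simply shows communication parameters living in one shared object instead of being part of what distinguishes superblocks.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only: these names are hypothetical, not kernel interfaces. */
struct toy_peer {
	char name[64];		/* e.g. "server.home" */
	int timeo;		/* a communication parameter */
};

struct toy_sb {
	struct toy_peer *peer;	/* shared, not copied */
	char source[64];	/* e.g. "/home/dhowells" */
};

/* Set up a peer once, independently of any superblock. */
static void toy_peer_init(struct toy_peer *peer, const char *name, int timeo)
{
	assert(peer && name);
	snprintf(peer->name, sizeof(peer->name), "%s", name);
	peer->timeo = timeo;
}

/* Create a superblock that refers to an already configured peer. */
static void toy_sb_init(struct toy_sb *sb, struct toy_peer *peer,
			const char *source)
{
	assert(sb && peer && source);
	sb->peer = peer;
	snprintf(sb->source, sizeof(sb->source), "%s", source);
}

/* A superblock reads communication parameters through its peer. */
static int toy_sb_timeo(const struct toy_sb *sb)
{
	return sb->peer->timeo;
}
```

With two superblocks initialized against the same peer, changing the peer's timeo once is visible through both of them, and neither superblock has to be split off or recreated because a communication parameter changed.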
You could also then reopen it for configuration, maybe by:
peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);
or:
peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);
though it might be better to give it its own syscall:
peer = fspeer("nfs", "server.home", O_CLOEXEC);
fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
...
fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
In terms of alternative interfaces, I'm not sure how easy it would be to make
it like cgroups where you go and create a dir in a special filesystem, say,
"/sys/peers/nfs", because the peer records and names would have to be network
namespaced. Also, it might make it more difficult to use to create a root fs.
On the other hand, being able to adjust the peer configuration by:
echo 71 >/sys/peers/nfs/server.home/timeo
does have a certain appeal.
Also, netlink might be the right option, but I'm not sure how you'd pin the
resultant object whilst you make use of it.
A further thought: is it worth making this idea more general so that it also
encompasses non-network devices? This would run into issues of some logical
sources being visible across some namespaces but not others.
David
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Should we split the network filesystem setup into two phases?
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
@ 2018-08-15 16:51 ` Andy Lutomirski
2018-08-16 3:51 ` Steve French
2018-08-16 5:06 ` Eric W. Biederman
2 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-15 16:51 UTC (permalink / raw)
To: David Howells
Cc: trond.myklebust, anna.schumaker, sfrench, steved, viro, torvalds,
Eric W. Biederman, linux-api, linux-security-module,
linux-fsdevel, linux-kernel, linux-nfs, linux-cifs, linux-afs,
ceph-devel, v9fs-developer
> On Aug 15, 2018, at 9:31 AM, David Howells <dhowells@redhat.com> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
...
I think this looks rather nice. But maybe you should generalize the concept of “peer” so that it works for btrfs too. In the case where you mount two different subvolumes, you’re creating a *something*, and you’re then creating a filesystem that references it. It’s almost the same thing.
>
>
>
> fs = fsopen("nfs", 0);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
As you mention below, this seems like it might have namespacing issues.
>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
> echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>
My suggestion would be to avoid giving these things names at all. I think that referring to them by fd should be sufficient, especially if you allow them to be reopened based on a mount that uses them and allow them to get bind-mounted somewhere a la namespaces to make them permanent if needed.
> A further thought is that is it worth making this idea more general and
> encompassing non-network devices also? This would run into issues of some
> logical sources being visible across namespaces and but not others.
Indeed :)
It probably pays to rope a btrfs person into this discussion.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: BUG: Mount ignores mount options
2018-08-13 19:00 ` James Morris
2018-08-13 19:20 ` Casey Schaufler
@ 2018-08-15 23:29 ` Serge E. Hallyn
1 sibling, 0 replies; 70+ messages in thread
From: Serge E. Hallyn @ 2018-08-15 23:29 UTC (permalink / raw)
To: James Morris
Cc: Al Viro, Andy Lutomirski, Alan Cox, Theodore Y. Ts'o,
David Howells, Eric W. Biederman, John Johansen, Tejun Heo,
SELinux-NSA, Paul Moore, Li Zefan, Linux API, apparmor,
Casey Schaufler, Fenghua Yu, Greg Kroah-Hartman, Eric Biggers,
LSM List, Tetsuo Handa, Johannes Weiner <hanne>
Quoting James Morris (jmorris@namei.org):
> On Mon, 13 Aug 2018, Al Viro wrote:
>
> > On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
>
> > > Are there cases I'm missing? It sounds like the API could be improved
> > > to fully model the last case, and everything will work nicely.
> >
> > You know, that's starting to remind of this little gem of Borges:
> > http://www.alamut.com/subj/artiface/language/johnWilkins.html
> > Especially the delightful (fake) quote contained in there:
> > [...] it is written that the animals are divided into:
> > (a) belonging to the emperor,
> > (b) embalmed,
> > (c) tame,
> > (d) sucking pigs,
> > (e) sirens,
> > (f) fabulous,
> > (g) stray dogs,
> > (h) included in the present classification,
> > (i) frenzied,
> > (j) innumerable,
> > (k) drawn with a very fine camelhair brush,
> > (l) et cetera,
> > (m) having just broken the water pitcher,
> > (n) that from a long way off look like flies.
>
>
> Coincidentally, this was also the model for Linux capabilities.
But maybe we want to split the stray dogs up by breed.
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Should we split the network filesystem setup into two phases?
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
2018-08-15 16:51 ` Andy Lutomirski
@ 2018-08-16 3:51 ` Steve French
2018-08-16 5:06 ` Eric W. Biederman
2 siblings, 0 replies; 70+ messages in thread
From: Steve French @ 2018-08-16 3:51 UTC (permalink / raw)
To: David Howells
Cc: Steve French, Al Viro, Linus Torvalds, ebiederm, linux-api,
linux-security-module, linux-fsdevel, LKML, CIFS
This is worth further detailed discussion re:SMB3 as there are some fascinating
protocol features that might help here, but my first thought is just the obvious
one - this could help 'DFS' (the global name space feature that almost all modern
CIFS/SMB3 servers implement) work a little better in the client. A share can
be represented by an array of \\server\share\path targets, although typically
there is only one except in the DFS case (and the server can be an IPv4 or
IPv6 address or a host name, which could have multiple addresses).
It could be over RDMA, TCP, and even other protocols (as the transport).
There are various examples of DFS referrals in
https://msdn.microsoft.com/en-us/library/cc227066.aspx section 4.
But since SMB3 also supports transparent failover, and "share move"
and "server move" features, as well as multichannel - I would like
to better understand the patch set to see if it helps/hurts.
But until I dive into the patch set more and try it, it is hard for me to speculate.
Has anyone looked at the CIFS/SMB3 changes needed?
On Wed, Aug 15, 2018 at 11:32 AM David Howells <dhowells@redhat.com> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
> (*) For NFS, it would probably be a named server, with address(es) attached
> to the name. In lieu of actually having a name, the initial IP address
> could be used.
>
> (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
> allows an abstraction for a share that can move about inside a domain.
CIFS/SMB3 has fairly mature support (in the protocol) for various types
of share redirection (not just 'DFS' that is supported by most every
NAS server, and Macs, Windows, Linux clients etc). There are also
very interesting features introduced with SMB 3.1.1 allowing 'tree
connect contexts', which some important servers in the last few years
implement.
This is worth more discussion - SMB3 (in particular the SMB3.1.1 dialect) has
a lot of interesting features here.
--
Thanks,
Steve
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: Should we split the network filesystem setup into two phases?
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
2018-08-15 16:51 ` Andy Lutomirski
2018-08-16 3:51 ` Steve French
@ 2018-08-16 5:06 ` Eric W. Biederman
2018-08-16 16:24 ` Steve French
2018-08-17 23:11 ` Al Viro
2 siblings, 2 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-16 5:06 UTC (permalink / raw)
To: David Howells
Cc: trond.myklebust, anna.schumaker, sfrench, steved, viro, torvalds,
Eric W. Biederman, linux-api, linux-security-module,
linux-fsdevel, linux-kernel, linux-nfs, linux-cifs, linux-afs,
ceph-devel, v9fs-developer
David Howells <dhowells@redhat.com> writes:
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
> (*) For NFS, it would probably be a named server, with address(es) attached
> to the name. In lieu of actually having a name, the initial IP address
> could be used.
>
> (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
> allows an abstraction for a share that can move about inside a domain.
>
> (*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
> are a matter of direction from the Volume Location server.
>
> (*) For 9P and Ceph, I don't really know.
>
> What could be configured? Well, addresses, ports, timeouts. Maybe protocol
> level negotiation - though not being able to explicitly specify, say, the
> particular version and minorversion on an NFS share would be problematic for
> backward compatibility.
>
> One advantage it could give us is that it might make it easier if someone asks
> for server X to query userspace in some way for what the default parameters
> for X are.
>
> What might this look like in terms of userspace? Well, we could overload the
> new mount API:
>
> peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
> fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home");
> fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2");
> fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1");
> fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2");
> fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122");
> fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
> fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home");
> fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3");
> fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3");
> fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001");
> fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> fs = fsopen("nfs", 0);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
> fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
> fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
> m = fsmount(fs, 0, 0);
>
> [Note that Eric's oft-repeated point about the 'creation' operation altering
> established parameters still stands here.]
>
> You could also then reopen it for configuration, maybe by:
>
> peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);
>
> or:
>
> peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);
>
> though it might be better to give it its own syscall:
>
> peer = fspeer("nfs", "server.home", O_CLOEXEC);
> fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> ...
> fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
> echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>
> A further thought: is it worth making this idea more general and
> encompassing non-network devices also? This would run into issues of some
> logical sources being visible across namespaces but not others.
Even network filesystems are going to have challenges of filesystems
being visible in some network namespaces and not others, as some
filesystems will be visible on the internet and some filesystems will
only be visible on the appropriate local network. Network namespaces
are sometimes used to deal with the case of local networks with
overlapping IP addresses.
I think you are proposing a model for network filesystems that is
essentially the same situation where we are with most block device
filesystems today, where some parameters identify the local
filesystem instance and some parameters identify how the kernel
interacts with that filesystem instance.
For system efficiency there is a strong argument for having the fewest
number of filesystem instances we can. Otherwise we will be caching the
same data twice and wasting space in RAM etc.
So I like the idea.
At least for devpts we always create a new filesystem instance every
time mount(2) is called. NFS seems to have the option to create a new
filesystem instance every time mount(2) is called as well, (even if the
filesystem parameters are the same). And depending on the case I can
see the attraction for other filesystems as well.
So I don't think we can completely abandon the option for filesystems
to always create a new filesystem instance when mount(8) is called.
I most definitely support thinking this through and figuring out how it
best makes sense for the new filesystem API to create new filesystem
instances or fail to create new filesystem instances.
Eric
* Re: Should we split the network filesystem setup into two phases?
2018-08-16 5:06 ` Eric W. Biederman
@ 2018-08-16 16:24 ` Steve French
2018-08-16 17:21 ` Eric W. Biederman
2018-08-16 17:23 ` Aurélien Aptel
2018-08-17 23:11 ` Al Viro
1 sibling, 2 replies; 70+ messages in thread
From: Steve French @ 2018-08-16 16:24 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, trond.myklebust, Anna Schumaker, Steve French,
Steve Dickson, Al Viro, Linus Torvalds, ebiederm, linux-api,
linux-security-module, linux-fsdevel, LKML, linux-nfs, CIFS,
linux-afs, ceph-devel, v9fs-developer
On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> David Howells <dhowells@redhat.com> writes:
>
> > Having just re-ported NFS on top of the new mount API stuff, I find that I
> > don't really like the idea of superblocks being separated by communication
> > parameters - especially when it might seem reasonable to be able to adjust
> > those parameters.
> >
> > Does it make sense to abstract out the remote peer and allow (a) that to be
> > configured separately from any superblocks using it and (b) that to be used to
> > create superblocks?
<snip>
> At least for devpts we always create a new filesystem instance every
> time mount(2) is called. NFS seems to have the option to create a new
> filesystem instance every time mount(2) is called as well, (even if the
> filesystem parameters are the same). And depending on the case I can
> see the attraction for other filesystems as well.
>
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.
In cifs we attempt to match new mounts to existing tree connections
(instances of connections to a \\server\share) from other mount(s)
based first on whether security settings match (e.g. are both
Kerberos) and then on whether encryption is on/off and whether this is
a snapshot mount (smb3 previous versions feature). If neither is
mounted with a snapshot and the encryption settings match then
we will use the same tree id to talk with the server as the other
mounts use. Interesting idea to allow mount to force a new
tree id.
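The matching order described above can be sketched roughly as follows. This is only an illustration of the stated rules (security type, then encryption, then snapshot); struct demo_tcon, its fields and demo_tcon_can_share() are hypothetical names and are not taken from cifs.ko.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a tree connection; not a cifs.ko structure. */
struct demo_tcon {
	const char *unc;   /* the \\server\share being connected to */
	int sectype;       /* security type; the numeric values are made up */
	bool encrypted;    /* encryption on/off */
	bool snapshot;     /* mounted as a snapshot ("previous versions") */
};

/*
 * Return true if a new mount asking for 'want' could reuse the existing
 * tree connection 'have', following the order given in the text.
 */
bool demo_tcon_can_share(const struct demo_tcon *have,
			 const struct demo_tcon *want)
{
	if (strcmp(have->unc, want->unc) != 0)
		return false;	/* different share */
	if (have->sectype != want->sectype)
		return false;	/* e.g. one Kerberos, one not */
	if (have->encrypted != want->encrypted)
		return false;	/* encryption settings differ */
	if (have->snapshot || want->snapshot)
		return false;	/* snapshot mounts are not shared */
	return true;
}
```

A "force a new tree id" option would then amount to skipping this check and always creating a fresh connection.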
What was the NFS mount option you were talking about?
Looking at the nfs man page the only one that looked similar
was "nosharecache"
> I most definitely support thinking this through and figuring out how it
> best makes sense for the new filesystem API to create new filesystem
> instances or fail to create new filesystem instances.
Yes - it is an interesting question.
--
Thanks,
Steve
* Re: Should we split the network filesystem setup into two phases?
2018-08-16 16:24 ` Steve French
@ 2018-08-16 17:21 ` Eric W. Biederman
2018-08-16 17:23 ` Aurélien Aptel
1 sibling, 0 replies; 70+ messages in thread
From: Eric W. Biederman @ 2018-08-16 17:21 UTC (permalink / raw)
To: Steve French
Cc: David Howells, trond.myklebust, Anna Schumaker, Steve French,
Steve Dickson, Al Viro, Linus Torvalds, ebiederm, linux-api,
linux-security-module, linux-fsdevel, LKML, linux-nfs, CIFS,
linux-afs, ceph-devel, v9fs-developer
Steve French <smfrench@gmail.com> writes:
> On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> David Howells <dhowells@redhat.com> writes:
>>
>> > Having just re-ported NFS on top of the new mount API stuff, I find that I
>> > don't really like the idea of superblocks being separated by communication
>> > parameters - especially when it might seem reasonable to be able to adjust
>> > those parameters.
>> >
>> > Does it make sense to abstract out the remote peer and allow (a) that to be
>> > configured separately from any superblocks using it and (b) that to be used to
>> > create superblocks?
> <snip>
>> At least for devpts we always create a new filesystem instance every
>> time mount(2) is called. NFS seems to have the option to create a new
>> filesystem instance every time mount(2) is called as well, (even if the
>> filesystem parameters are the same). And depending on the case I can
>> see the attraction for other filesystems as well.
>>
>> So I don't think we can completely abandon the option for filesystems
>> to always create a new filesystem instance when mount(8) is called.
>
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snapshot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.
>
> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"
I was remembering this from reading the nfs mount code:
	static int nfs_compare_super(struct super_block *sb, void *data)
	{
		...
		if (!nfs_compare_super_address(old, server))
			return 0;
		/* Note: NFS_MOUNT_UNSHARED == NFS4_MOUNT_UNSHARED */
		if (old->flags & NFS_MOUNT_UNSHARED)
			return 0;
		...
	}
If a filesystem has NFS_MOUNT_UNSHARED set it does not serve as a
candidate for new mount requests. Skimming the code it looks like
nosharecache is what sets NFS_MOUNT_UNSHARED.
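To make the effect of that check concrete, here is a small userspace model of the lookup. It is a sketch only: struct demo_server, DEMO_MOUNT_UNSHARED and both functions are hypothetical stand-ins, and the address comparison is reduced to a string compare. It models just the two tests visible in the excerpt above, not the elided parts of nfs_compare_super().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for NFS_MOUNT_UNSHARED. */
#define DEMO_MOUNT_UNSHARED 0x1u

/* Hypothetical stand-in for the per-superblock server record. */
struct demo_server {
	const char *address;
	unsigned int flags;
};

/* Mirrors the two checks in the excerpt: address first, then the flag. */
bool demo_compare_super(const struct demo_server *old,
			const struct demo_server *want)
{
	if (strcmp(old->address, want->address) != 0)
		return false;
	if (old->flags & DEMO_MOUNT_UNSHARED)
		return false;	/* never a candidate for sharing */
	return true;
}

/* Return a reusable instance, or NULL if a new one must be created. */
const struct demo_server *demo_find_super(const struct demo_server *existing,
					  size_t n,
					  const struct demo_server *want)
{
	for (size_t i = 0; i < n; i++)
		if (demo_compare_super(&existing[i], want))
			return &existing[i];
	return NULL;
}
```

With this model, an instance carrying the unshared flag is skipped even when its address matches, so a later mount request gets either another matching instance or a brand new one.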
Another interesting and common case is tmpfs which always creates a new
filesystem instance whenever it is mounted.
Eric
* Re: Should we split the network filesystem setup into two phases?
2018-08-16 16:24 ` Steve French
2018-08-16 17:21 ` Eric W. Biederman
@ 2018-08-16 17:23 ` Aurélien Aptel
2018-08-16 18:36 ` Steve French
1 sibling, 1 reply; 70+ messages in thread
From: Aurélien Aptel @ 2018-08-16 17:23 UTC (permalink / raw)
To: Steve French, Eric W. Biederman
Cc: David Howells, trond.myklebust, Anna Schumaker, Steve French,
Steve Dickson, Al Viro, Linus Torvalds, ebiederm, linux-api,
linux-security-module, linux-fsdevel, LKML, linux-nfs, CIFS,
linux-afs, ceph-devel, v9fs-developer
Steve French <smfrench@gmail.com> writes:
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snapshot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.
We actually already have this mount option in cifs.ko, it's "nosharesock".
> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"
Cheers,
--
Aurélien Aptel / SUSE Labs Samba Team
GPG: 1839 CB5F 9F5B FB9B AA97 8C99 03C8 A49B 521B D5D3
SUSE Linux GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
* Re: Should we split the network filesystem setup into two phases?
2018-08-16 17:23 ` Aurélien Aptel
@ 2018-08-16 18:36 ` Steve French
0 siblings, 0 replies; 70+ messages in thread
From: Steve French @ 2018-08-16 18:36 UTC (permalink / raw)
To: Aurélien Aptel
Cc: Eric W. Biederman, David Howells, trond.myklebust, Anna Schumaker,
Steve French, Steve Dickson, Al Viro, Linus Torvalds, ebiederm,
linux-api, linux-security-module, linux-fsdevel, LKML, linux-nfs,
CIFS, linux-afs, ceph-devel, v9fs-developer
On Thu, Aug 16, 2018 at 12:23 PM Aurélien Aptel <aaptel@suse.com> wrote:
>
> Steve French <smfrench@gmail.com> writes:
> > In cifs we attempt to match new mounts to existing tree connections
> > (instances of connections to a \\server\share) from other mount(s)
> > based first on whether security settings match (e.g. are both
> > Kerberos) and then on whether encryption is on/off and whether this is
> > a snapshot mount (smb3 previous versions feature). If neither is
> > mounted with a snapshot and the encryption settings match then
> > we will use the same tree id to talk with the server as the other
> > mounts use. Interesting idea to allow mount to force a new
> > tree id.
>
> We actually already have this mount option in cifs.ko, it's "nosharesock".
Yes - good point. It is very easy to do on cifs. I mainly use that to simulate
multiple clients for testing servers (so each mount to the same server,
whether or not the share matches, looks like a different client, coming
from a different socket and thus with different session ids and tree
ids as well).
It is very useful when trying to simulate multiple clients talking to the same
server while using only one client machine (or VM).
> > What was the NFS mount option you were talking about?
> > Looking at the nfs man page the only one that looked similar
> > was "nosharecache"
The nfs man page apparently discourages its use:
"As of kernel 2.6.18, the behavior specified by nosharecache is legacy
caching behavior. This is considered a data risk"
--
Thanks,
Steve
* Re: Should we split the network filesystem setup into two phases?
2018-08-16 5:06 ` Eric W. Biederman
2018-08-16 16:24 ` Steve French
@ 2018-08-17 23:11 ` Al Viro
1 sibling, 0 replies; 70+ messages in thread
From: Al Viro @ 2018-08-17 23:11 UTC (permalink / raw)
To: Eric W. Biederman
Cc: David Howells, trond.myklebust, anna.schumaker, sfrench, steved,
torvalds, Eric W. Biederman, linux-api, linux-security-module,
linux-fsdevel, linux-kernel, linux-nfs, linux-cifs, linux-afs,
ceph-devel, v9fs-developer
On Thu, Aug 16, 2018 at 12:06:06AM -0500, Eric W. Biederman wrote:
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.
Huh? If filesystem wants to create a new instance on each ->mount(),
it can bloody well do so. Quite a few do - if that fs can handle
that, more power to it.
The problem is what to do with filesystems that *can't* do that.
You really, really can't have two ext4 (or xfs, etc.) instances over
the same device at the same time. Cache coherency, locking, etc.
will kill you.
And that's not to mention the joy of defining the semantics of
having the same ext4 mounted with two logs at the same time ;-)
I've seen "reject unless the options are compatible/identical/whatever",
but that ignores the real problem with existing policy. It's *NOT*
"I've mounted this and got an existing instance with non-matching
options". That's a minor annoyance (and back when that decision
had been made, mount(2) was very definitely root-only). The real
problem is different and much worse - it's remount.
I have asked to mount something and it had already been mounted,
with identical options. OK, so what happens if I do mount -o remount
on my instance? *IF* we are operating in the "only sysadmin can
mount new filesystems" mode, it's not a big deal - there are already
lots of ways you can shoot yourself in the foot and mount(2) is
certainly a powerful one. But if we get to "Joe R. Luser can do
it in his container", we have a big problem.
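The remount hazard can be shown with a toy model of two mounts that ended up sharing one instance. Everything here is hypothetical (struct demo_sb, struct demo_mount, DEMO_SB_RDONLY and the helpers are illustration names, not kernel code); the only point is that the options live on the shared object, so a remount through one mount changes what the other one sees.

```c
#include <stdbool.h>

/* Hypothetical read-only flag on the shared instance. */
#define DEMO_SB_RDONLY 0x1u

/* One filesystem instance; its options live here, not in the mounts. */
struct demo_sb {
	unsigned int flags;
};

/* A mount is just a reference to an instance. */
struct demo_mount {
	struct demo_sb *sb;
};

/* "Mount" by attaching to an already existing instance. */
void demo_attach(struct demo_mount *m, struct demo_sb *existing)
{
	m->sb = existing;
}

/* "mount -o remount": rewrites the options of the shared instance. */
void demo_remount(struct demo_mount *m, unsigned int new_flags)
{
	m->sb->flags = new_flags;
}

bool demo_is_rdonly(const struct demo_mount *m)
{
	return (m->sb->flags & DEMO_SB_RDONLY) != 0;
}
```

If a sysadmin and a container user both hold mounts attached to the same demo_sb, a demo_remount() issued by the container user flips the sysadmin's mount to read-only as well.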
Decision back then had been mostly for usability reasons - it was
back in 2001 (well before the containermania, userns or anything
of that sort) and it was more about "how many hoops does one have
to jump through to get something mounted, assuming the sanity of
sysadmin doing that?". If *anything* like userns had been a concern
back then, it probably would've been different. However, it's
17 years too late and if anyone has a functional TARDIS, I can
easily think of better uses for it...
* Re: [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #11]
2018-08-01 15:27 ` [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration " David Howells
@ 2018-08-24 14:51 ` Miklos Szeredi
2018-08-24 14:54 ` Andy Lutomirski
0 siblings, 1 reply; 70+ messages in thread
From: Miklos Szeredi @ 2018-08-24 14:51 UTC (permalink / raw)
To: David Howells; +Cc: Al Viro, Linux API, Linus Torvalds, linux-fsdevel, LKML
On Wed, Aug 1, 2018 at 5:29 PM David Howells <dhowells@redhat.com> wrote:
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;
>
> #define FSMOUNT_CLOEXEC 0x00000001
>
> +#define FSPICK_CLOEXEC 0x00000001
> +#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
> +#define FSPICK_NO_AUTOMOUNT 0x00000004
> +#define FSPICK_EMPTY_PATH 0x00000008
This caught my eye: why aren't we using the AT_ constants? Adding an
AT_CLOEXEC sounds less horrible than duplicating all the lookup
related flags for FSPICK...
Thanks,
Miklos
> +
> /*
> * The type of fsconfig() call made.
> */
>
* Re: [PATCH 30/33] vfs: syscall: Add fspick() to select a superblock for reconfiguration [ver #11]
2018-08-24 14:51 ` Miklos Szeredi
@ 2018-08-24 14:54 ` Andy Lutomirski
0 siblings, 0 replies; 70+ messages in thread
From: Andy Lutomirski @ 2018-08-24 14:54 UTC (permalink / raw)
To: Miklos Szeredi
Cc: David Howells, Al Viro, Linux API, Linus Torvalds, linux-fsdevel,
LKML
> On Aug 24, 2018, at 7:51 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> On Wed, Aug 1, 2018 at 5:29 PM David Howells <dhowells@redhat.com> wrote:
>>
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -351,6 +351,11 @@ typedef int __bitwise __kernel_rwf_t;
>>
>> #define FSMOUNT_CLOEXEC 0x00000001
>>
>> +#define FSPICK_CLOEXEC 0x00000001
>> +#define FSPICK_SYMLINK_NOFOLLOW 0x00000002
>> +#define FSPICK_NO_AUTOMOUNT 0x00000004
>> +#define FSPICK_EMPTY_PATH 0x00000008
>
> This caught my eye: why aren't we using the AT_ constants? Adding an
> AT_CLOEXEC sounds less horrible than duplicating all the lookup
> related flags for FSPICK...
For a totally new API, is there any need to support !CLOEXEC? A caller can safely remove the CLOEXEC bit without races.