[PATCH 00/33] VFS: Introduce filesystem context [ver #11]

linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00/33] VFS: Introduce filesystem context [ver #11]
@ 2018-08-01 15:23 David Howells
  2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
                   ` (7 more replies)
  0 siblings, 8 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:23 UTC (permalink / raw)
  To: linux-security-module


Hi Al,

Here are a set of patches to create a filesystem context prior to setting
up a new mount, populating it with the parsed options/binary data, creating
the superblock and then effecting the mount.  This is also used for remount
since much of the parsing stuff is common in many filesystems.

This allows namespaces and other information to be conveyed through the
mount procedure.

This also allows Mikl?s Szeredi's idea of doing:

	fd = fsopen("nfs");
	fsconfig(fd, FSCONFIG_SET_STRING, "option", "val", 0);
	fsconfig(fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
	mfd = fsmount(fd, MS_NODEV);
	move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);

that he presented at LSF-2017 to be implemented (see the relevant patches
in the series).

I didn't use netlink as that would make the core kernel depend on
CONFIG_NET and CONFIG_NETLINK and would introduce network namespacing
issues.

I've implemented filesystem context handling for procfs, nfs, mqueue,
cpuset, kernfs, sysfs, cgroup and afs filesystems.

Unconverted filesystems are handled by a legacy filesystem wrapper.


====================
WHY DO WE WANT THIS?
====================

Firstly, there's a bunch of problems with the mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one and
     weird combinations of flags make it do different things beyond the
     original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
     to be mapped back to a small error code.  Yes, there's dmesg - if you
     have it configured - but you can't necessarily see that if you're
     doing a mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name
     and options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace.  I could, for example,
     build a container without having to be in that container's namespace
     if I can do it from outside.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel api:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types
     with different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you
     have to generate or otherwise provide text config data which then gets
     parsed, when some of the time you could bypass the parsing stage
     entirely.

 (3) The amount of data in the data buffer is not known, but the data
     buffer might be on a kernel stack somewhere, leading to the
     possibility of tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times
     so that it can have multiple parsers applied to it or because it has
     to be parsed multiple times - for instance, once to get the
     preliminary info required to access the on-disk superblock and then
     again to update the superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation.  I want to be able to
     install a translation table of some sort on the superblock to
     translate source identifiers (which may be foreign numeric UIDs/GIDs,
     text names, GUIDs) into system identifiers.  This needs to be done
     before the superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
     	 userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

	fsconfig(fd, fsconfig_translate_uid, "<srcuid> <nsuid> <count>", 0, 0);

     The translation information also needs to propagate over an automount
     in some circumstances.

 (2) Namespace configuration.  I want to be able to tell the superblock
     creation process what namespaces should be applied when it created (in
     particular the userns and netns) for containerisation purposes, e.g.:

	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "user", 0, userns_fd);
	fsconfig(fd, FSCONFIG_SET_NAMESPACE, "net", 0, netns_fd);

 (3) Namespace propagation.  I want to have a properly defined mechanism
     for propagating namespace configuration over automounts within the
     kernel.  This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.  A chunk of the changes is actually the
     fsinfo() syscall to query attributes of the filesystem beyond what's
     available in statx() and statfs().  This will allow a created
     superblock to be queried before it is published.

 (5) Upcall for configuration.  I would like to be able to query
     configuration that's stored in userspace when an automount is made.
     For instance, to look up network parameters for NFS or to find a cache
     selector for fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

	[/etc/fscontext.d/afs/example.com/cell.cfg]
	realm = EXAMPLE.COM
	translation = uid,3000,4000,100
	fscache = tag=fred

 (6) Event notifications.  I want to be able to install a watch on a
     superblock before it is published to catch things like quota events
     and EIO.

 (7) Large and binary parameters.  There might be at some point a need to
     pass large/binary objects like Microsoft PACs around.  If I understand
     PACs correctly, you can obtain these from the Kerberos server and then
     pass them to the file server when you connect.

     Having it possible to pass large or binary objects as individual
     fsconfig calls make parsing these trivial.  OTOH, some or all of this
     can potentially be handled with the use of the keyrings interface - as
     the afs filesystem does for passing kerberos tokens around; it's just
     that that seems overkill for a parameter you may only need once.


===================
SIGNIFICANT CHANGES
===================

 ver #11:

 (*) Fixed AppArmor.

 (*) Capitalised all the UAPI constants.

 (*) Explicitly numbered the FSCONFIG_* UAPI constants.

 (*) Removed all the places ANON_INODES is selected.

 (*) Fixed a bug whereby the context gets freed twice (which broke mounts of
     procfs).

 (*) Split fsinfo() off into its own patch series.

 ver #10:

 (*) Renamed "option" to "parameter" in a number of places.

 (*) Replaced the use of write() to drive the configuration with an fsconfig()
     syscall.  This also allows at-style paths and fds to be presented as typed
     object.

 (*) Routed the key=value parameter concept all the way through from the
     fsconfig() system call to the LSM and filesystem.

 (*) Added a parameter-description concept and helper functions to help
     interpret a parameter and possibly convert the value.

 (*) Made it possible to query the parameter description using the fsinfo()
     syscall.  Added a test-fs-query sample to dump the parameters used by a
     filesystem.

 ver #9:

 (*) Dropped the fd cookie stuff and the FMODE_*/O_* split stuff.

 (*) Al added an open_tree() system call to allow a mount tree to be picked
     referenced or cloned into an O_PATH-style fd.  This can then be used
     with sys_move_mount().  Dropped the O_CLONE_MOUNT and O_NON_RECURSIVE
     open() flags.

 (*) Brought error logging back in, though only in the fs_context and not
     in the task_struct.

 (*) Separated MS_REMOUNT|MS_BIND handling from MS_REMOUNT handling.

 (*) Used anon_inodes for the fd returned by fsopen() and fspick().  This
     requires making it unconditional.

 (*) Fixed lots of bugs.  Especial thanks to Al and Eric Biggers for
     finding them and providing patches.

 (*) Wrote manual pages, which I'll post separately.

 ver #8:

 (*) Changed the way fsmount() mounts into the namespace according to some
     of Al's ideas.

 (*) Put better typing on the fd cookie obtained from __fdget() & co..

 (*) Stored the fd cookie in struct nameidata rather than the dfd number.

 (*) Changed sys_fsmount() to return an O_PATH-style fd rather than
     actually mounting into the mount namespace.

 (*) Separated internal FMODE_* handling from O_* handling to free up
     certain O_* flag numbers.

 (*) Added two new open flags (O_CLONE_MOUNT and O_NON_RECURSIVE) for use
     with open(O_PATH) to copy a mount or mount-subtree to an O_PATH fd.

 (*) Added a new syscall, sys_move_mount(), to move a mount from an
     dfd+path source to a dfd+path destination.

 (*) Added a file->f_mode flag (FMODE_NEED_UNMOUNT) that indicates that the
     vfsmount attached to file->f_path needs 'unmounting' if set.

 (*) Made sys_move_mount() clear FMODE_NEED_UNMOUNT if successful.

	[!] This doesn't work quite right.

 (*) Added a new syscall, fsinfo(), to query information about a
     filesystem.  The idea being that this will, in future, work with the
     fd from fsopen() too and permit querying of the parameters and
     metadata before fsmount() is called.

 ver #7:

 (*) Undo an incorrect MS_* -> SB_* conversion.

 (*) Pass the mount data buffer size to all the mount-related functions that
     take the data pointer.  This fixes a problem where someone (say SELinux)
     tries to copy the mount data, assuming it to be a page in size, and
     overruns the buffer - thereby incurring an oops by hitting a guard page.

 (*) Made the AFS filesystem use them as an example.  This is a much easier to
     deal with than with NFS or Ext4 as there are very few mount options.

 ver #6:

 (*) Dropped the supplementary error string facility for the moment.

 (*) Dropped the NFS patches for the moment.

 (*) Dropped the reserved file descriptor argument from fsopen() and
     replaced it with three reserved pointers that must be NULL.

 ver #5:

 (*) Renamed sb_config -> fs_context and adjusted variable names.

 (*) Differentiated the flags in sb->s_flags (now named SB_*) from those
     passed to mount(2) (named MS_*).

 (*) Renamed __vfs_new_fs_context() to vfs_new_fs_context() and made the
     caller always provide a struct file_system_type pointer and the
     parameters required.

 (*) Got rid of vfs_submount_fc() in favour of passing
     FS_CONTEXT_FOR_SUBMOUNT to vfs_new_fs_context().  The purpose is now
     used more.

 (*) Call ->validate() on the remount path.

 (*) Got rid of the inode locking in sys_fsmount().

 (*) Call security_sb_mountpoint() in the mount(2) path.

 ver #4:

 (*) Split the sb_config patch up somewhat.

 (*) Made the supplementary error string facility something attached to the
     task_struct rather than the sb_config so that error messages can be
     obtained from NFS doing a mount-root-and-pathwalk inside the
     nfs_get_tree() operation.

     Further, made this managed and read by prctl rather than through the
     mount fd so that it's more generally available.

 ver #3:

 (*) Rebased on 4.12-rc1.

 (*) Split the NFS patch up somewhat.

 ver #2:

 (*) Removed the ->fill_super() from sb_config_operations and passed it in
     directly to functions that want to call it.  NFS now calls
     nfs_fill_super() directly rather than jumping through a pointer to it
     since there's only the one option at the moment.

 (*) Removed ->mnt_ns and ->sb from sb_config and moved ->pid_ns into
     proc_sb_config.

 (*) Renamed create_super -> get_tree.

 (*) Renamed struct mount_context to struct sb_config and amended various
     variable names.

 (*) sys_fsmount() acquired AT_* flags and MS_* flags (for MNT_* flags)
     arguments.

 ver #1:

 (*) Split the sb_config stuff out into its own header.

 (*) Support non-context aware filesystems through a special set of
     sb_config operations.

 (*) Stored the created superblock and root dentry into the sb_config after
     creation rather than directly into a vfsmount.  This allows some
     arguments to be removed to various NFS functions.

 (*) Added an explicit superblock-creation step.  This allows a created
     superblock to then be mounted multiple times.

 (*) Added a flag to say that the sb_config is degraded and cannot have
     another go at having a superblock creation whilst getting rid of the
     one that says it's already mounted.

Possible further developments:

 (*) Implement sb reconfiguration (for now it returns ENOANO).

 (*) Implement mount context support in more filesystems, ext4 being next
     on my list.

 (*) Move the walk-from-root stuff that nfs has to generic code so that you
     can do something akin to:

	mount /dev/sda1:/foo/bar /mnt

     See nfs_follow_remote_path() and mount_subtree().  This is slightly
     tricky in NFS as we have to prevent referral loops.

 (*) Work out how to get at the error message incurred by submounts
     encountered during nfs_follow_remote_path().

     Should the error message be moved to task_struct and made more
     general, perhaps retrieved with a prctl() function?

 (*) Clean up/consolidate the security functions.  Possibly add a
     validation hook to be called at the same time as the mount context
     validate op.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

tagged as:

	mount-api-20180801

on branch:

	mount-api

David
---
Al Viro (2):
      vfs: syscall: Add open_tree(2) to reference or clone a mount
      teach move_mount(2) to work with OPEN_TREE_CLONE

David Howells (31):
      vfs: syscall: Add move_mount(2) to move mounts around
      vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled
      vfs: Introduce the basic header for the new mount API's filesystem context
      vfs: Introduce logging functions
      vfs: Add configuration parser helpers
      vfs: Add LSM hooks for the new mount API
      selinux: Implement the new mount API LSM hooks
      smack: Implement filesystem context security hooks
      apparmor: Implement security hooks for the new mount API
      tomoyo: Implement security hooks for the new mount API
      vfs: Separate changing mount flags full remount
      vfs: Implement a filesystem superblock creation/configuration context
      vfs: Remove unused code after filesystem context changes
      procfs: Move proc_fill_super() to fs/proc/root.c
      proc: Add fs_context support to procfs
      ipc: Convert mqueue fs to fs_context
      cpuset: Use fs_context
      kernfs, sysfs, cgroup, intel_rdt: Support fs_context
      hugetlbfs: Convert to fs_context
      vfs: Remove kern_mount_data()
      vfs: Provide documentation for new mount API
      Make anon_inodes unconditional
      vfs: syscall: Add fsopen() to prepare for superblock creation
      vfs: Implement logging through fs_context
      vfs: Add some logging to the core users of the fs_context log
      vfs: syscall: Add fsconfig() for configuring and managing a context
      vfs: syscall: Add fsmount() to create a mount for a superblock
      vfs: syscall: Add fspick() to select a superblock for reconfiguration
      afs: Add fs_context support
      afs: Use fs_context to pass parameters over automount
      vfs: Add a sample program for the new mount API


 Documentation/filesystems/mount_api.txt  |  706 ++++++++++++++++++++++++
 arch/arc/kernel/setup.c                  |    1 
 arch/arm/kernel/atags_parse.c            |    1 
 arch/arm/kvm/Kconfig                     |    1 
 arch/arm64/kvm/Kconfig                   |    1 
 arch/mips/kvm/Kconfig                    |    1 
 arch/powerpc/kvm/Kconfig                 |    1 
 arch/s390/kvm/Kconfig                    |    1 
 arch/sh/kernel/setup.c                   |    1 
 arch/sparc/kernel/setup_32.c             |    1 
 arch/sparc/kernel/setup_64.c             |    1 
 arch/x86/Kconfig                         |    1 
 arch/x86/entry/syscalls/syscall_32.tbl   |    6 
 arch/x86/entry/syscalls/syscall_64.tbl   |    6 
 arch/x86/kernel/cpu/intel_rdt.h          |   15 +
 arch/x86/kernel/cpu/intel_rdt_rdtgroup.c |  184 ++++--
 arch/x86/kernel/setup.c                  |    1 
 arch/x86/kvm/Kconfig                     |    1 
 drivers/base/Kconfig                     |    1 
 drivers/base/devtmpfs.c                  |    1 
 drivers/char/tpm/Kconfig                 |    1 
 drivers/dma-buf/Kconfig                  |    1 
 drivers/gpio/Kconfig                     |    1 
 drivers/iio/Kconfig                      |    1 
 drivers/infiniband/Kconfig               |    1 
 drivers/vfio/Kconfig                     |    1 
 fs/Kconfig                               |    7 
 fs/Makefile                              |    5 
 fs/afs/internal.h                        |    9 
 fs/afs/mntpt.c                           |  148 +++--
 fs/afs/super.c                           |  470 +++++++++-------
 fs/afs/volume.c                          |    4 
 fs/f2fs/super.c                          |    2 
 fs/file_table.c                          |    9 
 fs/filesystems.c                         |    4 
 fs/fs_context.c                          |  779 +++++++++++++++++++++++++++
 fs/fs_parser.c                           |  476 ++++++++++++++++
 fs/fsopen.c                              |  491 +++++++++++++++++
 fs/hugetlbfs/inode.c                     |  392 ++++++++------
 fs/internal.h                            |   13 
 fs/kernfs/mount.c                        |   87 +--
 fs/libfs.c                               |   19 +
 fs/namei.c                               |    4 
 fs/namespace.c                           |  867 +++++++++++++++++++++++-------
 fs/notify/fanotify/Kconfig               |    1 
 fs/notify/inotify/Kconfig                |    1 
 fs/pnode.c                               |    1 
 fs/proc/inode.c                          |   51 --
 fs/proc/internal.h                       |    6 
 fs/proc/root.c                           |  245 ++++++--
 fs/super.c                               |  368 ++++++++++---
 fs/sysfs/mount.c                         |   67 ++
 include/linux/cgroup.h                   |    3 
 include/linux/fs.h                       |   21 +
 include/linux/fs_context.h               |  208 +++++++
 include/linux/fs_parser.h                |  116 ++++
 include/linux/kernfs.h                   |   39 +
 include/linux/lsm_hooks.h                |   70 ++
 include/linux/module.h                   |    6 
 include/linux/mount.h                    |    5 
 include/linux/security.h                 |   61 ++
 include/linux/syscalls.h                 |    9 
 include/uapi/linux/fcntl.h               |    2 
 include/uapi/linux/fs.h                  |   82 +--
 include/uapi/linux/mount.h               |   75 +++
 init/Kconfig                             |   10 
 init/do_mounts.c                         |    1 
 init/do_mounts_initrd.c                  |    1 
 ipc/mqueue.c                             |  121 +++-
 kernel/cgroup/cgroup-internal.h          |   50 +-
 kernel/cgroup/cgroup-v1.c                |  347 +++++++-----
 kernel/cgroup/cgroup.c                   |  256 ++++++---
 kernel/cgroup/cpuset.c                   |   68 ++
 samples/Kconfig                          |    6 
 samples/Makefile                         |    2 
 samples/mount_api/Makefile               |    7 
 samples/mount_api/test-fsmount.c         |  118 ++++
 security/apparmor/include/mount.h        |   11 
 security/apparmor/lsm.c                  |  108 ++++
 security/apparmor/mount.c                |   47 ++
 security/security.c                      |   51 ++
 security/selinux/hooks.c                 |  311 ++++++++++-
 security/smack/smack.h                   |   11 
 security/smack/smack_lsm.c               |  370 ++++++++++++-
 security/tomoyo/common.h                 |    3 
 security/tomoyo/mount.c                  |   46 ++
 security/tomoyo/tomoyo.c                 |   15 +
 87 files changed, 6637 insertions(+), 1484 deletions(-)
 create mode 100644 Documentation/filesystems/mount_api.txt
 create mode 100644 fs/fs_context.c
 create mode 100644 fs/fs_parser.c
 create mode 100644 fs/fsopen.c
 create mode 100644 include/linux/fs_context.h
 create mode 100644 include/linux/fs_parser.h
 create mode 100644 include/uapi/linux/mount.h
 create mode 100644 samples/mount_api/Makefile
 create mode 100644 samples/mount_api/test-fsmount.c

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 08/33] vfs: Add LSM hooks for the new mount API [ver #11]
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
@ 2018-08-01 15:24 ` David Howells
  2018-08-01 20:50   ` James Morris
  2018-08-01 22:53   ` David Howells
  2018-08-01 15:25 ` [PATCH 09/33] selinux: Implement the new mount API LSM hooks " David Howells
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:24 UTC (permalink / raw)
  To: linux-security-module

Add LSM hooks for use by the new mount API and filesystem context code.
This includes:

 (1) Hooks to handle allocation, duplication and freeing of the security
     record attached to a filesystem context.

 (2) A hook to snoop source specifications.  There may be multiple of these
     if the filesystem supports it.  They will to be local files/devices if
     fs_context::source_is_dev is true and will be something else, possibly
     remote server specifications, if false.

 (3) A hook to snoop superblock configuration options in key[=val] form.
     If the LSM decides it wants to handle it, it can suppress the option
     being passed to the filesystem.  Note that 'val' may include commas
     and binary data with the fsopen patch.

 (4) A hook to perform validation and allocation after the configuration
     has been done but before the superblock is allocated and set up.

 (5) A hook to transfer the security from the context to a newly created
     superblock.

 (6) A hook to rule on whether a path point can be used as a mountpoint.

These are intended to replace:

	security_sb_copy_data
	security_sb_kern_mount
	security_sb_mount
	security_sb_set_mnt_opts
	security_sb_clone_mnt_opts
	security_sb_parse_opts_str

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-security-module at vger.kernel.org
---

 include/linux/lsm_hooks.h |   61 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/security.h  |   47 +++++++++++++++++++++++++++++++++++
 security/security.c       |   41 ++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 924424e7be8f..60a9a40bd46c 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -76,6 +76,49 @@
  *	changes on the process such as clearing out non-inheritable signal
  *	state.  This is called immediately after commit_creds().
  *
+ * Security hooks for mount using fs_context.
+ *	[See also Documentation/filesystems/mounting.txt]
+ *
+ * @fs_context_alloc:
+ *	Allocate and attach a security structure to sc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@fc indicates the new filesystem context.
+ *	@reference indicates the source dentry of a submount or start of reconfig.
+ * @fs_context_dup:
+ *	Allocate and attach a security structure to sc->security.  This pointer
+ *	is initialised to NULL by the caller.
+ *	@fc indicates the new filesystem context.
+ *	@src_fc indicates the original filesystem context.
+ * @fs_context_free:
+ *	Clean up a filesystem context.
+ *	@fc indicates the filesystem context.
+ * @fs_context_parse_param:
+ *	Userspace provided a parameter to configure a superblock.  The LSM may
+ *	reject it with an error and may use it for itself, in which case it
+ *	should return 1; otherwise it should return 0 to pass it on to the
+ *	filesystem.
+ *	@fc indicates the filesystem context.
+ *	@param The parameter
+ * @fs_context_validate:
+ *	Validate the filesystem context preparatory to applying it.  This is
+ *	done after all the options have been parsed.
+ *	@fc indicates the filesystem context.
+ * @sb_get_tree:
+ *	Assign the security to a newly created superblock.
+ *	@fc indicates the filesystem context.
+ *	@fc->root indicates the root that will be mounted.
+ *	@fc->root->d_sb points to the superblock.
+ * @sb_reconfigure:
+ *	Apply reconfiguration to the security on a superblock.
+ *	@fc indicates the filesystem context.
+ *	@fc->root indicates a dentry in the mount.
+ *	@fc->root->d_sb points to the superblock.
+ * @sb_mountpoint:
+ *	Equivalent of sb_mount, but with an fs_context.
+ *	@fc indicates the filesystem context.
+ *	@mountpoint indicates the path on which the mount will take place.
+ *	@mnt_flags indicates the MNT_* flags specified.
+ *
  * Security hooks for filesystem operations.
  *
  * @sb_alloc_security:
@@ -1462,6 +1505,16 @@ union security_list_options {
 	void (*bprm_committing_creds)(struct linux_binprm *bprm);
 	void (*bprm_committed_creds)(struct linux_binprm *bprm);
 
+	int (*fs_context_alloc)(struct fs_context *fc, struct dentry *reference);
+	int (*fs_context_dup)(struct fs_context *fc, struct fs_context *src_sc);
+	void (*fs_context_free)(struct fs_context *fc);
+	int (*fs_context_parse_param)(struct fs_context *fc, struct fs_parameter *param);
+	int (*fs_context_validate)(struct fs_context *fc);
+	int (*sb_get_tree)(struct fs_context *fc);
+	void (*sb_reconfigure)(struct fs_context *fc);
+	int (*sb_mountpoint)(struct fs_context *fc, struct path *mountpoint,
+			     unsigned int mnt_flags);
+
 	int (*sb_alloc_security)(struct super_block *sb);
 	void (*sb_free_security)(struct super_block *sb);
 	int (*sb_copy_data)(char *orig, size_t orig_size, char *copy);
@@ -1803,6 +1856,14 @@ struct security_hook_heads {
 	struct hlist_head bprm_check_security;
 	struct hlist_head bprm_committing_creds;
 	struct hlist_head bprm_committed_creds;
+	struct hlist_head fs_context_alloc;
+	struct hlist_head fs_context_dup;
+	struct hlist_head fs_context_free;
+	struct hlist_head fs_context_parse_param;
+	struct hlist_head fs_context_validate;
+	struct hlist_head sb_get_tree;
+	struct hlist_head sb_reconfigure;
+	struct hlist_head sb_mountpoint;
 	struct hlist_head sb_alloc_security;
 	struct hlist_head sb_free_security;
 	struct hlist_head sb_copy_data;
diff --git a/include/linux/security.h b/include/linux/security.h
index 9bb5bc6d596c..a8b1056ed6e4 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -53,6 +53,9 @@ struct msg_msg;
 struct xattr;
 struct xfrm_sec_ctx;
 struct mm_struct;
+struct fs_context;
+struct fs_parameter;
+enum fs_value_type;
 
 /* If capable should audit the security request */
 #define SECURITY_CAP_NOAUDIT 0
@@ -225,6 +228,15 @@ int security_bprm_set_creds(struct linux_binprm *bprm);
 int security_bprm_check(struct linux_binprm *bprm);
 void security_bprm_committing_creds(struct linux_binprm *bprm);
 void security_bprm_committed_creds(struct linux_binprm *bprm);
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference);
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc);
+void security_fs_context_free(struct fs_context *fc);
+int security_fs_context_parse_param(struct fs_context *fc, struct fs_parameter *param);
+int security_fs_context_validate(struct fs_context *fc);
+int security_sb_get_tree(struct fs_context *fc);
+void security_sb_reconfigure(struct fs_context *fc);
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+			   unsigned int mnt_flags);
 int security_sb_alloc(struct super_block *sb);
 void security_sb_free(struct super_block *sb);
 int security_sb_copy_data(char *orig, size_t orig_size, char *copy);
@@ -526,6 +538,41 @@ static inline void security_bprm_committed_creds(struct linux_binprm *bprm)
 {
 }
 
+static inline int security_fs_context_alloc(struct fs_context *fc,
+					    struct dentry *reference)
+{
+	return 0;
+}
+static inline int security_fs_context_dup(struct fs_context *fc,
+					  struct fs_context *src_fc)
+{
+	return 0;
+}
+static inline void security_fs_context_free(struct fs_context *fc)
+{
+}
+static inline int security_fs_context_parse_param(struct fs_context *fc,
+						  struct fs_parameter *param)
+{
+	return 0;
+}
+static inline int security_fs_context_validate(struct fs_context *fc)
+{
+	return 0;
+}
+static inline int security_sb_get_tree(struct fs_context *fc)
+{
+	return 0;
+}
+static inline void security_sb_reconfigure(struct fs_context *fc)
+{
+}
+static inline int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+					 unsigned int mnt_flags)
+{
+	return 0;
+}
+
 static inline int security_sb_alloc(struct super_block *sb)
 {
 	return 0;
diff --git a/security/security.c b/security/security.c
index 5149c2cbe8a7..9bdc21ae645c 100644
--- a/security/security.c
+++ b/security/security.c
@@ -358,6 +358,47 @@ void security_bprm_committed_creds(struct linux_binprm *bprm)
 	call_void_hook(bprm_committed_creds, bprm);
 }
 
+int security_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+	return call_int_hook(fs_context_alloc, 0, fc, reference);
+}
+
+int security_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	return call_int_hook(fs_context_dup, 0, fc, src_fc);
+}
+
+void security_fs_context_free(struct fs_context *fc)
+{
+	call_void_hook(fs_context_free, fc);
+}
+
+int security_fs_context_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	return call_int_hook(fs_context_parse_param, 0, fc, param);
+}
+
+int security_fs_context_validate(struct fs_context *fc)
+{
+	return call_int_hook(fs_context_validate, 0, fc);
+}
+
+int security_sb_get_tree(struct fs_context *fc)
+{
+	return call_int_hook(sb_get_tree, 0, fc);
+}
+
+void security_sb_reconfigure(struct fs_context *fc)
+{
+	call_void_hook(sb_reconfigure, fc);
+}
+
+int security_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+			   unsigned int mnt_flags)
+{
+	return call_int_hook(sb_mountpoint, 0, fc, mountpoint, mnt_flags);
+}
+
 int security_sb_alloc(struct super_block *sb)
 {
 	return call_int_hook(sb_alloc_security, 0, sb);

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 09/33] selinux: Implement the new mount API LSM hooks [ver #11]
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
  2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
@ 2018-08-01 15:25 ` David Howells
  2018-08-01 15:25 ` [PATCH 10/33] smack: Implement filesystem context security " David Howells
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:25 UTC (permalink / raw)
  To: linux-security-module

Implement the new mount API LSM hooks for SELinux.  At some point the old
hooks will need to be removed.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paul Moore <paul@paul-moore.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: selinux at tycho.nsa.gov
cc: linux-security-module at vger.kernel.org
---

 security/selinux/hooks.c |  290 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 290 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index ef0428311a5c..9774d1f0e99f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -48,6 +48,8 @@
 #include <linux/fdtable.h>
 #include <linux/namei.h>
 #include <linux/mount.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
 #include <linux/netfilter_ipv4.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/tty.h>
@@ -446,6 +448,7 @@ enum {
 	Opt_rootcontext = 4,
 	Opt_labelsupport = 5,
 	Opt_nextmntopt = 6,
+	nr__selinux_params
 };
 
 #define NUM_SEL_MNT_OPTS	(Opt_nextmntopt - 1)
@@ -2974,6 +2977,285 @@ static int selinux_umount(struct vfsmount *mnt, int flags)
 				   FILESYSTEM__UNMOUNT, NULL);
 }
 
+/* fsopen mount context operations */
+
+static int selinux_fs_context_alloc(struct fs_context *fc,
+				    struct dentry *reference)
+{
+	struct security_mnt_opts *opts;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+
+	fc->security = opts;
+	return 0;
+}
+
+static int selinux_fs_context_dup(struct fs_context *fc,
+				  struct fs_context *src_fc)
+{
+	const struct security_mnt_opts *src = src_fc->security;
+	struct security_mnt_opts *opts;
+	int i, n;
+
+	opts = kzalloc(sizeof(*opts), GFP_KERNEL);
+	if (!opts)
+		return -ENOMEM;
+	fc->security = opts;
+
+	if (!src || !src->num_mnt_opts)
+		return 0;
+	n = opts->num_mnt_opts = src->num_mnt_opts;
+
+	if (src->mnt_opts) {
+		opts->mnt_opts = kcalloc(n, sizeof(char *), GFP_KERNEL);
+		if (!opts->mnt_opts)
+			return -ENOMEM;
+
+		for (i = 0; i < n; i++) {
+			if (src->mnt_opts[i]) {
+				opts->mnt_opts[i] = kstrdup(src->mnt_opts[i],
+							    GFP_KERNEL);
+				if (!opts->mnt_opts[i])
+					return -ENOMEM;
+			}
+		}
+	}
+
+	if (src->mnt_opts_flags) {
+		opts->mnt_opts_flags = kmemdup(src->mnt_opts_flags,
+					       n * sizeof(int), GFP_KERNEL);
+		if (!opts->mnt_opts_flags)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void selinux_fs_context_free(struct fs_context *fc)
+{
+	struct security_mnt_opts *opts = fc->security;
+
+	if (opts) {
+		security_free_mnt_opts(opts);
+		fc->security = NULL;
+	}
+}
+
+static const struct fs_parameter_spec selinux_param_specs[nr__selinux_params] = {
+	[Opt_context]		= { fs_param_is_string },
+	[Opt_defcontext]	= { fs_param_is_string },
+	[Opt_fscontext]		= { fs_param_is_string },
+	[Opt_labelsupport]	= { fs_param_takes_no_value },
+	[Opt_rootcontext]	= { fs_param_is_string },
+};
+
+static const struct constant_table selinux_param_keys[] = {
+	{ CONTEXT_STR,		Opt_context },
+	{ DEFCONTEXT_STR,	Opt_defcontext },
+	{ FSCONTEXT_STR,	Opt_fscontext },
+	{ ROOTCONTEXT_STR,	Opt_rootcontext },
+	{ LABELSUPP_STR,	Opt_labelsupport },
+};
+
+static const struct fs_parameter_description selinux_fs_parameters = {
+	.name		= "SELinux",
+	.nr_params	= nr__selinux_params,
+	.nr_keys	= ARRAY_SIZE(selinux_param_keys),
+	.keys		= selinux_param_keys,
+	.specs		= selinux_param_specs,
+	.ignore_unknown	= true,
+};
+
+static int selinux_fs_context_parse_param(struct fs_context *fc,
+					  struct fs_parameter *param)
+{
+	struct security_mnt_opts *opts = fc->security;
+	struct fs_parse_result result;
+	unsigned int have;
+	char **oo;
+	int ret, ctx, i, *of;
+
+	ret = fs_parse(fc, &selinux_fs_parameters, param, &result);
+	if (ret <= 0)
+		return ret; /* Note: 0 indicates no match */
+
+	have = 0;
+	for (i = 0; i < opts->num_mnt_opts; i++)
+		have |= 1 << opts->mnt_opts_flags[i];
+	if (have & (1 << result.key))
+		return -EINVAL;
+
+	switch (result.key) {
+	case Opt_context:
+		if (have & (1 << Opt_defcontext))
+			goto incompatible;
+		ctx = CONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_fscontext:
+		ctx = FSCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_rootcontext:
+		ctx = ROOTCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_defcontext:
+		if (have & (1 << Opt_context))
+			goto incompatible;
+		ctx = DEFCONTEXT_MNT;
+		goto copy_context_string;
+
+	case Opt_labelsupport:
+		return 1;
+
+	default:
+		return -EINVAL;
+	}
+
+copy_context_string:
+	if (opts->num_mnt_opts > 3)
+		return -EINVAL;
+
+	of = krealloc(opts->mnt_opts_flags,
+		      (opts->num_mnt_opts + 1) * sizeof(int), GFP_KERNEL);
+	if (!of)
+		return -ENOMEM;
+	of[opts->num_mnt_opts] = 0;
+	opts->mnt_opts_flags = of;
+
+	oo = krealloc(opts->mnt_opts,
+		      (opts->num_mnt_opts + 1) * sizeof(char *), GFP_KERNEL);
+	if (!oo)
+		return -ENOMEM;
+	oo[opts->num_mnt_opts] = NULL;
+	opts->mnt_opts = oo;
+
+	opts->mnt_opts[opts->num_mnt_opts] = param->string;
+	opts->mnt_opts_flags[opts->num_mnt_opts] = ctx;
+	opts->num_mnt_opts++;
+	param->string = NULL;
+	return 1;
+
+incompatible:
+	return -EINVAL;
+}
+
+/*
+ * Validate the security parameters supplied for a reconfiguration/remount
+ * event.
+ */
+static int selinux_validate_for_sb_reconfigure(struct fs_context *fc)
+{
+	struct super_block *sb = fc->root->d_sb;
+	struct superblock_security_struct *sbsec = sb->s_security;
+	struct security_mnt_opts *opts = fc->security;
+	int rc, i, *flags;
+	char **mount_options;
+
+	if (!(sbsec->flags & SE_SBINITIALIZED))
+		return 0;
+
+	mount_options = opts->mnt_opts;
+	flags = opts->mnt_opts_flags;
+
+	for (i = 0; i < opts->num_mnt_opts; i++) {
+		u32 sid;
+
+		if (flags[i] == SBLABEL_MNT)
+			continue;
+
+		rc = security_context_str_to_sid(&selinux_state, mount_options[i],
+						 &sid, GFP_KERNEL);
+		if (rc) {
+			pr_warn("SELinux: security_context_str_to_sid"
+				"(%s) failed for (dev %s, type %s) errno=%d\n",
+				mount_options[i], sb->s_id, sb->s_type->name, rc);
+			goto inval;
+		}
+
+		switch (flags[i]) {
+		case FSCONTEXT_MNT:
+			if (bad_option(sbsec, FSCONTEXT_MNT, sbsec->sid, sid))
+				goto bad_option;
+			break;
+		case CONTEXT_MNT:
+			if (bad_option(sbsec, CONTEXT_MNT, sbsec->mntpoint_sid, sid))
+				goto bad_option;
+			break;
+		case ROOTCONTEXT_MNT: {
+			struct inode_security_struct *root_isec;
+			root_isec = backing_inode_security(sb->s_root);
+
+			if (bad_option(sbsec, ROOTCONTEXT_MNT, root_isec->sid, sid))
+				goto bad_option;
+			break;
+		}
+		case DEFCONTEXT_MNT:
+			if (bad_option(sbsec, DEFCONTEXT_MNT, sbsec->def_sid, sid))
+				goto bad_option;
+			break;
+		default:
+			goto inval;
+		}
+	}
+
+	rc = 0;
+out:
+	return rc;
+
+bad_option:
+	pr_warn("SELinux: unable to change security options "
+		"during remount (dev %s, type=%s)\n",
+		sb->s_id, sb->s_type->name);
+inval:
+	rc = -EINVAL;
+	goto out;
+}
+
+/*
+ * Validate the security context assembled from the option data supplied to
+ * mount.
+ */
+static int selinux_fs_context_validate(struct fs_context *fc)
+{
+	if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)
+		return selinux_validate_for_sb_reconfigure(fc);
+	return 0;
+}
+
+/*
+ * Set the security context on a superblock.
+ */
+static int selinux_sb_get_tree(struct fs_context *fc)
+{
+	const struct cred *cred = current_cred();
+	struct common_audit_data ad;
+	int rc;
+
+	rc = selinux_set_mnt_opts(fc->root->d_sb, fc->security, 0, NULL);
+	if (rc)
+		return rc;
+
+	/* Allow all mounts performed by the kernel */
+	if (fc->purpose == FS_CONTEXT_FOR_KERNEL_MOUNT)
+		return 0;
+
+	ad.type = LSM_AUDIT_DATA_DENTRY;
+	ad.u.dentry = fc->root;
+	return superblock_has_perm(cred, fc->root->d_sb, FILESYSTEM__MOUNT, &ad);
+}
+
+static int selinux_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				 unsigned int mnt_flags)
+{
+	const struct cred *cred = current_cred();
+
+	return path_has_perm(cred, mountpoint, FILE__MOUNTON);
+}
+
 /* inode security operations */
 
 static int selinux_inode_alloc_security(struct inode *inode)
@@ -6906,6 +7188,14 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(bprm_committing_creds, selinux_bprm_committing_creds),
 	LSM_HOOK_INIT(bprm_committed_creds, selinux_bprm_committed_creds),
 
+	LSM_HOOK_INIT(fs_context_alloc, selinux_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, selinux_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, selinux_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_param, selinux_fs_context_parse_param),
+	LSM_HOOK_INIT(fs_context_validate, selinux_fs_context_validate),
+	LSM_HOOK_INIT(sb_get_tree, selinux_sb_get_tree),
+	LSM_HOOK_INIT(sb_mountpoint, selinux_sb_mountpoint),
+
 	LSM_HOOK_INIT(sb_alloc_security, selinux_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, selinux_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, selinux_sb_copy_data),

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 10/33] smack: Implement filesystem context security hooks [ver #11]
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
  2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
  2018-08-01 15:25 ` [PATCH 09/33] selinux: Implement the new mount API LSM hooks " David Howells
@ 2018-08-01 15:25 ` David Howells
  2018-08-01 15:25 ` [PATCH 11/33] apparmor: Implement security hooks for the new mount API " David Howells
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:25 UTC (permalink / raw)
  To: linux-security-module

Implement filesystem context security hooks for the smack LSM.

Question: Should the ->fs_context_parse_source() hook be implemented to
check the labels on any source devices specified?

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: linux-security-module at vger.kernel.org
---

 security/smack/smack.h     |   11 +
 security/smack/smack_lsm.c |  337 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 343 insertions(+), 5 deletions(-)

diff --git a/security/smack/smack.h b/security/smack/smack.h
index f7db791fb566..e6d54e829394 100644
--- a/security/smack/smack.h
+++ b/security/smack/smack.h
@@ -195,11 +195,12 @@ struct smack_known_list_elem {
 
 enum {
 	Opt_error = -1,
-	Opt_fsdefault = 1,
-	Opt_fsfloor = 2,
-	Opt_fshat = 3,
-	Opt_fsroot = 4,
-	Opt_fstransmute = 5,
+	Opt_fsdefault = 0,
+	Opt_fsfloor = 1,
+	Opt_fshat = 2,
+	Opt_fsroot = 3,
+	Opt_fstransmute = 4,
+        nr__smack_params
 };
 
 /*
diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 9811476c8441..339ac4d42785 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -42,6 +42,8 @@
 #include <linux/shm.h>
 #include <linux/binfmts.h>
 #include <linux/parser.h>
+#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
 #include "smack.h"
 
 #define TRANS_TRUE	"TRUE"
@@ -521,6 +523,334 @@ static int smack_syslog(int typefrom_file)
 	return rc;
 }
 
+/*
+ * Mount context operations
+ */
+
+struct smack_fs_context {
+	union {
+		struct {
+			char		*fsdefault;
+			char		*fsfloor;
+			char		*fshat;
+			char		*fsroot;
+			char		*fstransmute;
+		};
+		char			*ptrs[5];
+
+	};
+	struct superblock_smack		*sbsp;
+	struct inode_smack		*isp;
+	bool				transmute;
+};
+
+/**
+ * smack_fs_context_free - Free the security data from a filesystem context
+ * @fc: The filesystem context to be cleaned up.
+ */
+static void smack_fs_context_free(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	int i;
+
+	if (ctx) {
+		for (i = 0; i < ARRAY_SIZE(ctx->ptrs); i++)
+			kfree(ctx->ptrs[i]);
+		kfree(ctx->isp);
+		kfree(ctx->sbsp);
+		kfree(ctx);
+		fc->security = NULL;
+	}
+}
+
+/**
+ * smack_fs_context_alloc - Allocate security data for a filesystem context
+ * @fc: The filesystem context.
+ * @reference: Reference dentry (automount/reconfigure) or NULL
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_alloc(struct fs_context *fc,
+				  struct dentry *reference)
+{
+	struct smack_fs_context *ctx;
+	struct superblock_smack *sbsp;
+	struct inode_smack *isp;
+	struct smack_known *skp;
+
+	ctx = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+	if (!ctx)
+		goto nomem;
+	fc->security = ctx;
+
+	sbsp = kzalloc(sizeof(struct superblock_smack), GFP_KERNEL);
+	if (!sbsp)
+		goto nomem_free;
+	ctx->sbsp = sbsp;
+
+	isp = new_inode_smack(NULL);
+	if (!isp)
+		goto nomem_free;
+	ctx->isp = isp;
+
+	if (reference) {
+		if (reference->d_sb->s_security)
+			memcpy(sbsp, reference->d_sb->s_security, sizeof(*sbsp));
+	} else if (!smack_privileged(CAP_MAC_ADMIN)) {
+		/* Unprivileged mounts get root and default from the caller. */
+		skp = smk_of_current();
+		sbsp->smk_root = skp;
+		sbsp->smk_default = skp;
+	} else {
+		sbsp->smk_root = &smack_known_floor;
+		sbsp->smk_default = &smack_known_floor;
+		sbsp->smk_floor = &smack_known_floor;
+		sbsp->smk_hat = &smack_known_hat;
+		/* SMK_SB_INITIALIZED will be zero from kzalloc. */
+	}
+
+	return 0;
+
+nomem_free:
+	smack_fs_context_free(fc);
+nomem:
+	return -ENOMEM;
+}
+
+/**
+ * smack_fs_context_dup - Duplicate the security data on fs_context duplication
+ * @fc: The new filesystem context.
+ * @src_fc: The source filesystem context being duplicated.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_dup(struct fs_context *fc,
+				struct fs_context *src_fc)
+{
+	struct smack_fs_context *dst, *src = src_fc->security;
+	int i;
+
+	dst = kzalloc(sizeof(struct smack_fs_context), GFP_KERNEL);
+	if (!dst)
+		goto nomem;
+	fc->security = dst;
+
+	dst->sbsp = kmemdup(src->sbsp, sizeof(struct superblock_smack),
+			    GFP_KERNEL);
+	if (!dst->sbsp)
+		goto nomem_free;
+
+	for (i = 0; i < ARRAY_SIZE(dst->ptrs); i++) {
+		if (src->ptrs[i]) {
+			dst->ptrs[i] = kstrdup(src->ptrs[i], GFP_KERNEL);
+			if (!dst->ptrs[i])
+				goto nomem_free;
+		}
+	}
+
+	return 0;
+
+nomem_free:
+	smack_fs_context_free(fc);
+nomem:
+	return -ENOMEM;
+}
+
+static const struct fs_parameter_spec smack_param_specs[nr__smack_params] = {
+	[Opt_fsdefault]		= { fs_param_is_string },
+	[Opt_fsfloor]		= { fs_param_is_string },
+	[Opt_fshat]		= { fs_param_is_string },
+	[Opt_fsroot]		= { fs_param_is_string },
+	[Opt_fstransmute]	= { fs_param_is_string },
+};
+
+static const struct constant_table smack_param_keys[] = {
+	{ SMK_FSDEFAULT,	Opt_fsdefault },
+	{ SMK_FSFLOOR,		Opt_fsfloor },
+	{ SMK_FSHAT,		Opt_fshat },
+	{ SMK_FSROOT,		Opt_fsroot },
+	{ SMK_FSTRANS,		Opt_fstransmute },
+};
+
+static const struct fs_parameter_description smack_fs_parameters = {
+	.name		= "smack",
+	.nr_params	= nr__smack_params,
+	.nr_keys	= ARRAY_SIZE(smack_param_keys),
+	.keys		= smack_param_keys,
+	.specs		= smack_param_specs,
+};
+
+/**
+ * smack_fs_context_parse_param - Parse a single mount parameter
+ * @fc: The new filesystem context being constructed.
+ * @param: The parameter.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_parse_param(struct fs_context *fc,
+					struct fs_parameter *param)
+{
+	struct smack_fs_context *ctx = fc->security;
+	struct fs_parse_result result;
+	int ret;
+
+	/* Unprivileged mounts don't get to specify Smack values. */
+	if (!smack_privileged(CAP_MAC_ADMIN))
+		return -EPERM;
+
+	ret = fs_parse(fc, &smack_fs_parameters, param, &result);
+	if (ret <= 0)
+		return ret; /* Note: 0 indicates no match */
+
+	switch (result.key) {
+	case Opt_fsdefault:
+		if (ctx->fsdefault)
+			goto error_dup;
+		ctx->fsdefault = param->string;
+		if (!ctx->fsdefault)
+			goto error;
+		break;
+	case Opt_fsfloor:
+		if (ctx->fsfloor)
+			goto error_dup;
+		ctx->fsfloor = param->string;
+		if (!ctx->fsfloor)
+			goto error;
+		break;
+	case Opt_fshat:
+		if (ctx->fshat)
+			goto error_dup;
+		ctx->fshat = param->string;
+		if (!ctx->fshat)
+			goto error;
+		break;
+	case Opt_fsroot:
+		if (ctx->fsroot)
+			goto error_dup;
+		ctx->fsroot = param->string;
+		if (!ctx->fsroot)
+			goto error;
+		break;
+	case Opt_fstransmute:
+		if (ctx->fstransmute)
+			goto error_dup;
+		ctx->fstransmute = param->string;
+		if (!ctx->fstransmute)
+			goto error;
+		break;
+	default:
+		pr_warn("Smack:  unknown mount option\n");
+		goto error_inval;
+	}
+
+	param->string = NULL;
+	return 0;
+
+error_dup:
+	warnf(fc, "Smack: duplicate mount option\n");
+error_inval:
+	ret = -EINVAL;
+error:
+	return ret;
+}
+
+/**
+ * smack_fs_context_validate - Validate the filesystem context security data
+ * @fc: The filesystem context.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_fs_context_validate(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	struct superblock_smack *sbsp = ctx->sbsp;
+	struct inode_smack *isp = ctx->isp;
+	struct smack_known *skp;
+
+	if (ctx->fsdefault) {
+		skp = smk_import_entry(ctx->fsdefault, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_default = skp;
+	}
+
+	if (ctx->fsfloor) {
+		skp = smk_import_entry(ctx->fsfloor, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_floor = skp;
+	}
+
+	if (ctx->fshat) {
+		skp = smk_import_entry(ctx->fshat, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_hat = skp;
+	}
+
+	if (ctx->fsroot || ctx->fstransmute) {
+		skp = smk_import_entry(ctx->fstransmute ?: ctx->fsroot, 0);
+		if (IS_ERR(skp))
+			return PTR_ERR(skp);
+		sbsp->smk_root = skp;
+		ctx->transmute = !!ctx->fstransmute;
+	}
+
+	isp->smk_inode = sbsp->smk_root;
+	return 0;
+}
+
+/**
+ * smack_sb_get_tree - Assign the context to a newly created superblock
+ * @fc: The new filesystem context.
+ *
+ * Returns 0 on success or -ENOMEM on error.
+ */
+static int smack_sb_get_tree(struct fs_context *fc)
+{
+	struct smack_fs_context *ctx = fc->security;
+	struct superblock_smack *sbsp = ctx->sbsp;
+	struct dentry *root = fc->root;
+	struct inode *inode = d_backing_inode(root);
+	struct super_block *sb = root->d_sb;
+	struct inode_smack *isp;
+	bool transmute = ctx->transmute;
+
+	if (sb->s_security)
+		return 0;
+
+	if (!smack_privileged(CAP_MAC_ADMIN)) {
+		/*
+		 * For a handful of fs types with no user-controlled
+		 * backing store it's okay to trust security labels
+		 * in the filesystem. The rest are untrusted.
+		 */
+		if (fc->user_ns != &init_user_ns &&
+		    sb->s_magic != SYSFS_MAGIC && sb->s_magic != TMPFS_MAGIC &&
+		    sb->s_magic != RAMFS_MAGIC) {
+			transmute = true;
+			sbsp->smk_flags |= SMK_SB_UNTRUSTED;
+		}
+	}
+
+	sbsp->smk_flags |= SMK_SB_INITIALIZED;
+	sb->s_security = sbsp;
+	ctx->sbsp = NULL;
+
+	/* Initialize the root inode. */
+	isp = inode->i_security;
+	if (isp == NULL) {
+		isp = ctx->isp;
+		ctx->isp = NULL;
+		inode->i_security = isp;
+	} else
+		isp->smk_inode = sbsp->smk_root;
+
+	if (transmute)
+		isp->smk_flags |= SMK_INODE_TRANSMUTE;
+
+	return 0;
+}
 
 /*
  * Superblock Hooks.
@@ -4649,6 +4979,13 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(ptrace_traceme, smack_ptrace_traceme),
 	LSM_HOOK_INIT(syslog, smack_syslog),
 
+	LSM_HOOK_INIT(fs_context_alloc, smack_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, smack_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, smack_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_param, smack_fs_context_parse_param),
+	LSM_HOOK_INIT(fs_context_validate, smack_fs_context_validate),
+	LSM_HOOK_INIT(sb_get_tree, smack_sb_get_tree),
+
 	LSM_HOOK_INIT(sb_alloc_security, smack_sb_alloc_security),
 	LSM_HOOK_INIT(sb_free_security, smack_sb_free_security),
 	LSM_HOOK_INIT(sb_copy_data, smack_sb_copy_data),

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 11/33] apparmor: Implement security hooks for the new mount API [ver #11]
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
                   ` (2 preceding siblings ...)
  2018-08-01 15:25 ` [PATCH 10/33] smack: Implement filesystem context security " David Howells
@ 2018-08-01 15:25 ` David Howells
  2018-08-01 15:25 ` [PATCH 12/33] tomoyo: " David Howells
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:25 UTC (permalink / raw)
  To: linux-security-module

Implement hooks to check the creation of new mountpoints for AppArmor.

Unfortunately, the DFA evaluation puts the option data in last, after the
details of the mountpoint, so we have to cache the mount options in the
fs_context using those hooks till we get to the new mountpoint hook.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: John Johansen <john.johansen@canonical.com>
cc: apparmor at lists.ubuntu.com
cc: linux-security-module at vger.kernel.org
---

 security/apparmor/include/mount.h |   11 +++-
 security/apparmor/lsm.c           |  107 +++++++++++++++++++++++++++++++++++++
 security/apparmor/mount.c         |   46 ++++++++++++++++
 3 files changed, 162 insertions(+), 2 deletions(-)

diff --git a/security/apparmor/include/mount.h b/security/apparmor/include/mount.h
index 25d6067fa6ef..0441bfae30fa 100644
--- a/security/apparmor/include/mount.h
+++ b/security/apparmor/include/mount.h
@@ -16,6 +16,7 @@
 
 #include <linux/fs.h>
 #include <linux/path.h>
+#include <linux/fs_context.h>
 
 #include "domain.h"
 #include "policy.h"
@@ -27,7 +28,13 @@
 #define AA_AUDIT_DATA		0x40
 #define AA_MNT_CONT_MATCH	0x40
 
-#define AA_MS_IGNORE_MASK (MS_KERNMOUNT | MS_NOSEC | MS_ACTIVE | MS_BORN)
+#define AA_SB_IGNORE_MASK (SB_KERNMOUNT | SB_NOSEC | SB_ACTIVE | SB_BORN)
+
+struct apparmor_fs_context {
+	struct fs_context	fc;
+	char			*saved_options;
+	size_t			saved_size;
+};
 
 int aa_remount(struct aa_label *label, const struct path *path,
 	       unsigned long flags, void *data);
@@ -45,6 +52,8 @@ int aa_move_mount(struct aa_label *label, const struct path *path,
 int aa_new_mount(struct aa_label *label, const char *dev_name,
 		 const struct path *path, const char *type, unsigned long flags,
 		 void *data);
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+		    const struct path *mountpoint);
 
 int aa_umount(struct aa_label *label, struct vfsmount *mnt, int flags);
 
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 84e644ce3583..5555947aed97 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -520,6 +520,105 @@ static int apparmor_file_mprotect(struct vm_area_struct *vma,
 			   !(vma->vm_flags & VM_SHARED) ? MAP_PRIVATE : 0);
 }
 
+static int apparmor_fs_context_alloc(struct fs_context *fc, struct dentry *reference)
+{
+	struct apparmor_fs_context *afc;
+
+	afc = kzalloc(sizeof(*afc), GFP_KERNEL);
+	if (!afc)
+		return -ENOMEM;
+
+	fc->security = afc;
+	return 0;
+}
+
+static int apparmor_fs_context_dup(struct fs_context *fc, struct fs_context *src_fc)
+{
+	fc->security = NULL;
+	return 0;
+}
+
+static void apparmor_fs_context_free(struct fs_context *fc)
+{
+	struct apparmor_fs_context *afc = fc->security;
+
+	if (afc) {
+		kfree(afc->saved_options);
+		kfree(afc);
+	}
+}
+
+/*
+ * As a temporary hack, we buffer all the options.  The problem is that we need
+ * to pass them to the DFA evaluator *after* mount point parameters, which
+ * means deferring the entire check to the sb_mountpoint hook.
+ */
+static int apparmor_fs_context_parse_param(struct fs_context *fc,
+					   struct fs_parameter *param)
+{
+	struct apparmor_fs_context *afc = fc->security;
+	const char *value;
+	size_t space = 0, k_len = strlen(param->key), len = k_len, v_len;
+	char *p, *q;
+
+	if (afc->saved_size > 0)
+		space = 1;
+
+	switch (param->type) {
+	case fs_value_is_string:
+		value = param->string;
+		v_len = param->size;
+		len += 1 + v_len;
+		break;
+	case fs_value_is_filename:
+	case fs_value_is_filename_empty: {
+		value = param->name->name;
+		v_len = param->size;
+		len += 1 + v_len;
+		break;
+	}
+	default:
+		value = NULL;
+		v_len = 0;
+		break;
+	}
+
+	p = krealloc(afc->saved_options, afc->saved_size + space + len + 1,
+		     GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	q = p + afc->saved_size;
+	if (q != p)
+		*q++ = ' ';
+	memcpy(q, param->key, k_len);
+	q += k_len;
+	if (value) {
+		*q++ = '=';
+		memcpy(q, value, v_len);
+		q += v_len;
+	}
+	*q = 0;
+
+	afc->saved_options = p;
+	afc->saved_size += 1 + len;
+	return 0;
+}
+
+static int apparmor_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				  unsigned int mnt_flags)
+{
+	struct aa_label *label;
+	int error = 0;
+
+	label = __begin_current_label_crit_section();
+	if (!unconfined(label))
+		error = aa_new_mount_fc(label, fc, mountpoint);
+	__end_current_label_crit_section(label);
+
+	return error;
+}
+
 static int apparmor_sb_mount(const char *dev_name, const struct path *path,
 			     const char *type, unsigned long flags,
 			     void *data, size_t data_size)
@@ -531,7 +630,7 @@ static int apparmor_sb_mount(const char *dev_name, const struct path *path,
 	if ((flags & MS_MGC_MSK) == MS_MGC_VAL)
 		flags &= ~MS_MGC_MSK;
 
-	flags &= ~AA_MS_IGNORE_MASK;
+	flags &= ~AA_SB_IGNORE_MASK;
 
 	label = __begin_current_label_crit_section();
 	if (!unconfined(label)) {
@@ -1134,6 +1233,12 @@ static struct security_hook_list apparmor_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(capget, apparmor_capget),
 	LSM_HOOK_INIT(capable, apparmor_capable),
 
+	LSM_HOOK_INIT(fs_context_alloc, apparmor_fs_context_alloc),
+	LSM_HOOK_INIT(fs_context_dup, apparmor_fs_context_dup),
+	LSM_HOOK_INIT(fs_context_free, apparmor_fs_context_free),
+	LSM_HOOK_INIT(fs_context_parse_param, apparmor_fs_context_parse_param),
+	LSM_HOOK_INIT(sb_mountpoint, apparmor_sb_mountpoint),
+
 	LSM_HOOK_INIT(sb_mount, apparmor_sb_mount),
 	LSM_HOOK_INIT(sb_umount, apparmor_sb_umount),
 	LSM_HOOK_INIT(sb_pivotroot, apparmor_sb_pivotroot),
diff --git a/security/apparmor/mount.c b/security/apparmor/mount.c
index 8c3787399356..3c95fffb76ac 100644
--- a/security/apparmor/mount.c
+++ b/security/apparmor/mount.c
@@ -554,6 +554,52 @@ int aa_new_mount(struct aa_label *label, const char *dev_name,
 	return error;
 }
 
+int aa_new_mount_fc(struct aa_label *label, struct fs_context *fc,
+		    const struct path *mountpoint)
+{
+	struct apparmor_fs_context *afc = fc->security;
+	struct aa_profile *profile;
+	char *buffer = NULL, *dev_buffer = NULL;
+	bool binary;
+	int error;
+	struct path tmp_path, *dev_path = NULL;
+
+	AA_BUG(!label);
+	AA_BUG(!mountpoint);
+
+	binary = fc->fs_type->fs_flags & FS_BINARY_MOUNTDATA;
+
+	if (fc->fs_type->fs_flags & FS_REQUIRES_DEV) {
+		if (!fc->source)
+			return -ENOENT;
+
+		error = kern_path(fc->source, LOOKUP_FOLLOW, &tmp_path);
+		if (error)
+			return error;
+		dev_path = &tmp_path;
+	}
+
+	get_buffers(buffer, dev_buffer);
+	if (dev_path) {
+		error = fn_for_each_confined(label, profile,
+			match_mnt(profile, mountpoint, buffer, dev_path, dev_buffer,
+				  fc->fs_type->name,
+				  fc->sb_flags & ~AA_SB_IGNORE_MASK,
+				  afc->saved_options, binary));
+	} else {
+		error = fn_for_each_confined(label, profile,
+			match_mnt_path_str(profile, mountpoint, buffer,
+					   fc->source, fc->fs_type->name,
+					   fc->sb_flags & ~AA_SB_IGNORE_MASK,
+					   afc->saved_options, binary, NULL));
+	}
+	put_buffers(buffer, dev_buffer);
+	if (dev_path)
+		path_put(dev_path);
+
+	return error;
+}
+
 static int profile_umount(struct aa_profile *profile, struct path *path,
 			  char *buffer)
 {

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 12/33] tomoyo: Implement security hooks for the new mount API [ver #11]
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
                   ` (3 preceding siblings ...)
  2018-08-01 15:25 ` [PATCH 11/33] apparmor: Implement security hooks for the new mount API " David Howells
@ 2018-08-01 15:25 ` David Howells
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 15:25 UTC (permalink / raw)
  To: linux-security-module

Implement the security hook to check the creation of a new mountpoint for
Tomoyo.

As far as I can tell, Tomoyo doesn't make use of the mount data or parse
any mount options, so I haven't implemented any of the fs_context hooks for
it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
cc: tomoyo-dev-en at lists.sourceforge.jp
cc: linux-security-module at vger.kernel.org
---

 security/tomoyo/common.h |    3 +++
 security/tomoyo/mount.c  |   45 +++++++++++++++++++++++++++++++++++++++++++++
 security/tomoyo/tomoyo.c |   15 +++++++++++++++
 3 files changed, 63 insertions(+)

diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 539bcdd30bb8..e637ce73f7f9 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -971,6 +971,9 @@ int tomoyo_init_request_info(struct tomoyo_request_info *r,
 			     const u8 index);
 int tomoyo_mkdev_perm(const u8 operation, const struct path *path,
 		      const unsigned int mode, unsigned int dev);
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+			       const struct path *mountpoint,
+			       unsigned int mnt_flags);
 int tomoyo_mount_permission(const char *dev_name, const struct path *path,
 			    const char *type, unsigned long flags,
 			    void *data_page);
diff --git a/security/tomoyo/mount.c b/security/tomoyo/mount.c
index 7dc7f59b7dde..9ec84ab6f5e1 100644
--- a/security/tomoyo/mount.c
+++ b/security/tomoyo/mount.c
@@ -6,6 +6,7 @@
  */
 
 #include <linux/slab.h>
+#include <linux/fs_context.h>
 #include <uapi/linux/mount.h>
 #include "common.h"
 
@@ -236,3 +237,47 @@ int tomoyo_mount_permission(const char *dev_name, const struct path *path,
 	tomoyo_read_unlock(idx);
 	return error;
 }
+
+/**
+ * tomoyo_mount_permission_fc - Check permission to create a new mount.
+ * @fc:		Context describing the object to be mounted.
+ * @mountpoint:	The target object to mount on.
+ * @mnt:	The MNT_* flags to be set on the mountpoint.
+ *
+ * Check the permission to create a mount of the object described in @fc.  Note
+ * that the source object may be a newly created superblock or may be an
+ * existing one picked from the filesystem (bind mount).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+int tomoyo_mount_permission_fc(struct fs_context *fc,
+			       const struct path *mountpoint,
+			       unsigned int mnt_flags)
+{
+	struct tomoyo_request_info r;
+	unsigned int ms_flags = 0;
+	int error;
+	int idx;
+
+	if (tomoyo_init_request_info(&r, NULL, TOMOYO_MAC_FILE_MOUNT) ==
+	    TOMOYO_CONFIG_DISABLED)
+		return 0;
+
+	/* Convert MNT_* flags to MS_* equivalents. */
+	if (mnt_flags & MNT_NOSUID)	ms_flags |= MS_NOSUID;
+	if (mnt_flags & MNT_NODEV)	ms_flags |= MS_NODEV;
+	if (mnt_flags & MNT_NOEXEC)	ms_flags |= MS_NOEXEC;
+	if (mnt_flags & MNT_NOATIME)	ms_flags |= MS_NOATIME;
+	if (mnt_flags & MNT_NODIRATIME)	ms_flags |= MS_NODIRATIME;
+	if (mnt_flags & MNT_RELATIME)	ms_flags |= MS_RELATIME;
+	if (mnt_flags & MNT_READONLY)	ms_flags |= MS_RDONLY;
+
+	idx = tomoyo_read_lock();
+	/* TODO: There may be multiple sources; for the moment, just pick the
+	 * first if there is one.
+	 */
+	error = tomoyo_mount_acl(&r, fc->source, mountpoint, fc->fs_type->name,
+				 ms_flags);
+	tomoyo_read_unlock(idx);
+	return error;
+}
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index e5e349392e7b..c3a0ae4fa7ce 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -391,6 +391,20 @@ static int tomoyo_path_chroot(const struct path *path)
 	return tomoyo_path_perm(TOMOYO_TYPE_CHROOT, path, NULL);
 }
 
+/**
+ * tomoyo_sb_mount - Target for security_sb_mountpoint().
+ * @fc:		Context describing the object to be mounted.
+ * @mountpoint:	The target object to mount on.
+ * @mnt_flags:	Mountpoint specific options (as MNT_* flags).
+ *
+ * Returns 0 on success, negative value otherwise.
+ */
+static int tomoyo_sb_mountpoint(struct fs_context *fc, struct path *mountpoint,
+				unsigned int mnt_flags)
+{
+	return tomoyo_mount_permission_fc(fc, mountpoint, mnt_flags);
+}
+
 /**
  * tomoyo_sb_mount - Target for security_sb_mount().
  *
@@ -521,6 +535,7 @@ static struct security_hook_list tomoyo_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(path_chmod, tomoyo_path_chmod),
 	LSM_HOOK_INIT(path_chown, tomoyo_path_chown),
 	LSM_HOOK_INIT(path_chroot, tomoyo_path_chroot),
+	LSM_HOOK_INIT(sb_mountpoint, tomoyo_sb_mountpoint),
 	LSM_HOOK_INIT(sb_mount, tomoyo_sb_mount),
 	LSM_HOOK_INIT(sb_umount, tomoyo_sb_umount),
 	LSM_HOOK_INIT(sb_pivotroot, tomoyo_sb_pivotroot),

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 08/33] vfs: Add LSM hooks for the new mount API [ver #11]
  2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
@ 2018-08-01 20:50   ` James Morris
  2018-08-01 22:53   ` David Howells
  1 sibling, 0 replies; 57+ messages in thread
From: James Morris @ 2018-08-01 20:50 UTC (permalink / raw)
  To: linux-security-module

On Wed, 1 Aug 2018, David Howells wrote:

>  (2) A hook to snoop source specifications.  


What are source specifications?


-- 
James Morris
<jmorris@namei.org>

--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 08/33] vfs: Add LSM hooks for the new mount API [ver #11]
  2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
  2018-08-01 20:50   ` James Morris
@ 2018-08-01 22:53   ` David Howells
  1 sibling, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-01 22:53 UTC (permalink / raw)
  To: linux-security-module

James Morris <jmorris@namei.org> wrote:

> >  (2) A hook to snoop source specifications.  
> 
> 
> What are source specifications?

"/dev/sda1" or "my.nfs.server:/foo/bar".

Actually, this hook is now gone.  Source specification is done by way of a
parameter with key of "source" and this can be specified multiple times if the
filesystem supports it (bcachefs for example).

David
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
                   ` (4 preceding siblings ...)
  2018-08-01 15:25 ` [PATCH 12/33] tomoyo: " David Howells
@ 2018-08-10 14:05 ` Eric W. Biederman
  2018-08-10 14:36   ` Andy Lutomirski
                     ` (3 more replies)
  2018-08-10 15:11 ` David Howells
  2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
  7 siblings, 4 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-10 14:05 UTC (permalink / raw)
  To: linux-security-module


There is a serious problem with mount options today that fsopen does not
address.  The problem is that mount options are ignored for block based
filesystems, and any other type of filesystem that follows the same
pattern.

The script below demonstrates this bug.  Showing this bug can cause the
ext4 "acl" "quota" and "user_xattr" options to be silently ignored.

fsopen has my nack until it addresses this issue.

I don't know if we can fix this in the context of sys_mount.  But we if
we are redoing the option parsing of how we mount filesystems this needs
to be fixed before we start worrying about bug compatibility.

Hopefully this report is simple and clear enough that we can at least
agree on the problem.

Eric

# cat ~/bin/bdev-loop0.sh
#!/bin/sh
set -x
set -e

LOOP=loop0

dd if=/dev/zero bs=1024 count=1048576 of=$LOOP-file
losetup /dev/$LOOP $LOOP-file
mkfs.ext4 /dev/$LOOP

mkdir $LOOP-noacl-noquota-nouser_xattr
mount -t ext4 /dev/$LOOP -o "noacl,noquota,nouser_xattr" $LOOP-noacl-noquota-nouser_xattr

mkdir $LOOP-acl-quota-user_xattr
mount -t ext4 /dev/$LOOP  -o "acl,quota,user_xattr" $LOOP-acl-quota-user_xattr

cat /proc/mounts | grep loop0


root at finagle:~# ~/bin/bdev-loop0.sh
+ set -e
+ LOOP=loop0
+ dd if=/dev/zero bs=1024 count=1048576 of=loop0-file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 4.37645 s, 245 MB/s
+ losetup /dev/loop0 loop0-file
+ mkfs.ext4 /dev/loop0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376

Writing inode tables: done                            
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
+ mkdir loop0-noacl-noquota-nouser_xattr
+ mount -t ext4 /dev/loop0 -o noacl,noquota,nouser_xattr loop0-noacl-noquota-nouser_xattr
+ mkdir loop0-acl-quota-user_xattr
+ mount -t ext4 /dev/loop0 -o acl,quota,user_xattr loop0-acl-quota-user_xattr
+ + grep loop0
cat /proc/mounts
/dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
/dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
@ 2018-08-10 14:36   ` Andy Lutomirski
  2018-08-10 15:17     ` Eric W. Biederman
  2018-08-10 15:24     ` Al Viro
  2018-08-10 15:11   ` Tetsuo Handa
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 57+ messages in thread
From: Andy Lutomirski @ 2018-08-10 14:36 UTC (permalink / raw)
  To: linux-security-module



> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 

> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0

To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?

For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
                   ` (5 preceding siblings ...)
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
@ 2018-08-10 15:11 ` David Howells
  2018-08-10 15:39   ` Theodore Y. Ts'o
                     ` (3 more replies)
  2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
  7 siblings, 4 replies; 57+ messages in thread
From: David Howells @ 2018-08-10 15:11 UTC (permalink / raw)
  To: linux-security-module

Eric W. Biederman <ebiederm@xmission.com> wrote:

> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.

Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
*else*, I'm working up a set of additional patches to give userspace the
option of whether they want no sharing; sharing, but only with exactly the
same parameters; or to ignore the parameter differences and just accept
sharing of what's already already mounted (ie. the current behaviour).

The second option, however, is not trivial as it needs to compare the fs
contexts, including the LSM parameters.  To make that work, I really need to
remove the old security_mnt_opts stuff - which means I need to port btrfs to
the new context stuff.

We discussed this yesterday, and I proposed a solution, and I'm working on it.

Yes, I agree it would be nice to have, but it *doesn't* really need supporting
right this minute, since what I have now oughtn't to break the current
behaviour.

David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
  2018-08-10 14:36   ` Andy Lutomirski
@ 2018-08-10 15:11   ` Tetsuo Handa
  2018-08-10 15:13   ` David Howells
  2018-08-10 15:16   ` Al Viro
  3 siblings, 0 replies; 57+ messages in thread
From: Tetsuo Handa @ 2018-08-10 15:11 UTC (permalink / raw)
  To: linux-security-module

On 2018/08/10 23:05, Eric W. Biederman wrote:
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 
> The script below demonstrates this bug.  Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
> 
> fsopen has my nack until it addresses this issue.
> 
> I don't know if we can fix this in the context of sys_mount.  But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
> 
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.
> 
> Eric

This might be related to a problem that syzbot is failing to reproduce a problem.

  https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ

  syzbot found a reproducer, and the reproducer was working until next-20180803.
  But the reproducer is failing to reproduce this problem in next-20180806 despite
  there is no mm related change between next-20180803 and next-20180806.

  Therefore, I suspect that the reproducer is no longer working as intended. And
  there was parser change (David Howells' patch) between next-20180803 and next-20180806.

I'm waiting for response from David Howells...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
  2018-08-10 14:36   ` Andy Lutomirski
  2018-08-10 15:11   ` Tetsuo Handa
@ 2018-08-10 15:13   ` David Howells
  2018-08-10 15:16   ` Al Viro
  3 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-10 15:13 UTC (permalink / raw)
  To: linux-security-module

Andy Lutomirski <luto@amacapital.net> wrote:

> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> 
> To make sure I understand correctly: the problem is that the second mount
> ignored the options because the device was already mounted, right?
> 
> For the new API, I think the only remotely sane approach is to refuse to
> mount or init or whatever you call it an already mounted bdev. If user code
> genuinely needs to bind-mount an existing mount that is known only by its
> bdev, we can add a specific API just for that.

I'm adding some flags to fsopen() to allow userspace to say whether it wants
no sharing, same parameters-only sharing or anything-goes sharing (as now).

I'm also adding a flag whereby userspace can forbid anyone else from sharing a
new superblock it has just set up.

David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
                     ` (2 preceding siblings ...)
  2018-08-10 15:13   ` David Howells
@ 2018-08-10 15:16   ` Al Viro
  2018-08-11  1:05     ` Eric W. Biederman
  3 siblings, 1 reply; 57+ messages in thread
From: Al Viro @ 2018-08-10 15:16 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
> 
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
> 
> The script below demonstrates this bug.  Showing this bug can cause the
> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
> 
> fsopen has my nack until it addresses this issue.
> 
> I don't know if we can fix this in the context of sys_mount.  But we if
> we are redoing the option parsing of how we mount filesystems this needs
> to be fixed before we start worrying about bug compatibility.
> 
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.

Sure, it is simple.  So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
would get passed to filesystems, so that Eric would be able to implement
his mount(2)-incompatible behaviour at leisure, without worrying about
compatibility issues.

Does that address your complaint?  Because one thing we are not going
to do is changing mount(2) behaviour.  Reason: userland-visible
behaviour of hell knows how many local scripts.  Another thing that
is flat-out not feasible is some kind of blanket "compare options"
stuff; it *can* be done as helpers to be used by filesystem when
it sees that new flag, but it's simply not going to work at the
fs-independent level.  Trivial example with the same ext4:
mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
ext4 can tell that these are the same.  syscall itself has no
clue.  What's more, it's not just explicitly spelled default
options - it's the stuff that has more than one form.  And while
we are at it, the things like two NFS mounts of different trees
from the same server; they might or might not get the same superblock.
Depending upon the options.

Convenience helper that would allow ext4 to compare options and reject
the incompatible mount?  Not sure how much ext4-specific knowledge
would have to go in it, but if you can come up with one - more power
to you.  But the decision to use it *must* be ext4-specific.  Because
for e.g. NFS such thing as -o fsid=..., while certainly a part of
options, has a very different meaning - it's "use a separate fs instance"
(and let the server deal with coherency issues on its end).

Decision to use sget() (and the way it's used) is up to filesystem.
We *can't* lift that into syscall.  Not without breaking the fuck out
of existing behaviour.

Having something like a second callback for mount_bdev() that would
be called when we'd found an existing instance for the same block
device?  Sure, no problem.  Having a helper for doing such comparison
that would work in enough cases to bother, so that different fs
could avoid boilerplate in that callback?  Again, more power to you.

But I don't see what the hell does that have to the syscall interface.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:36   ` Andy Lutomirski
@ 2018-08-10 15:17     ` Eric W. Biederman
  2018-08-10 15:24     ` Al Viro
  1 sibling, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-10 15:17 UTC (permalink / raw)
  To: linux-security-module

Andy Lutomirski <luto@amacapital.net> writes:

>> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> 
>> 
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>> 
>
>> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second
> mount ignored the options because the device was already mounted,
> right?

Yes.

> For the new API, I think the only remotely sane approach is to refuse
> to mount or init or whatever you call it an already mounted bdev. If
> user code genuinely needs to bind-mount an existing mount that is
> known only by its bdev, we can add a specific API just for that.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 14:36   ` Andy Lutomirski
  2018-08-10 15:17     ` Eric W. Biederman
@ 2018-08-10 15:24     ` Al Viro
  1 sibling, 0 replies; 57+ messages in thread
From: Al Viro @ 2018-08-10 15:24 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 07:36:17AM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> > 
> > 
> > There is a serious problem with mount options today that fsopen does not
> > address.  The problem is that mount options are ignored for block based
> > filesystems, and any other type of filesystem that follows the same
> > pattern.
> > 
> 
> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> 
> To make sure I understand correctly: the problem is that the second mount ignored the options because the device was already mounted, right?
> 
> For the new API, I think the only remotely sane approach is to refuse to mount or init or whatever you call it an already mounted bdev. If user code genuinely needs to bind-mount an existing mount that is known only by its bdev, we can add a specific API just for that.

First of all, that does NOT belong anywhere other than fs itself.
Example: NFS.  Not every attempt to mount something leads to creation
of new fs instance; moreover, whether it will or not can't be predicted
in general.

PS: for pity sake, fix your MUA; 270-character lines are way over the
top.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:11 ` David Howells
@ 2018-08-10 15:39   ` Theodore Y. Ts'o
  2018-08-10 15:55     ` Casey Schaufler
                       ` (2 more replies)
  2018-08-10 15:53   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 15:39 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
> 
> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already already mounted (ie. the current behaviour).

But there's no way to support "no sharing", at least not in the
general case.  A file system can only be mounted once, and without
file system support, there's no way for a file system to be mounted
with the bsddf or minixdf mount simultaneously.

Even *with* file system support, there's no way today for the VFS to
keep track of whether a pathname resolution came through one
mountpoint or another, so I can't do something like this:

	mount /dev/sdXX -o casefold /android-data
	mount /dev/sdXX -o nocasefold /android-data-2

Which is a pity, since if we could we could much more easily get rid
of the horror which is Android's wrapfs...

So if the file system has been mounted with one set of mount options,
and you want to try to mount it with a conflicting set of mount
options and you don't want it to silently ignore the mount options,
the *only* thing we can today is to refuse the mount and return an
error.  

I'm not sure Eric would really consider that an improvement for the
container use case....

						- Ted

P.S.  And as Al has pointed out, this would require special, per-file
system support to determine whether the mount options are conflicting
or not....

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:11 ` David Howells
  2018-08-10 15:39   ` Theodore Y. Ts'o
@ 2018-08-10 15:53   ` David Howells
  2018-08-10 16:14     ` Theodore Y. Ts'o
  2018-08-11  1:19   ` Eric W. Biederman
  2018-08-11  7:29   ` David Howells
  3 siblings, 1 reply; 57+ messages in thread
From: David Howells @ 2018-08-10 15:53 UTC (permalink / raw)
  To: linux-security-module

Theodore Y. Ts'o <tytso@mit.edu> wrote:

> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:

Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

However, the case folding stuff - is that a superblockism of a mountpointism?

> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  

With fsopen() there is the option to have the filesystem and the LSM attempt
to compare the non-key[*] mount options and reject the attempt to share if
they differ in any way.

David


[*] sget lookup keys, that is.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:39   ` Theodore Y. Ts'o
@ 2018-08-10 15:55     ` Casey Schaufler
  2018-08-10 16:11     ` David Howells
  2018-08-10 18:00     ` Eric W. Biederman
  2 siblings, 0 replies; 57+ messages in thread
From: Casey Schaufler @ 2018-08-10 15:55 UTC (permalink / raw)
  To: linux-security-module

On 8/10/2018 8:39 AM, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already already mounted (ie. the current behaviour).
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....
>
> 						- Ted
>
> P.S.  And as Al has pointed out, this would require special, per-file
> system support to determine whether the mount options are conflicting
> or not....

This extends to LSMs that support mount options (SELinux and Smack)
as well. 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:39   ` Theodore Y. Ts'o
  2018-08-10 15:55     ` Casey Schaufler
@ 2018-08-10 16:11     ` David Howells
  2018-08-10 18:00     ` Eric W. Biederman
  2 siblings, 0 replies; 57+ messages in thread
From: David Howells @ 2018-08-10 16:11 UTC (permalink / raw)
  To: linux-security-module

Casey Schaufler <casey@schaufler-ca.com> wrote:

> > P.S.  And as Al has pointed out, this would require special, per-file
> > system support to determine whether the mount options are conflicting
> > or not....
> 
> This extends to LSMs that support mount options (SELinux and Smack)
> as well. 

Yes.  I'm doing that.

David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:53   ` David Howells
@ 2018-08-10 16:14     ` Theodore Y. Ts'o
  2018-08-10 20:06       ` Andy Lutomirski
  2018-08-11  0:28       ` Eric W. Biederman
  0 siblings, 2 replies; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 16:14 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
> > Even *with* file system support, there's no way today for the VFS to
> > keep track of whether a pathname resolution came through one
> > mountpoint or another, so I can't do something like this:
> 
> Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

Well, yes, and we do use this as a hack to make read-only bind mounts
work.  But that's done as a special case, and it's for permissions
checking only.

The big problem is that there is single dentry cache object regardless
of which mount point was used to access it.  So that makes it
impossible to support case folding as a mount-pointism.

> 
> However, the case folding stuff - is that a superblockism of a mountpointism?

It's a superblock-ism.  As far as I know the *only* thing that we can
support as a mount-pointism is the ro flag, and that's handled as a
special case, and only if the original superblock was mounted
read/write.  ey That was my point; aside from the ro flag, we can't
support any other mount options as a per-mount point thing, so the
only thing we can do is to fail the mount if there are conflicting
mount options.  And I'm not really sure it helps the container use
case, since the whole point is they want their "guest" to be able to
blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
fact that in some other container, someone had run "mount /dev/sda1 -o
xattr /mnt".  But having the second mount fail because of conflicting
mount option breaks the illusion that containers are functionally as
rich as VM's.

So before you put in lots of work to support rejecting the attmpted
mount if the mount options conflict, are we sure people will actually
find this to be useful?  Because it's not only fsopen() work for you,
but each file system is going to have to implement new functions to
answer the question "are these mount options conflicting or not?".
Are we sure it's worth the effort?

					- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:39   ` Theodore Y. Ts'o
  2018-08-10 15:55     ` Casey Schaufler
  2018-08-10 16:11     ` David Howells
@ 2018-08-10 18:00     ` Eric W. Biederman
  2 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-10 18:00 UTC (permalink / raw)
  To: linux-security-module

"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> 
>> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
>> *else*, I'm working up a set of additional patches to give userspace the
>> option of whether they want no sharing; sharing, but only with exactly the
>> same parameters; or to ignore the parameter differences and just accept
>> sharing of what's already already mounted (ie. the current behaviour).
>
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with the bsddf or minixdf mount simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can today is to refuse the mount and return an
> error.  
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....

I think I would consider it an improvement.  I keep running into cases
where the mount options differed and something was done silently and
that causes problems.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 16:14     ` Theodore Y. Ts'o
@ 2018-08-10 20:06       ` Andy Lutomirski
  2018-08-10 20:46         ` Theodore Y. Ts'o
  2018-08-13 16:35         ` Alan Cox
  2018-08-11  0:28       ` Eric W. Biederman
  1 sibling, 2 replies; 57+ messages in thread
From: Andy Lutomirski @ 2018-08-10 20:06 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 9:14 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt".  But having the second mount fail because of conflicting
> mount option breaks the illusion that containers are functionally as
> rich as VM's.

If the same block device is visible, with rw access, in two different
containers, I don't see any anything good can happen.  Sure, with the
current somewhat erratic semantics of mount(2), something kind of sort
of reasonable happens if they both mount it.  But if one or both of
them try to use, say, tune2fs or fsck, it's not going to go well.  And
a situation where they mount with different options and the result
depends on the order of the mounts is just plain bad.

I see four sane ways to deal with this:

1. Don't put the block device in the container at all.  The container
manager mounts it.

2. Use seccomp or a similar mechanism to intercept and emulate the
mount request.

3. Teach the filesystem driver to do something sensible.  This will
inherently be per-fs, and probably involves some serious magic or
allowing filesystem-specific vfsmount options.

4. Introduce a concept of a special kind of fake block device that
refers to an existing superblock, doesn't allow direct read or write,
and does the right thing when mounted.  Not obviously worth the
effort.

It seems to me that the current approach mostly involves crossing our fingers.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 20:06       ` Andy Lutomirski
@ 2018-08-10 20:46         ` Theodore Y. Ts'o
  2018-08-10 22:12           ` Darrick J. Wong
  2018-08-13 16:35         ` Alan Cox
  1 sibling, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 20:46 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> If the same block device is visible, with rw access, in two different
> containers, I don't see any anything good can happen.

It's worse than that.  I've fixed a lot of bugs which cause the kernel
to crash, and a few that might be levered into a privilege escalationh
attack, when you mount a maliciously corrupted file system using ext4.
I'm told told the security researcher filed similar reports with the
XFS community, and he was told, "that's what metadata checksums are
for; go away".  Given how much time it takes to work with these
security researchers, I don't blame them.

But in light of that, I'd make a somewhat stronger statement.  If you
let an untrusted container mount arbitrary block devices where they
have rw acccess to the underlying block device, nothing good can
happen.  Period.  :-)

Which is why I don't think the lack of being able to reject
"conflicting mount options" is really all that important.  It
certainly shouldn't block the fsopen patch series.  #1, it's a problem
we have today, and #2, I'm really not all sure supporting bind mounts
via specifying block device was ever a good idea to begin with.  And
#3, while I've been fixing ext4 against security issues caused by
maliciously corrupted file system images, I'm still sure that allowing
untrusted containers access to mount *any* file system via a block
device for which they have r/w access is a Really Bad Idea.

> It seems to me that the current approach mostly involves crossing our fingers.

Agreed!

						- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 20:46         ` Theodore Y. Ts'o
@ 2018-08-10 22:12           ` Darrick J. Wong
  2018-08-10 23:54             ` Theodore Y. Ts'o
  0 siblings, 1 reply; 57+ messages in thread
From: Darrick J. Wong @ 2018-08-10 22:12 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 04:46:39PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> > If the same block device is visible, with rw access, in two different
> > containers, I don't see any anything good can happen.
> 
> It's worse than that.  I've fixed a lot of bugs which cause the kernel
> to crash, and a few that might be levered into a privilege escalationh
> attack, when you mount a maliciously corrupted file system using ext4.
> I'm told told the security researcher filed similar reports with the
> XFS community, and he was told, "that's what metadata checksums are
> for; go away".

Hey now, there was a little more nuance to it than that[1][2].  The
complaint in the first instance had much more to do with breaking
existing V4 filesystems by adding format requirements that mkfs didn't
know about when the filesystem was created.  Yes, you can create V4
filesystems that will hang the system if the log was totally unformatted
and metadata updates are made, but OTOH it's fairly obvious when that
happens, you have to be root to mount a disk filesystem, and we try to
avoid breaking existing users.

XFS developers have been and will continue to examine security problems
when they are brought to our attention and strengthen validation as
needed to minimize the risk of incorrect behaviors, but filesystems are
complex machines, complex machinery is risky, and we arbitrate some of
that risk by requiring administrators to elect to mount an XFS.

> Given how much time it takes to work with these security researchers,
> I don't blame them.
> 
> But in light of that, I'd make a somewhat stronger statement.  If you
> let an untrusted container mount arbitrary block devices where they
> have rw acccess to the underlying block device, nothing good can
> happen.  Period.  :-)
> 
> Which is why I don't think the lack of being able to reject
> "conflicting mount options" is really all that important.  It
> certainly shouldn't block the fsopen patch series.  #1, it's a problem
> we have today, and #2, I'm really not all sure supporting bind mounts
> via specifying block device was ever a good idea to begin with.  And
> #3, while I've been fixing ext4 against security issues caused by
> maliciously corrupted file system images, I'm still sure that allowing
> untrusted containers access to mount *any* file system via a block
> device for which they have r/w access is a Really Bad Idea.
> 
> > It seems to me that the current approach mostly involves crossing our fingers.
> 
> Agreed!

Crossing our fingers and demanding administrator intentionality when
mounting filesystems off some piece of storage.

--D

[1] https://lkml.org/lkml/2018/5/21/649
[2] https://lkml.org/lkml/2018/4/2/572

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 22:12           ` Darrick J. Wong
@ 2018-08-10 23:54             ` Theodore Y. Ts'o
  2018-08-11  0:38               ` Darrick J. Wong
  0 siblings, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-10 23:54 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> Hey now, there was a little more nuance to it than that[1][2].  The
> complaint in the first instance had much more to do with breaking
> existing V4 filesystems by adding format requirements that mkfs didn't
> know about when the filesystem was created.  Yes, you can create V4
> filesystems that will hang the system if the log was totally unformatted
> and metadata updates are made, but OTOH it's fairly obvious when that
> happens, you have to be root to mount a disk filesystem, and we try to
> avoid breaking existing users.

I wasn't thinking about syzbot reports; I've largely written them off
as far as file system testing is concerned, but rather Wen Xu at
Georgia Tech, who is much more reasonable than Dmitry, and has helpeyd
me out a lot; and has complained that the XFS folks haven't been
engaging with him.

In either case, both security researchers are fuzzing file system
images, and then fixing the checksums, and discovering that this can
lead to kernel crashes, and in a few cases, buffer overruns that can
lead to potential privilege escalations.  Wen can generate reports
faster than syzbot, but at least he gives me file system images (as
opposed to having to dig them out of syzbot repro C files) and he
actually does some analysis and explains what he thinks is going on.

I don't think anyone was claiming that format requirements should be
added to ext4 or xfs file systems.  But rather, that kernel code
should be made more robust against maliciously corrupted file system
images that have valid checksums.  I've been more willing to work with
Wen; Dave has expressed the opinion that these are not realistic bug
reports, and since only root can mount file systems, it's not high
priority.

The reason why I bring this up here is that in container land, there
are those who believe that "container root" should be able to mount
file systems, and if the "container root" isn't trusted, the fact that
the "container root" can crash the host kernel, or worse, corrupt the
host kernel and break out of the container as a result, that would be
sad.

I was pretty sure most file system developers are on the same page
that allowing untrusted "container roots" the ability to mount
arbitrary block device file systems is insanity.  Whether or not we
try to fix these sorts of bugs submitted by security researchers.  :-)

	  	       	    	       		  - Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 16:14     ` Theodore Y. Ts'o
  2018-08-10 20:06       ` Andy Lutomirski
@ 2018-08-11  0:28       ` Eric W. Biederman
  1 sibling, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  0:28 UTC (permalink / raw)
  To: linux-security-module

"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
>> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>> 
>> > Even *with* file system support, there's no way today for the VFS to
>> > keep track of whether a pathname resolution came through one
>> > mountpoint or another, so I can't do something like this:
>> 

>> However, the case folding stuff - is that a superblockism of a mountpointism?
>
> It's a superblock-ism.  As far as I know the *only* thing that we can
> support as a mount-pointism is the ro flag, and that's handled as a
> special case, and only if the original superblock was mounted
> read/write.  ey That was my point; aside from the ro flag, we can't
> support any other mount options as a per-mount point thing, so the
> only thing we can do is to fail the mount if there are conflicting
> mount options.  And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt".  But having the second mount fail because of conflicting
> mount option breaks the illusion that containers are functionally as
> rich as VM's.

Ted this isn't about some container case.

It about the fact that practically every filesystem in the kernel has
the behavior I have described and it means that if root is not super
careful root will shoot himself in the foot with the shotgun we have
pointed there.

It really is about loosing acls or some other filesystem option.


Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 23:54             ` Theodore Y. Ts'o
@ 2018-08-11  0:38               ` Darrick J. Wong
  2018-08-11  1:32                 ` Eric W. Biederman
  0 siblings, 1 reply; 57+ messages in thread
From: Darrick J. Wong @ 2018-08-11  0:38 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> > Hey now, there was a little more nuance to it than that[1][2].  The
> > complaint in the first instance had much more to do with breaking
> > existing V4 filesystems by adding format requirements that mkfs didn't
> > know about when the filesystem was created.  Yes, you can create V4
> > filesystems that will hang the system if the log was totally unformatted
> > and metadata updates are made, but OTOH it's fairly obvious when that
> > happens, you have to be root to mount a disk filesystem, and we try to
> > avoid breaking existing users.
> 
> I wasn't thinking about syzbot reports; I've largely written them off
> as far as file system testing is concerned, but rather Wen Xu at
> Georgia Tech, who is much more reasonable than Dmitry, and has helpeyd
> me out a lot; and has complained that the XFS folks haven't been
> engaging with him.

Ahh, ok.  Yes, Wen has been easier to work with, and gives out
filesystem images.  Hm, I'll go comb the bugzilla again...

> In either case, both security researchers are fuzzing file system
> images, and then fixing the checksums, and discovering that this can
> lead to kernel crashes, and in a few cases, buffer overruns that can
> lead to potential privilege escalations.  Wen can generate reports
> faster than syzbot, but at least he gives me file system images (as
> opposed to having to dig them out of syzbot repro C files) and he
> actually does some analysis and explains what he thinks is going on.

(FWIW I tried to figure out how to add fs image dumping to syzbot and
whoah that was horrifying.

> I don't think anyone was claiming that format requirements should be
> added to ext4 or xfs file systems.  But rather, that kernel code
> should be made more robust against maliciously corrupted file system
> images that have valid checksums.  I've been more willing to work with
> Wen; Dave has expressed the opinion that these are not realistic bug
> reports, and since only root can mount file systems, it's not high
> priority.

I don't think they're high priority either, but they're at least worth
/some/ attention.

> The reason why I bring this up here is that in container land, there
> are those who believe that "container root" should be able to mount
> file systems, and if the "container root" isn't trusted, the fact that
> the "container root" can crash the host kernel, or worse, corrupt the
> host kernel and break out of the container as a result, that would be
> sad.
> 
> I was pretty sure most file system developers are on the same page
> that allowing untrusted "container roots" the ability to mount
> arbitrary block device file systems is insanity.

Agreed.

> Whether or not we try to fix these sorts of bugs submitted by security
> researchers.  :-)

and agreed. :)

--D

> 	  	       	    	       		  - Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:16   ` Al Viro
@ 2018-08-11  1:05     ` Eric W. Biederman
  2018-08-11  1:46       ` Theodore Y. Ts'o
  2018-08-11  1:58       ` Al Viro
  0 siblings, 2 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  1:05 UTC (permalink / raw)
  To: linux-security-module

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
>> 
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>> 
>> The script below demonstrates this bug.  Showing this bug can cause the
>> ext4 "acl" "quota" and "user_xattr" options to be silently ignored.
>> 
>> fsopen has my nack until it addresses this issue.
>> 
>> I don't know if we can fix this in the context of sys_mount.  But we if
>> we are redoing the option parsing of how we mount filesystems this needs
>> to be fixed before we start worrying about bug compatibility.
>> 
>> Hopefully this report is simple and clear enough that we can at least
>> agree on the problem.
>
> Sure, it is simple.  So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that
> would get passed to filesystems, so that Eric would be able to implement
> his mount(2)-incompatible behaviour at leisure, without worrying about
> compatibility issues.
>
> Does that address your complaint?

Absolutely not.

My complaint is that the current implemented behavior of practically
every filesystem in the kernel, is that it will ignore mount options
when mounted a second time.  

It is not some weird special case.

It is not some container thing.

It is that the behavior of mount(2) with practically every filesystem
type when that filesystem is already mounted somewhere else behaves
in ways no one would expect.

With the new fsopen api the easy thing to do is simply have CMD_CREATE
CMD_BIND_INTERNAL and be done with it.  CMD_CREATE guarantee that a new
superblock is created.  CMD_BIND_INTERNAL would only work with an
existing superblock.  Then root would at least know that he is
connecting to an already mounted filesystem and could look at the
options etc and fail if he didn't like what he saw.  No surprises, no
muss, no fuss simple.

But I have been told the simple solution above is somehow unacceptable.
And an option to compare the mount options and see if they are the same
was offered.  That would will work to.

I just care that we define the semantics in such a way that it is not
easy for root to get confused and do something stupid that will bite
later, and that we build the infrastructure so that all filesystems
can implement it easily.

So yes this is 100% a question about how filesystems should behave with
respect to their option when mounted for a second time.  That is what
Dave Howells patchset is addressing.

> Because one thing we are not going to do is changing mount(2)
> behaviour.

I have not asked for that.  I have asked that we get it right for
fsopen.

> Reason: userland-visible behaviour of hell knows how many local scripts.

> Another thing that
> is flat-out not feasible is some kind of blanket "compare options"
> stuff; it *can* be done as helpers to be used by filesystem when
> it sees that new flag, but it's simply not going to work at the
> fs-independent level.
>
> Trivial example with the same ext4:
> mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b
> ext4 can tell that these are the same.  syscall itself has no
> clue.  What's more, it's not just explicitly spelled default
> options - it's the stuff that has more than one form.  And while
> we are at it, the things like two NFS mounts of different trees
> from the same server; they might or might not get the same superblock.
> Depending upon the options.
>
> Convenience helper that would allow ext4 to compare options and reject
> the incompatible mount?  Not sure how much ext4-specific knowledge
> would have to go in it, but if you can come up with one - more power
> to you.  But the decision to use it *must* be ext4-specific.  Because
> for e.g. NFS such thing as -o fsid=..., while certainly a part of
> options, has a very different meaning - it's "use a separate fs instance"
> (and let the server deal with coherency issues on its end).
>
> Decision to use sget() (and the way it's used) is up to filesystem.
> We *can't* lift that into syscall.  Not without breaking the fuck out
> of existing behaviour.

I have never proposed that.  See above.  I may have talked in terms
of what sget does and muddied the waters.  If so I apologize.

All I proposed was that we distinguish between a first mount and an
additional mount so that userspace knows the options will be ignored.

Then the code to replicate the current behavior can look like:

	fd = fsopen(...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);
	fsconfig(fd, ...);

	if (fsconfig(fd, CMD_CREATE) == -EBUSY) {
        	fsconfig(fd, CMD_BIND_INTERNAL);
        }

But userspace would then be free to issue a warning or do something
else if CMD_CREATE returns -EBUSY.

I don't know how the above wound up being construed as asking that the
code call sget directly but that is what has happened.

> Having something like a second callback for mount_bdev() that would
> be called when we'd found an existing instance for the same block
> device?  Sure, no problem.  Having a helper for doing such comparison
> that would work in enough cases to bother, so that different fs
> could avoid boilerplate in that callback?  Again, more power to you.

Normal forms etc.  If we want to do that it just requires a wee bit of
discipline.  And if all of the option parsing is being rewritten and
retested anyway I don't see why we can't do something like that as well.
So it does not sound unreasonable to me.

It does sound like more work than what I was proposing.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:11 ` David Howells
  2018-08-10 15:39   ` Theodore Y. Ts'o
  2018-08-10 15:53   ` David Howells
@ 2018-08-11  1:19   ` Eric W. Biederman
  2018-08-11  7:29   ` David Howells
  3 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  1:19 UTC (permalink / raw)
  To: linux-security-module

David Howells <dhowells@redhat.com> writes:

> Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>
> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now* *or*
> *else*, I'm working up a set of additional patches to give userspace the
> option of whether they want no sharing; sharing, but only with exactly the
> same parameters; or to ignore the parameter differences and just accept
> sharing of what's already already mounted (ie. the current behaviour).
>
> The second option, however, is not trivial as it needs to compare the fs
> contexts, including the LSM parameters.  To make that work, I really need to
> remove the old security_mnt_opts stuff - which means I need to port btrfs to
> the new context stuff.
>
> We discussed this yesterday, and I proposed a solution, and I'm working on it.

I repeated this because after some comments from Al on IRC yesterday
and Miklos's email replay. It appeared clear that I had not specified
why my issue was clearly enough for people reading the thread to
understand the problem that I see.

> Yes, I agree it would be nice to have, but it *doesn't* really need supporting
> right this minute, since what I have now oughtn't to break the current
> behaviour.

I am really reluctant to endorse anything that propagates the issues of
the current interface in the new mount interface.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  0:38               ` Darrick J. Wong
@ 2018-08-11  1:32                 ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  1:32 UTC (permalink / raw)
  To: linux-security-module

"Darrick J. Wong" <darrick.wong@oracle.com> writes:

> On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:

>> The reason why I bring this up here is that in container land, there
>> are those who believe that "container root" should be able to mount
>> file systems, and if the "container root" isn't trusted, the fact that
>> the "container root" can crash the host kernel, or worse, corrupt the
>> host kernel and break out of the container as a result, that would be
>> sad.
>> 
>> I was pretty sure most file system developers are on the same page
>> that allowing untrusted "container roots" the ability to mount
>> arbitrary block device file systems is insanity.
>
> Agreed.

For me I am happy with fuse.  That is sufficient to cover any container
use cases people have.   If anyone comes bugging you for more I will be
happy to push back.

The only thing that containers have to do with this is I wind up
touching a lot of the kernel/user boundary so I get to see a lot of it
and sometimes see weird things.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  1:05     ` Eric W. Biederman
@ 2018-08-11  1:46       ` Theodore Y. Ts'o
  2018-08-11  4:48         ` Eric W. Biederman
  2018-08-11  1:58       ` Al Viro
  1 sibling, 1 reply; 57+ messages in thread
From: Theodore Y. Ts'o @ 2018-08-11  1:46 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
> 
> My complaint is that the current implemented behavior of practically
> every filesystem in the kernel, is that it will ignore mount options
> when mounted a second time.

The file system is ***not*** mounted a second time.

The design bug is that we allow bind mounts to be specified via a
block device.  A bind mount is not "a second mount" of the file
system.  Bind mounts != mounts.

I had assumed we had allowed bind mounts to be specified via the block
device because of container use cases.  If the container folks don't
want it, I would be pushing to simply not allow bind mounts to be
specified via block device at all.

The only reason why we should support it is because we don't want to
break scripts; and if the goal is not to break scripts, then we have
to keep to the current semantics, however broken you think it is.

					- Ted

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  1:05     ` Eric W. Biederman
  2018-08-11  1:46       ` Theodore Y. Ts'o
@ 2018-08-11  1:58       ` Al Viro
  2018-08-11  2:17         ` Al Viro
  2018-08-13 12:54         ` Miklos Szeredi
  1 sibling, 2 replies; 57+ messages in thread
From: Al Viro @ 2018-08-11  1:58 UTC (permalink / raw)
  To: linux-security-module

On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:

> All I proposed was that we distinguish between a first mount and an
> additional mount so that userspace knows the options will be ignored.

For pity sake, just what does it take to explain to you that your
notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
and may depend upon the pieces of state userland (especially in container)
simply does not have?

One more time, slowly:

mount -t nfs4 wank.example.org:/foo/bar /mnt/a
mount -t nfs4 wank.example.org:/baz/barf /mnt/b

yield the same superblock.  Is anyone who mounts something over NFS
required to know if anybody else has mounted something from the same
server, and if so how the hell are they supposed to find that out,
so that they could decide whether they are creating the "first" or
"additional" mount, whatever that might mean in this situation?

And how, kernel-side, is that supposed to be handled by generic code
of any description?  

While we are at it,
mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
is *NOT* the same superblock as the previous two.

> I don't know how the above wound up being construed as asking that the
> code call sget directly but that is what has happened.

Not by me.  What I'm saying is that the entire superblock-creating
machinery - all of it - is nothing but library helpers.  With the
decision of when/how/if they are to be used being down to filesystem
driver.  Your "first mount"/"additional mount" simply do not map
to anything universally applicable.

> > Having something like a second callback for mount_bdev() that would
> > be called when we'd found an existing instance for the same block
> > device?  Sure, no problem.  Having a helper for doing such comparison
> > that would work in enough cases to bother, so that different fs
> > could avoid boilerplate in that callback?  Again, more power to you.
> 
> Normal forms etc.  If we want to do that it just requires a wee bit of
> discipline.  And if all of the option parsing is being rewritten and
> retested anyway I don't see why we can't do something like that as well.
> So it does not sound unreasonable to me.

See above.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  1:58       ` Al Viro
@ 2018-08-11  2:17         ` Al Viro
  2018-08-11  4:43           ` Eric W. Biederman
  2018-08-13 12:54         ` Miklos Szeredi
  1 sibling, 1 reply; 57+ messages in thread
From: Al Viro @ 2018-08-11  2:17 UTC (permalink / raw)
  To: linux-security-module

On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
> 
> > All I proposed was that we distinguish between a first mount and an
> > additional mount so that userspace knows the options will be ignored.
> 
> For pity sake, just what does it take to explain to you that your
> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
> and may depend upon the pieces of state userland (especially in container)
> simply does not have?
> 
> One more time, slowly:
> 
> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
> 
> yield the same superblock.  Is anyone who mounts something over NFS
> required to know if anybody else has mounted something from the same
> server, and if so how the hell are they supposed to find that out,
> so that they could decide whether they are creating the "first" or
> "additional" mount, whatever that might mean in this situation?
> 
> And how, kernel-side, is that supposed to be handled by generic code
> of any description?  
> 
> While we are at it,
> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
> is *NOT* the same superblock as the previous two.

s/as the previous two/as in the previous two cases/, that is - the first two
examples yield one superblock, this one - another.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  2:17         ` Al Viro
@ 2018-08-11  4:43           ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  4:43 UTC (permalink / raw)
  To: linux-security-module

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote:
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>> 
>> > All I proposed was that we distinguish between a first mount and an
>> > additional mount so that userspace knows the options will be ignored.
>> 
>> For pity sake, just what does it take to explain to you that your
>> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT
>> and may depend upon the pieces of state userland (especially in container)
>> simply does not have?
>> 
>> One more time, slowly:
>> 
>> mount -t nfs4 wank.example.org:/foo/bar /mnt/a
>> mount -t nfs4 wank.example.org:/baz/barf /mnt/b
>> 
>> yield the same superblock.  Is anyone who mounts something over NFS
>> required to know if anybody else has mounted something from the same
>> server, and if so how the hell are they supposed to find that out,
>> so that they could decide whether they are creating the "first" or
>> "additional" mount, whatever that might mean in this situation?
>> 
>> And how, kernel-side, is that supposed to be handled by generic code
>> of any description?  
>> 
>> While we are at it,
>> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c
>> is *NOT* the same superblock as the previous two.
>
> s/as the previous two/as in the previous two cases/, that is - the first two
> examples yield one superblock, this one - another.

Exactly because the mount options differ.

I don't have a problem if we have something sophisticated like nfs that
handles all of the hairy details and does not reuse a superblock unless the
mount options match.

What I have a problem with is the helper for ordinary filesystems that
are not as sophisticated as nfs that don't handle all of the option
magic and give userspace something different from what userspace asked
for.

It may take a little generalization of the definitions I proposed but it
still remains simple and straight forward.

CMD_THESE_MOUNT_OPTIONS_NO_SURPRISES
CMD_WHATEVER_ALREADY_EXISTS

Or we can make the filesystems more sophisticated when we move
them to the new API and perform the comparisons there.  I think
that is what David Howells is working on.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  1:46       ` Theodore Y. Ts'o
@ 2018-08-11  4:48         ` Eric W. Biederman
  2018-08-11 17:47           ` Casey Schaufler
  0 siblings, 1 reply; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-11  4:48 UTC (permalink / raw)
  To: linux-security-module

"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>> 
>> My complaint is that the current implemented behavior of practically
>> every filesystem in the kernel, is that it will ignore mount options
>> when mounted a second time.
>
> The file system is ***not*** mounted a second time.
>
> The design bug is that we allow bind mounts to be specified via a
> block device.  A bind mount is not "a second mount" of the file
> system.  Bind mounts != mounts.
>
> I had assumed we had allowed bind mounts to be specified via the block
> device because of container use cases.  If the container folks don't
> want it, I would be pushing to simply not allow bind mounts to be
> specified via block device at all.

No it is not a container thing.

> The only reason why we should support it is because we don't want to
> break scripts; and if the goal is not to break scripts, then we have
> to keep to the current semantics, however broken you think it is.

But we don't have to support returning filesystems with mismatched mount
options in the new fsopen api.   That is my concern.  Confusing
userspace this way has been shown to be harmful let's not keep doing it.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 15:11 ` David Howells
                     ` (2 preceding siblings ...)
  2018-08-11  1:19   ` Eric W. Biederman
@ 2018-08-11  7:29   ` David Howells
  2018-08-11 16:31     ` Andy Lutomirski
  3 siblings, 1 reply; 57+ messages in thread
From: David Howells @ 2018-08-11  7:29 UTC (permalink / raw)
  To: linux-security-module

Eric W. Biederman <ebiederm@xmission.com> wrote:

> > Yes, I agree it would be nice to have, but it *doesn't* really need
> > supporting right this minute, since what I have now oughtn't to break the
> > current behaviour.
> 
> I am really reluctant to endorse anything that propagates the issues of
> the current interface in the new mount interface.

Do realise that your problem cannot be solved through fsopen() until every
filesystem is converted to the new fs_context-based sget() since the flag has
to make it from the VFS through the filesystem to sget().

I'm reluctant to add this flag till that point until that time unless we error
out if the flag is set against a legacy filesystem.

David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  7:29   ` David Howells
@ 2018-08-11 16:31     ` Andy Lutomirski
  2018-08-11 16:51       ` Al Viro
  0 siblings, 1 reply; 57+ messages in thread
From: Andy Lutomirski @ 2018-08-11 16:31 UTC (permalink / raw)
  To: linux-security-module

> On Aug 11, 2018, at 12:29 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
>>> Yes, I agree it would be nice to have, but it *doesn't* really need
>>> supporting right this minute, since what I have now oughtn't to break the
>>> current behaviour.
>> 
>> I am really reluctant to endorse anything that propagates the issues of
>> the current interface in the new mount interface.
> 
> Do realise that your problem cannot be solved through fsopen() until every
> filesystem is converted to the new fs_context-based sget() since the flag has
> to make it from the VFS through the filesystem to sget().
> 
> I'm reluctant to add this flag till that point until that time unless we error
> out if the flag is set against a legacy filesystem.
> 
> 

I don?t see why we need all this fancy ?do the options match? stuff.  For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen,  we can keep that behavior in the new API. For other filesystems that don?t have this feature, we should simply fail the request.

IOW I see so compelling reason to call sget() at all from the new API.  The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.

As another way of looking at it: for a network filesystem, mounting the same target ip and path from two different Linux machines works, so mounting it twice from the same machine should also work.  But mounting the same underlying ext4 block device from two different Linux machines (using nbd, iscsi, etc) would be a catastrophe, so I see no reason that it needs to be supported if it?s two mounts from one machine.

The case folding example is interesting, and I think it should probably have a slightly different API. A program could open_tree a nocasefold mount and then make a request to create what is functionally a bind mount but with different options.

mount(8) will presumably just keep using mount(2).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11 16:31     ` Andy Lutomirski
@ 2018-08-11 16:51       ` Al Viro
  0 siblings, 0 replies; 57+ messages in thread
From: Al Viro @ 2018-08-11 16:51 UTC (permalink / raw)
  To: linux-security-module

On Sat, Aug 11, 2018 at 09:31:29AM -0700, Andy Lutomirski wrote:

> I don?t see why we need all this fancy ?do the options match? stuff.  For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen,  we can keep that behavior in the new API. For other filesystems that don?t have this feature, we should simply fail the request.

> IOW I see so compelling reason to call sget() at all from the new API.  The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem.

May I politely suggest the esteemed participants of that conversation
to RTFS?  Yes, I know that it's less fun that talking about your
rather vague ideas of how the things (surely) work, but it just might
avoid the feats of idiocy like the above.

Andy, I don't know how to put it more plainly: read the fucking source.
Even grep would do.  The same NFS you've granted (among the "handful"
of filesystems) an exception, *DOES* *CALL* *THE* *FUCKING* sget().

Yes, really.  And in some obscure[1] cases (including the one mentioned
upthread) it does reuse a pre-existing superblock.  For a very good
reason.

[1] such as, oh, mounting two filesystems from the same server with
default options - who would've ever thought of doing something so
perverted?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  4:48         ` Eric W. Biederman
@ 2018-08-11 17:47           ` Casey Schaufler
  2018-08-15  4:03             ` Eric W. Biederman
  0 siblings, 1 reply; 57+ messages in thread
From: Casey Schaufler @ 2018-08-11 17:47 UTC (permalink / raw)
  To: linux-security-module

On 8/10/2018 9:48 PM, Eric W. Biederman wrote:
> "Theodore Y. Ts'o" <tytso@mit.edu> writes:
>
>> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote:
>>> My complaint is that the current implemented behavior of practically
>>> every filesystem in the kernel, is that it will ignore mount options
>>> when mounted a second time.
>> The file system is ***not*** mounted a second time.
>>
>> The design bug is that we allow bind mounts to be specified via a
>> block device.  A bind mount is not "a second mount" of the file
>> system.  Bind mounts != mounts.
>>
>> I had assumed we had allowed bind mounts to be specified via the block
>> device because of container use cases.  If the container folks don't
>> want it, I would be pushing to simply not allow bind mounts to be
>> specified via block device at all.
> No it is not a container thing.

	Inigo: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
	Rugen: "Stop saying that!"

	Eric:  "It is not a container thing."
	Casey: "Stop saying that!"

Yes, Virginia, it *is* a container thing. Your container manager expects all
filesystems to be server-client based. It makes bad assumptions. It is doing
things that we would fire a sysadmin for doing. Don't blame the filesystems
for behaving as documented. Export the filesystem using NFS and mount them
using the NFS mechanism, which is designed to do what you're asking for. The
problem is not in the mount mechanism, it's in the way you want to abuse it.

>> The only reason why we should support it is because we don't want to
>> break scripts; and if the goal is not to break scripts, then we have
>> to keep to the current semantics, however broken you think it is.
> But we don't have to support returning filesystems with mismatched mount
> options in the new fsopen api.   That is my concern.  Confusing
> userspace this way has been shown to be harmful let's not keep doing it.

It's not "userspace" that's confused. Developers of userspace code
implementing system behavior (e.g. systemd, container managers) need to
understand how the system works. The container manager needs to know
that it can't mount filesystems with different options. That's the kind
of thing "managers" do. If it has to go to the mount table and check
on how the device is already mounted before doing a mount, so be it.

Unless, of course, you want the concept of "container" introduced into
the kernel. There's a whole lot of feldercarb that container managers
have to deal with that would be lots easier to deal with down below.
I'm not advocating that, and I understand the arguments against it.
On the other hand, if you want a platform that is optimized for a
container environment ...

> Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11  1:58       ` Al Viro
  2018-08-11  2:17         ` Al Viro
@ 2018-08-13 12:54         ` Miklos Szeredi
  1 sibling, 0 replies; 57+ messages in thread
From: Miklos Szeredi @ 2018-08-13 12:54 UTC (permalink / raw)
  To: linux-security-module

On Sat, Aug 11, 2018 at 3:58 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:

>  What I'm saying is that the entire superblock-creating
> machinery - all of it - is nothing but library helpers.  With the
> decision of when/how/if they are to be used being down to filesystem
> driver.  Your "first mount"/"additional mount" simply do not map
> to anything universally applicable.

Why so?   (Note: using the "mount" terminology here is fundamentally
broken to start with, mounts have nothing to do with this...
Filesystem instance is better word.)

You bring up NFS as an example, but creating and/or reusing an nfs
client instance connected to a certain server is certainly a clear and
well defined concept.

The question becomes:  does it make  sense to generalize this concept
and export it to userspace with the new API?

You know the Plan 9 fs interface much better, but to me it looks like
there's a separate namespace for filesystem instances, and the mount
command just refers to such an instance.  So there's no comparing of
options or any such horror, just the need to explicitly instantiate a
new instance when necessary.  Doesn't sound very difficult to
implement in the new API.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-10 20:06       ` Andy Lutomirski
  2018-08-10 20:46         ` Theodore Y. Ts'o
@ 2018-08-13 16:35         ` Alan Cox
  2018-08-13 16:48           ` Andy Lutomirski
  1 sibling, 1 reply; 57+ messages in thread
From: Alan Cox @ 2018-08-13 16:35 UTC (permalink / raw)
  To: linux-security-module

> If the same block device is visible, with rw access, in two different
> containers, I don't see any anything good can happen.  Sure, with the

At the raw level there are lots of use cases involving high performance
data capture, media streaming and the like.

At the file system layer you can use GFS2 for example.

So there are cases where it's possible. There are even cases where it's
actually useful at the filesystem level although not many I agree.

Alan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-13 16:35         ` Alan Cox
@ 2018-08-13 16:48           ` Andy Lutomirski
  2018-08-13 17:29             ` Al Viro
  0 siblings, 1 reply; 57+ messages in thread
From: Andy Lutomirski @ 2018-08-13 16:48 UTC (permalink / raw)
  To: linux-security-module

On Mon, Aug 13, 2018 at 9:35 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>> If the same block device is visible, with rw access, in two different
>> containers, I don't see any anything good can happen.  Sure, with the
>
> At the raw level there are lots of use cases involving high performance
> data capture, media streaming and the like.
>
> At the file system layer you can use GFS2 for example.

Ugh.  I even thought of this case, and I should have been a bit more precise:

I would consider the GFS2 case to be essentially equivalent to the NFS
case.  I think we can probably divide all the filesystems into three
or four types:

pseudo file systems: Multiple instantiations of the same fs driver
pointing at the same backing store give separate filesystems.  (Same
backing store includes the case where there isn't any backing store.)
tmpfs is an example.  This isn't particularly interesting.

network-like file systems: Multiple instantiations of the same fs
driver pointing at the same backing store are expected.  This includes
NFS, GFS2, AFS, CIFS, etc.  This is only really interesting to the
extent that, if the fs driver internally wants to share state between
multiple instantiations, it should be smart enough to make sure the
options are compatible or that it can otherwise handle mismatched
options correctly.  NFS does this right.

non-network-like filesystems: There are complicated ones like btrfs
and ZFS and simple ones like ext4.  In either case, multiple totally
separate instantiations of the driver sharing the backing store will
lead to corruption.  In cases like ext4, we seem to support it for
legacy reasons, because we're afraid that there are scripts that try
to mount the same block device more than once, and I think the new API
has no need to support this.  In cases like btrfs, we also seem to
support multiple user requests for "mounts" with the same underlying
block devices because we need it for full functionality.  But I think
this is because our API is wrong.

Are there cases I'm missing?  It sounds like the API could be improved
to fully model the last case, and everything will work nicely.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-13 16:48           ` Andy Lutomirski
@ 2018-08-13 17:29             ` Al Viro
  2018-08-13 19:00               ` James Morris
  0 siblings, 1 reply; 57+ messages in thread
From: Al Viro @ 2018-08-13 17:29 UTC (permalink / raw)
  To: linux-security-module

On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:

> I would consider the GFS2 case to be essentially equivalent to the NFS
> case.  I think we can probably divide all the filesystems into three
> or four types:
> 
> pseudo file systems: Multiple instantiations of the same fs driver
> pointing at the same backing store give separate filesystems.  (Same
> backing store includes the case where there isn't any backing store.)
> tmpfs is an example.  This isn't particularly interesting.
> 
> network-like file systems: Multiple instantiations of the same fs
> driver pointing at the same backing store are expected.  This includes
> NFS, GFS2, AFS, CIFS, etc.  This is only really interesting to the
> extent that, if the fs driver internally wants to share state between
> multiple instantiations, it should be smart enough to make sure the
> options are compatible or that it can otherwise handle mismatched
> options correctly.  NFS does this right.
> 
> non-network-like filesystems: There are complicated ones like btrfs
> and ZFS and simple ones like ext4.  In either case, multiple totally
> separate instantiations of the driver sharing the backing store will
> lead to corruption.  In cases like ext4, we seem to support it for
> legacy reasons, because we're afraid that there are scripts that try
> to mount the same block device more than once, and I think the new API
> has no need to support this.  In cases like btrfs, we also seem to
> support multiple user requests for "mounts" with the same underlying
> block devices because we need it for full functionality.  But I think
> this is because our API is wrong.
> 
> Are there cases I'm missing?  It sounds like the API could be improved
> to fully model the last case, and everything will work nicely.

	You know, that's starting to remind of this little gem of Borges:
http://www.alamut.com/subj/artiface/language/johnWilkins.html
Especially the delightful (fake) quote contained in there:
[...] it is written that the animals are divided into:
	(a) belonging to the emperor,
	(b) embalmed,
	(c) tame,
	(d) sucking pigs,
	(e) sirens,
	(f) fabulous,
	(g) stray dogs,
	(h) included in the present classification,
	(i) frenzied,
	(j) innumerable,
	(k) drawn with a very fine camelhair brush,
	(l) et cetera,
	(m) having just broken the water pitcher,
	(n) that from a long way off look like flies.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-13 17:29             ` Al Viro
@ 2018-08-13 19:00               ` James Morris
  2018-08-13 19:20                 ` Casey Schaufler
  2018-08-15 23:29                 ` Serge E. Hallyn
  0 siblings, 2 replies; 57+ messages in thread
From: James Morris @ 2018-08-13 19:00 UTC (permalink / raw)
  To: linux-security-module

On Mon, 13 Aug 2018, Al Viro wrote:

> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:

> > Are there cases I'm missing?  It sounds like the API could be improved
> > to fully model the last case, and everything will work nicely.
> 
> 	You know, that's starting to remind of this little gem of Borges:
> http://www.alamut.com/subj/artiface/language/johnWilkins.html
> Especially the delightful (fake) quote contained in there:
> [...] it is written that the animals are divided into:
> 	(a) belonging to the emperor,
> 	(b) embalmed,
> 	(c) tame,
> 	(d) sucking pigs,
> 	(e) sirens,
> 	(f) fabulous,
> 	(g) stray dogs,
> 	(h) included in the present classification,
> 	(i) frenzied,
> 	(j) innumerable,
> 	(k) drawn with a very fine camelhair brush,
> 	(l) et cetera,
> 	(m) having just broken the water pitcher,
> 	(n) that from a long way off look like flies.


Coincidentally, this was also the model for Linux capabilities.


-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-13 19:00               ` James Morris
@ 2018-08-13 19:20                 ` Casey Schaufler
  2018-08-15 23:29                 ` Serge E. Hallyn
  1 sibling, 0 replies; 57+ messages in thread
From: Casey Schaufler @ 2018-08-13 19:20 UTC (permalink / raw)
  To: linux-security-module

On 8/13/2018 12:00 PM, James Morris wrote:
> On Mon, 13 Aug 2018, Al Viro wrote:
>
>> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
>>> Are there cases I'm missing?  It sounds like the API could be improved
>>> to fully model the last case, and everything will work nicely.
>> 	You know, that's starting to remind of this little gem of Borges:
>> http://www.alamut.com/subj/artiface/language/johnWilkins.html
>> Especially the delightful (fake) quote contained in there:
>> [...] it is written that the animals are divided into:
>> 	(a) belonging to the emperor,
>> 	(b) embalmed,
>> 	(c) tame,
>> 	(d) sucking pigs,
>> 	(e) sirens,
>> 	(f) fabulous,
>> 	(g) stray dogs,
>> 	(h) included in the present classification,
>> 	(i) frenzied,
>> 	(j) innumerable,
>> 	(k) drawn with a very fine camelhair brush,
>> 	(l) et cetera,
>> 	(m) having just broken the water pitcher,
>> 	(n) that from a long way off look like flies.
>
> Coincidentally, this was also the model for Linux capabilities.

Linux capabilities are POSIX capabilities which are modeled closely
to accommodate the historical behavior manifest in the P1003.1 specification.
So except for (c), (f) and (k) you can use this characterization. 

On a slightly more serious note, there's a lot of Linux, mount semantics
included, that have grow organically and that aren't quite up to the
usage models they are being applied to. I applaud David's work in part
because it may make it possible to accommodate more of those cases going
forward.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-11 17:47           ` Casey Schaufler
@ 2018-08-15  4:03             ` Eric W. Biederman
  0 siblings, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-15  4:03 UTC (permalink / raw)
  To: linux-security-module

Casey Schaufler <casey@schaufler-ca.com> writes:

> Don't blame the filesystems for behaving as documented.

No.  This behavior is not documented.  At least I certainly don't see a
word about this in any of the man pages.  Where does it say mounting a
filesystem will not honor it's mount options?

It is also rare enough in practice it is something it is reasonable to
expect people to be surprised by.

> The problem is not in the mount mechanism, it's in the way you want to
> abuse it.

I am not asking for this behavior.  I am pointing out this behavior
exists.  I am pointing out this behavior is harmful.  I am asking we
stop doing this harmful thing in the new API where we don't have a
chance of breaking anything.

The place where this has bitten the hardest is someone wrote a script to
do something for Xen in a chroot.  That script involved a chroot that
mounted devpts and in doing so happend to change the options of the main
/dev/pts.  Which resulted in ptys created with /dev/ptmx outside the
chroot with the wrong permissions.  That in turn caused several distros
to retain the ancient suid pt_chown binary from libc that the devpts
filesystem was built to make obsolete.  As the world turned that
pt_chown binary could be confused into chowning the wrong pty if a pty
from a container was used.

The fix was to mount a new instance of devpts every time mount of devpts
is called.  That simplified the code, and allowed pt_chown to be removed
permanently.  The tricky bit was figuring out how keep /dev/ptmx
working.  I wound up testing on every distribution I could think of to
ensure no one would notice the slightly changed behavior of the devpts
filesystem.

The behavior in other filesystems of ignoring the options instead of
changing them on the filesystem isn't quite as bad.  But it still has
the potential for a lot of mischief.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
                   ` (6 preceding siblings ...)
  2018-08-10 15:11 ` David Howells
@ 2018-08-15 16:31 ` David Howells
  2018-08-15 16:51   ` Andy Lutomirski
                     ` (2 more replies)
  7 siblings, 3 replies; 57+ messages in thread
From: David Howells @ 2018-08-15 16:31 UTC (permalink / raw)
  To: linux-security-module

Having just re-ported NFS on top of the new mount API stuff, I find that I
don't really like the idea of superblocks being separated by communication
parameters - especially when it might seem reasonable to be able to adjust
those parameters.

Does it make sense to abstract out the remote peer and allow (a) that to be
configured separately from any superblocks using it and (b) that to be used to
create superblocks?

Note that what a 'remote peer' is would be different for different
filesystems:

 (*) For NFS, it would probably be a named server, with address(es) attached
     to the name.  In lieu of actually having a name, the initial IP address
     could be used.

 (*) For CIFS, it would probably be a named server.  I'm not sure if CIFS
     allows an abstraction for a share that can move about inside a domain.

 (*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
     are a matter of direction from the Volume Location server.

 (*) For 9P and Ceph, I don't really know.

What could be configured?  Well, addresses, ports, timeouts.  Maybe protocol
level negotiation - though not being able to explicitly specify, say, the
particular version and minorversion on an NFS share would be problematic for
backward compatibility.

One advantage it could give us is that it might make it easier if someone asks
for server X to query userspace in some way for the default parameters for X
are.

What might this look like in terms of userspace?  Well, we could overload the
new mount API:

	peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
	fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home");
	fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2");
	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1");
	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2");
	fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122");
	fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

	peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
	fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home");
	fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3");
	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3");
	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001");
	fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

	fs = fsopen("nfs", 0);
	fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
	fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
	fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
	m = fsmount(fs, 0, 0);

[Note that Eric's oft-repeated point about the 'creation' operation altering
 established parameters still stands here.]

You could also then reopen it for configuration, maybe by:

	peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);

or:

	peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);

though it might be better to give it its own syscall:

	peer = fspeer("nfs", "server.home", O_CLOEXEC);
	fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	...
	fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

In terms of alternative interfaces, I'm not sure how easy it would be to make
it like cgroups where you go and create a dir in a special filesystem, say,
"/sys/peers/nfs", because the peers records and names would have to be network
namespaced.  Also, it might make it more difficult to use to create a root fs.

On the other hand, being able to adjust the peer configuration by:

	echo 71 >/sys/peers/nfs/server.home/timeo

does have a certain appeal.

Also, netlink might be the right option, but I'm not sure how you'd pin the
resultant object whilst you make use of it.

A further thought is that is it worth making this idea more general and
encompassing non-network devices also?  This would run into issues of some
logical sources being visible across namespaces and but not others.

David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
@ 2018-08-15 16:51   ` Andy Lutomirski
  2018-08-16  3:51   ` Steve French
  2018-08-16  5:06   ` Eric W. Biederman
  2 siblings, 0 replies; 57+ messages in thread
From: Andy Lutomirski @ 2018-08-15 16:51 UTC (permalink / raw)
  To: linux-security-module



> On Aug 15, 2018, at 9:31 AM, David Howells <dhowells@redhat.com> wrote:
> 
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
> 
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
> 
> Note that what a 'remote peer' is would be different for different
> filesystems:

...

I think this looks rather nice.  But maybe you should generalize the concept of ?peer? so that it works for btrfs too. In the case where you mount two different subvolumes, you?re creating a *something*, and you?re then creating a filesystem that references it. It?s almost the same thing.

> 
> 


> 
>    fs = fsopen("nfs", 0);
>    fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);

As you mention below, this seems like it might have namespacing issues.

>     

> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced.  Also, it might make it more difficult to use to create a root fs.
> 
> On the other hand, being able to adjust the peer configuration by:
> 
>    echo 71 >/sys/peers/nfs/server.home/timeo
> 
> does have a certain appeal.
> 
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
> 

My suggestion would be to avoid giving these things names at all. I think that referring to them by fd should be sufficient, especially if you allow them to be reopened based on a mount that uses them and allow them to get bind-mounted somewhere a la namespaces to make them permanent if needed.

> A further thought is that is it worth making this idea more general and
> encompassing non-network devices also?  This would run into issues of some
> logical sources being visible across namespaces and but not others.

Indeed :)

It probably pays to rope a btrfs person into this discussion.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* BUG: Mount ignores mount options
  2018-08-13 19:00               ` James Morris
  2018-08-13 19:20                 ` Casey Schaufler
@ 2018-08-15 23:29                 ` Serge E. Hallyn
  1 sibling, 0 replies; 57+ messages in thread
From: Serge E. Hallyn @ 2018-08-15 23:29 UTC (permalink / raw)
  To: linux-security-module

Quoting James Morris (jmorris at namei.org):
> On Mon, 13 Aug 2018, Al Viro wrote:
> 
> > On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
> 
> > > Are there cases I'm missing?  It sounds like the API could be improved
> > > to fully model the last case, and everything will work nicely.
> > 
> > 	You know, that's starting to remind of this little gem of Borges:
> > http://www.alamut.com/subj/artiface/language/johnWilkins.html
> > Especially the delightful (fake) quote contained in there:
> > [...] it is written that the animals are divided into:
> > 	(a) belonging to the emperor,
> > 	(b) embalmed,
> > 	(c) tame,
> > 	(d) sucking pigs,
> > 	(e) sirens,
> > 	(f) fabulous,
> > 	(g) stray dogs,
> > 	(h) included in the present classification,
> > 	(i) frenzied,
> > 	(j) innumerable,
> > 	(k) drawn with a very fine camelhair brush,
> > 	(l) et cetera,
> > 	(m) having just broken the water pitcher,
> > 	(n) that from a long way off look like flies.
> 
> 
> Coincidentally, this was also the model for Linux capabilities.

But maybe we want to split the stray dogs up by breed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
  2018-08-15 16:51   ` Andy Lutomirski
@ 2018-08-16  3:51   ` Steve French
  2018-08-16  5:06   ` Eric W. Biederman
  2 siblings, 0 replies; 57+ messages in thread
From: Steve French @ 2018-08-16  3:51 UTC (permalink / raw)
  To: linux-security-module

This is worth further detailed discussion re:SMB3 as there are some fascinating
protocol features that might help here, but my first thought is just the obvious
one - this could help 'DFS' (the global name space feature almost all modern
CIFS/SMB3 implement) work a little better in the client.  A share can
be represented by an array of \\server\share\path targets although typically
only one except in the DFS case  (and server can be an ipv4 or
ipv6 address or host name (which could have multiple addresses).
It could be over RDMA, TCP, and even other protocols (as the transport).

There are various examples of DFS referrals in
https://msdn.microsoft.com/en-us/library/cc227066.aspx section 4.

But since SMB3 also supports transparent failover, and "share move"
and "server move" features, as well as multichannel - I would like
to better understand the patch set to see if it helps/hurts.

But until I dive into the patch set more and try it, hard for me to speculate.
Has anyone looked at the CIFS/SMB3 changes needed?

On Wed, Aug 15, 2018 at 11:32 AM David Howells <dhowells@redhat.com> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
>  (*) For NFS, it would probably be a named server, with address(es) attached
>      to the name.  In lieu of actually having a name, the initial IP address
>      could be used.
>
>  (*) For CIFS, it would probably be a named server.  I'm not sure if CIFS
>      allows an abstraction for a share that can move about inside a domain.

CIFS/SMB3 has fairly mature support (in the protocol) for various types
of share redirection (not just 'DFS' that is supported by most every
NAS server, and Macs, Windows, Linux clients etc).  There are also
very interesting features introduced with SMB 3.1.1 allowing 'tree
connect contexts"
which some important servers in the last few years implement.

This is worth more discussion - SMB3 (in particular the SMB3.1.1 dialect) has
a lot of interesting features here.

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
  2018-08-15 16:51   ` Andy Lutomirski
  2018-08-16  3:51   ` Steve French
@ 2018-08-16  5:06   ` Eric W. Biederman
  2018-08-16 16:24     ` Steve French
  2018-08-17 23:11     ` Al Viro
  2 siblings, 2 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-16  5:06 UTC (permalink / raw)
  To: linux-security-module

David Howells <dhowells@redhat.com> writes:

> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
>  (*) For NFS, it would probably be a named server, with address(es) attached
>      to the name.  In lieu of actually having a name, the initial IP address
>      could be used.
>
>  (*) For CIFS, it would probably be a named server.  I'm not sure if CIFS
>      allows an abstraction for a share that can move about inside a domain.
>
>  (*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
>      are a matter of direction from the Volume Location server.
>
>  (*) For 9P and Ceph, I don't really know.
>
> What could be configured?  Well, addresses, ports, timeouts.  Maybe protocol
> level negotiation - though not being able to explicitly specify, say, the
> particular version and minorversion on an NFS share would be problematic for
> backward compatibility.
>
> One advantage it could give us is that it might make it easier if someone asks
> for server X to query userspace in some way for the default parameters for X
> are.
>
> What might this look like in terms of userspace?  Well, we could overload the
> new mount API:
>
> 	peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
> 	fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> 	fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home");
> 	fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2");
> 	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1");
> 	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2");
> 	fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122");
> 	fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> 	peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
> 	fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> 	fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home");
> 	fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3");
> 	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3");
> 	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001");
> 	fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> 	fs = fsopen("nfs", 0);
> 	fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
> 	fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
> 	fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
> 	m = fsmount(fs, 0, 0);
>
> [Note that Eric's oft-repeated point about the 'creation' operation altering
>  established parameters still stands here.]
>
> You could also then reopen it for configuration, maybe by:
>
> 	peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);
>
> or:
>
> 	peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);
>
> though it might be better to give it its own syscall:
>
> 	peer = fspeer("nfs", "server.home", O_CLOEXEC);
> 	fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
> 	...
> 	fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced.  Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
> 	echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>
> A further thought is that is it worth making this idea more general and
> encompassing non-network devices also?  This would run into issues of some
> logical sources being visible across namespaces and but not others.

Even network filesystems are going to have challenges of filesystems
being visible in some network namespaces and not others.  As some
filesystems will be visible on the internet and some filesystems will
only be visible on the appropriate local network.  Network namespaces
are sometimes used to deal with the case of local networks with
overlapping ip addresses.

I think you are proposing a model for network filesystems that is
essentially the same situation where we are with most block devices
filesystems today.  Where some parameters identitify the local
filesystem instance and some parameters identify how the kernel
interacts with that filesystem instance.


For system efficiency there is a strong argument for having the fewest
number of filesystem instances we can.  Otherwise we will be caching the
same data twice and wasting space in RAM etc.


So I like the idea.


At least for devpts we always create a new filesystem instance every
time mount(2) is called.  NFS seems to have the option to create a new
filesystem instance every time mount(2) is called as well, (even if the
filesystem parameters are the same).  And depending on the case I can
see the attraction for other filesystems as well.

So I don't think we can completely abandon the option for filesystems
to always create a new filesystem instance when mount(8) is called.



I most definitely support thinking this through and figuring out how it
best make sense for the new filesystem API to create new filesystem
instances or fail to create new filesystems instances.


Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-16  5:06   ` Eric W. Biederman
@ 2018-08-16 16:24     ` Steve French
  2018-08-16 17:21       ` Eric W. Biederman
  2018-08-16 17:23       ` Aurélien Aptel
  2018-08-17 23:11     ` Al Viro
  1 sibling, 2 replies; 57+ messages in thread
From: Steve French @ 2018-08-16 16:24 UTC (permalink / raw)
  To: linux-security-module

On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> David Howells <dhowells@redhat.com> writes:
>
> > Having just re-ported NFS on top of the new mount API stuff, I find that I
> > don't really like the idea of superblocks being separated by communication
> > parameters - especially when it might seem reasonable to be able to adjust
> > those parameters.
> >
> > Does it make sense to abstract out the remote peer and allow (a) that to be
> > configured separately from any superblocks using it and (b) that to be used to
> > create superblocks?
<snip>
> At least for devpts we always create a new filesystem instance every
> time mount(2) is called.  NFS seems to have the option to create a new
> filesystem instance every time mount(2) is called as well, (even if the
> filesystem parameters are the same).  And depending on the case I can
> see the attraction for other filesystems as well.
>
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

In cifs we attempt to match new mounts to existing tree connections
(instances of connections to a \\server\share) from other mount(s)
based first on whether security settings match (e.g. are both
Kerberos) and then on whether encryption is on/off and whether this is
a snapshot mount (smb3 previous versions feature).  If neither is
mounted with a snaphsot and the encryption settings match then
we will use the same tree id to talk with the server as the other
mounts use.  Interesting idea to allow mount to force a new
tree id.

What was the NFS mount option you were talking about?
Looking at the nfs man page the only one that looked similar
was "nosharecache"

> I most definitely support thinking this through and figuring out how it
> best make sense for the new filesystem API to create new filesystem
> instances or fail to create new filesystems instances.

Yes - it is an interesting question.

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-16 16:24     ` Steve French
@ 2018-08-16 17:21       ` Eric W. Biederman
  2018-08-16 17:23       ` Aurélien Aptel
  1 sibling, 0 replies; 57+ messages in thread
From: Eric W. Biederman @ 2018-08-16 17:21 UTC (permalink / raw)
  To: linux-security-module

Steve French <smfrench@gmail.com> writes:

> On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> David Howells <dhowells@redhat.com> writes:
>>
>> > Having just re-ported NFS on top of the new mount API stuff, I find that I
>> > don't really like the idea of superblocks being separated by communication
>> > parameters - especially when it might seem reasonable to be able to adjust
>> > those parameters.
>> >
>> > Does it make sense to abstract out the remote peer and allow (a) that to be
>> > configured separately from any superblocks using it and (b) that to be used to
>> > create superblocks?
> <snip>
>> At least for devpts we always create a new filesystem instance every
>> time mount(2) is called.  NFS seems to have the option to create a new
>> filesystem instance every time mount(2) is called as well, (even if the
>> filesystem parameters are the same).  And depending on the case I can
>> see the attraction for other filesystems as well.
>>
>> So I don't think we can completely abandon the option for filesystems
>> to always create a new filesystem instance when mount(8) is called.
>
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature).  If neither is
> mounted with a snaphsot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use.  Interesting idea to allow mount to force a new
> tree id.
>
> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

I was remembering this from reading the nfs mount code:

static int nfs_compare_super(struct super_block *sb, void *data)
{
...
	if (!nfs_compare_super_address(old, server))
		return 0;
	/* Note: NFS_MOUNT_UNSHARED == NFS4_MOUNT_UNSHARED */
	if (old->flags & NFS_MOUNT_UNSHARED)
		return 0;
...
}

If a filesystem has NFS_MOUNT_UNSHARED set it does not serve as a
candidate for new mount requests.  Skimming the code it looks like
nosharecache is what sets NFS_MOUNT_UNSHARED.


Another interesting and common case is tmpfs which always creates a new
filesystem instance whenever it is mounted.

Eric

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-16 16:24     ` Steve French
  2018-08-16 17:21       ` Eric W. Biederman
@ 2018-08-16 17:23       ` Aurélien Aptel
  2018-08-16 18:36         ` Steve French
  1 sibling, 1 reply; 57+ messages in thread
From: Aurélien Aptel @ 2018-08-16 17:23 UTC (permalink / raw)
  To: linux-security-module

Steve French <smfrench@gmail.com> writes:
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature).  If neither is
> mounted with a snaphsot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use.  Interesting idea to allow mount to force a new
> tree id.

We actually already have this mount option in cifs.ko, it's "nosharesock".

> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

Cheers,
-- 
Aur?lien Aptel / SUSE Labs Samba Team
GPG: 1839 CB5F 9F5B FB9B AA97  8C99 03C8 A49B 521B D5D3
SUSE Linux GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
GF: Felix Imend?rffer, Jane Smithard, Graham Norton, HRB 21284 (AG N?rnberg)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-16 17:23       ` Aurélien Aptel
@ 2018-08-16 18:36         ` Steve French
  0 siblings, 0 replies; 57+ messages in thread
From: Steve French @ 2018-08-16 18:36 UTC (permalink / raw)
  To: linux-security-module

On Thu, Aug 16, 2018 at 12:23 PM Aur?lien Aptel <aaptel@suse.com> wrote:
>
> Steve French <smfrench@gmail.com> writes:
> > In cifs we attempt to match new mounts to existing tree connections
> > (instances of connections to a \\server\share) from other mount(s)
> > based first on whether security settings match (e.g. are both
> > Kerberos) and then on whether encryption is on/off and whether this is
> > a snapshot mount (smb3 previous versions feature).  If neither is
> > mounted with a snaphsot and the encryption settings match then
> > we will use the same tree id to talk with the server as the other
> > mounts use.  Interesting idea to allow mount to force a new
> > tree id.
>
> We actually already have this mount option in cifs.ko, it's "nosharesock".

Yes - good point.  It is very easy to do on cifs.  I mainly use that to simulate
multiple clients for testing servers (so each mount to the same server
whether or not the share matched, looks like a different client, coming
from a different socket and thus with different session ids and tree
ids as well).

It is very useful when trying to simulate multiple clients running to the same
server while using only one client machine (or VM).

> > What was the NFS mount option you were talking about?
> > Looking at the nfs man page the only one that looked similar
> > was "nosharecache"

The nfs man page apparently discourages its use:

"As of kernel 2.6.18, the behavior specified by nosharecache is legacy
caching behavior. This is  considered  a  data risk"

-- 
Thanks,

Steve

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Should we split the network filesystem setup into two phases?
  2018-08-16  5:06   ` Eric W. Biederman
  2018-08-16 16:24     ` Steve French
@ 2018-08-17 23:11     ` Al Viro
  1 sibling, 0 replies; 57+ messages in thread
From: Al Viro @ 2018-08-17 23:11 UTC (permalink / raw)
  To: linux-security-module

On Thu, Aug 16, 2018 at 12:06:06AM -0500, Eric W. Biederman wrote:

> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

Huh?  If filesystem wants to create a new instance on each ->mount(),
it can bloody well do so.  Quite a few do - if that fs can handle
that, more power to it.

The problem is what to do with filesystems that *can't* do that.
You really, really can't have two ext4 (or xfs, etc.) instances over
the same device at the same time.  Cache coherency, locking, etc.
will kill you.

And that's not to mention the joy of defining the semantics of
having the same ext4 mounted with two logs at the same time ;-)

I've seen "reject unless the options are compatible/identical/whatever",
but that ignores the real problem with existing policy.  It's *NOT*
"I've mounted this and got an existing instance with non-matching
options".  That's a minor annoyance (and back when that decision
had been made, mount(2) was very definitly root-only).  The real
problem is different and much worse - it's remount.

I have asked to mount something and it had already been mounted,
with identical options.  OK, so what happens if I do mount -o remount
on my instance?  *IF* we are operating in the "only sysadmin can
mount new filesystems", it's not a big deal - there are already
lots of ways you can shoot yourself in the foot and mount(2) is
certainly a powerful one.  But if we get to "Joe R. Luser can do
it in his container", we have a big problem.

Decision back then had been mostly for usability reasons - it was
back in 2001 (well before the containermania, userns or anything
of that sort) and it was more about "how many hoops does one have
to jump through to get something mounted, assuming the sanity of
sysadmin doing that?".  If *anything* like userns had been a concern
back then, it probably would've been different.  However, it's
17 years too late and if anyone has a functional TARDIS, I can
easily think of better uses for it...

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2018-08-17 23:11 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-08-01 15:23 [PATCH 00/33] VFS: Introduce filesystem context [ver #11] David Howells
2018-08-01 15:24 ` [PATCH 08/33] vfs: Add LSM hooks for the new mount API " David Howells
2018-08-01 20:50   ` James Morris
2018-08-01 22:53   ` David Howells
2018-08-01 15:25 ` [PATCH 09/33] selinux: Implement the new mount API LSM hooks " David Howells
2018-08-01 15:25 ` [PATCH 10/33] smack: Implement filesystem context security " David Howells
2018-08-01 15:25 ` [PATCH 11/33] apparmor: Implement security hooks for the new mount API " David Howells
2018-08-01 15:25 ` [PATCH 12/33] tomoyo: " David Howells
2018-08-10 14:05 ` BUG: Mount ignores mount options Eric W. Biederman
2018-08-10 14:36   ` Andy Lutomirski
2018-08-10 15:17     ` Eric W. Biederman
2018-08-10 15:24     ` Al Viro
2018-08-10 15:11   ` Tetsuo Handa
2018-08-10 15:13   ` David Howells
2018-08-10 15:16   ` Al Viro
2018-08-11  1:05     ` Eric W. Biederman
2018-08-11  1:46       ` Theodore Y. Ts'o
2018-08-11  4:48         ` Eric W. Biederman
2018-08-11 17:47           ` Casey Schaufler
2018-08-15  4:03             ` Eric W. Biederman
2018-08-11  1:58       ` Al Viro
2018-08-11  2:17         ` Al Viro
2018-08-11  4:43           ` Eric W. Biederman
2018-08-13 12:54         ` Miklos Szeredi
2018-08-10 15:11 ` David Howells
2018-08-10 15:39   ` Theodore Y. Ts'o
2018-08-10 15:55     ` Casey Schaufler
2018-08-10 16:11     ` David Howells
2018-08-10 18:00     ` Eric W. Biederman
2018-08-10 15:53   ` David Howells
2018-08-10 16:14     ` Theodore Y. Ts'o
2018-08-10 20:06       ` Andy Lutomirski
2018-08-10 20:46         ` Theodore Y. Ts'o
2018-08-10 22:12           ` Darrick J. Wong
2018-08-10 23:54             ` Theodore Y. Ts'o
2018-08-11  0:38               ` Darrick J. Wong
2018-08-11  1:32                 ` Eric W. Biederman
2018-08-13 16:35         ` Alan Cox
2018-08-13 16:48           ` Andy Lutomirski
2018-08-13 17:29             ` Al Viro
2018-08-13 19:00               ` James Morris
2018-08-13 19:20                 ` Casey Schaufler
2018-08-15 23:29                 ` Serge E. Hallyn
2018-08-11  0:28       ` Eric W. Biederman
2018-08-11  1:19   ` Eric W. Biederman
2018-08-11  7:29   ` David Howells
2018-08-11 16:31     ` Andy Lutomirski
2018-08-11 16:51       ` Al Viro
2018-08-15 16:31 ` Should we split the network filesystem setup into two phases? David Howells
2018-08-15 16:51   ` Andy Lutomirski
2018-08-16  3:51   ` Steve French
2018-08-16  5:06   ` Eric W. Biederman
2018-08-16 16:24     ` Steve French
2018-08-16 17:21       ` Eric W. Biederman
2018-08-16 17:23       ` Aurélien Aptel
2018-08-16 18:36         ` Steve French
2018-08-17 23:11     ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).