linux-fsdevel.vger.kernel.org archive mirror
* [GIT PULL] vfs mount
@ 2025-01-18 13:06 Christian Brauner
  2025-01-20  0:10 ` Sasha Levin
  2025-01-20 18:59 ` [GIT PULL] vfs mount pr-tracker-bot
From: Christian Brauner @ 2025-01-18 13:06 UTC
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

/* Summary */

This contains the mount updates for this cycle:

- Add a mountinfo program to demonstrate statmount()/listmount()

  Add a new "mountinfo" sample userland program that demonstrates how to
  use statmount() and listmount() to get at the same info that
  /proc/pid/mountinfo provides.

- Remove pointless nospec.h include.

- Prepend statmount.mnt_opts string with security_sb_mnt_opts()

  Currently these mount options aren't accessible via statmount().

- Add new mount namespaces to mount namespace rbtree outside of the
  namespace semaphore.

- Lockless mount namespace lookup

  Currently we take the read lock when looking for a mount namespace to
  list mounts in. We can make this lockless. The simple search case can
  just use a sequence counter to detect concurrent changes to the
  rbtree.

  For walking the list of mount namespaces sequentially via nsfs we keep
  a separate rcu list as rb_prev() and rb_next() aren't usable safely
  with rcu. Currently there is no primitive for retrieving the previous
  list member. To do this we need a new deletion primitive that doesn't
  poison the prev pointer and a corresponding retrieval helper.

  Since creating mount namespaces is a relatively rare event compared
  with querying mounts in a foreign mount namespace this is worth it.
  Once libmount and systemd pick up this mechanism to list mounts in
  foreign mount namespaces this will be used very frequently.

  - Add extended selftests for lockless mount namespace iteration.

  - Add a sample program to list all mounts on the system, i.e., in all
    mount namespaces.

- Improve mount namespace iteration performance

  Make finding the last or first mount to start iterating the mount
  namespace from an O(1) operation and add selftests for iterating the
  mount table starting from the first and last mount.

- Use an xarray for the old mount id

  While the ida does use the xarray internally we can use it explicitly
  which allows us to increment the unique mount id under the xa lock.
  This allows us to remove the atomic as we're now allocating both ids
  in one go.
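
The deletion primitive mentioned under "Lockless mount namespace lookup"
can be illustrated with a small userspace model. This is only a sketch:
struct node, POISON, and the helper names are made up for illustration
and are not the kernel's list_head API, and a real RCU list additionally
needs rcu_dereference()-style accessors and a grace period before a
removed entry is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a circular doubly linked list (not the kernel's list_head). */
struct node {
	struct node *next;
	struct node *prev;
	int id;
};

/* Hypothetical stand-in for a "poisoned pointer" marker. */
static struct node poison;
#define POISON (&poison)

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	assert(n != head);
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Conventional deletion: unlink the entry and poison its prev pointer,
 * so nobody still looking at the removed entry can walk backwards. */
static void del_poison_prev(struct node *n)
{
	n->next->prev = n->prev;
	n->prev->next = n->next;
	n->prev = POISON;
}

/* Bidirectional-friendly deletion: unlink the entry but leave both of
 * its own pointers intact, so a walker positioned on it can still move
 * forwards or backwards into the live list. */
static void del_keep_prev(struct node *n)
{
	n->next->prev = n->prev;
	n->prev->next = n->next;
}

/* Retrieval helper: previous entry, or NULL if the pointer is poisoned
 * or the walk has reached the list head. */
static struct node *prev_entry(const struct node *head, const struct node *n)
{
	struct node *p = n->prev;

	if (p == POISON || p == head)
		return NULL;
	return p;
}
```

With three entries a, b, c on the list, del_keep_prev(&b) leaves
prev_entry(&head, &b) pointing at a, while del_poison_prev(&c) makes
prev_entry(&head, &c) return NULL.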

/* Testing */

gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

This will have a merge conflict with the vfs pidfs pull request sent in
https://lore.kernel.org/r/20250118-vfs-pidfs-5921bfa5632a@brauner
and it can be resolved as follows:

diff --cc fs/namespace.c
index 64deda6f5b2c,371c860f49de..000000000000
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@@ -32,8 -32,6 +32,7 @@@
  #include <linux/fs_context.h>
  #include <linux/shmem_fs.h>
  #include <linux/mnt_idmapping.h>
 +#include <linux/pidfs.h>
- #include <linux/nospec.h>

  #include "pnode.h"
  #include "internal.h"

The following changes since commit 344bac8f0d73fe970cd9f5b2f132906317d29e8b:

  fs: kill MNT_ONRB (2025-01-09 16:58:50 +0100)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.mount

for you to fetch changes up to f79e6eb84d4d2bff99e3ca6c1f140b2af827e904:

  samples/vfs/mountinfo: Use __u64 instead of uint64_t (2025-01-10 12:08:27 +0100)

Please consider pulling these changes from the signed vfs-6.14-rc1.mount tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.14-rc1.mount

----------------------------------------------------------------
Christian Brauner (17):
      mount: remove inlude/nospec.h include
      fs: add mount namespace to rbtree late
      Merge patch series "fs: listmount()/statmount() fix and sample program"
      fs: lockless mntns rbtree lookup
      rculist: add list_bidir_{del,prev}_rcu()
      fs: lockless mntns lookup for nsfs
      fs: simplify rwlock to spinlock
      seltests: move nsfs into filesystems subfolder
      selftests: add tests for mntns iteration
      selftests: remove unneeded include
      samples: add test-list-all-mounts
      Merge patch series "fs: lockless mntns lookup"
      fs: cache first and last mount
      selftests: add listmount() iteration tests
      Merge patch series "fs: tweak mntns iteration"
      fs: use xarray for old mount id
      fs: remove useless lockdep assertion

Geert Uytterhoeven (1):
      samples/vfs/mountinfo: Use __u64 instead of uint64_t

Jeff Layton (2):
      samples: add a mountinfo program to demonstrate statmount()/listmount()
      fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()

 fs/mount.h                                         |  31 ++-
 fs/namespace.c                                     | 200 +++++++++------
 fs/nsfs.c                                          |   5 +-
 include/linux/rculist.h                            |  44 ++++
 samples/vfs/.gitignore                             |   2 +
 samples/vfs/Makefile                               |   2 +-
 samples/vfs/mountinfo.c                            | 273 +++++++++++++++++++++
 samples/vfs/test-list-all-mounts.c                 | 235 ++++++++++++++++++
 .../selftests/{ => filesystems}/nsfs/.gitignore    |   1 +
 .../selftests/{ => filesystems}/nsfs/Makefile      |   4 +-
 .../selftests/{ => filesystems}/nsfs/config        |   0
 .../selftests/filesystems/nsfs/iterate_mntns.c     | 149 +++++++++++
 .../selftests/{ => filesystems}/nsfs/owner.c       |   0
 .../selftests/{ => filesystems}/nsfs/pidns.c       |   0
 .../selftests/filesystems/statmount/Makefile       |   2 +-
 .../filesystems/statmount/listmount_test.c         |  66 +++++
 tools/testing/selftests/pidfd/pidfd.h              |   1 -
 17 files changed, 918 insertions(+), 97 deletions(-)
 create mode 100644 samples/vfs/mountinfo.c
 create mode 100644 samples/vfs/test-list-all-mounts.c
 rename tools/testing/selftests/{ => filesystems}/nsfs/.gitignore (78%)
 rename tools/testing/selftests/{ => filesystems}/nsfs/Makefile (50%)
 rename tools/testing/selftests/{ => filesystems}/nsfs/config (100%)
 create mode 100644 tools/testing/selftests/filesystems/nsfs/iterate_mntns.c
 rename tools/testing/selftests/{ => filesystems}/nsfs/owner.c (100%)
 rename tools/testing/selftests/{ => filesystems}/nsfs/pidns.c (100%)
 create mode 100644 tools/testing/selftests/filesystems/statmount/listmount_test.c

* [GIT PULL] vfs mount
@ 2025-03-22 10:13 Christian Brauner
  2025-03-24 21:00 ` pr-tracker-bot
From: Christian Brauner @ 2025-03-22 10:13 UTC
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

/* Summary */

This contains the first batch of mount updates for this cycle:

- Mount notifications

  The day has come where we finally provide a new api to listen for
  mount topology changes outside of /proc/<pid>/mountinfo. A mount
  namespace file descriptor can be supplied and registered with fanotify
  to listen for mount topology changes.

  Currently notifications for mount, umount and moving mounts are
  generated. The generated notification record contains the unique mount
  id of the mount.

  The listmount() and statmount() api can be used to query detailed
  information about the mount using the received unique mount id.

  This allows userspace to figure out exactly how the mount topology
  changed without having to generate diffs of /proc/<pid>/mountinfo in
  userspace.

- Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api.

- Support detached mounts in overlayfs.

  Since last cycle we support specifying overlayfs layers via file
  descriptors. However, we don't allow detached mounts which means
  userspace cannot use file descriptors received via
  open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach
  them to a mount namespace via move_mount() first. This is cumbersome
  and means they have to undo mounts via umount(). This allows them to
  directly use detached mounts.

- Allow retrieving idmappings with statmount().

  Currently it isn't possible to figure out what idmapping has been
  attached to an idmapped mount. Add an extension to statmount() which
  allows reading the idmapping from the mount.

- Allow creating idmapped mounts from mounts that are already idmapped.

  So far it hasn't been possible to create idmapped mounts from already
  idmapped mounts as this has significant lifetime implications. Make
  the creation of idmapped mounts atomic by allowing struct mount_attr
  to be passed to the open_tree_attr() system call. This solves these
  issues without complicating VFS lookup in any way.

  More generally, the system call has the benefit that creating a
  detached mount and applying mount attributes to it becomes an atomic
  operation for userspace.

- Add a way to query statmount() for supported options.

  Allow userspace to query which mount information can be retrieved
  through statmount().

- Allow superblock owners to force unmount.
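
To illustrate the "Mount notifications" item above: a consumer can keep
its view of the mount table current by applying each event to a set of
unique mount ids instead of re-reading and diffing mountinfo. The
following is a toy model only. The event type, struct names, and helpers
are hypothetical and do not reproduce the fanotify record layout;
detailed information for each id would come from statmount().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MOUNTS 64

/* Hypothetical event kinds; the real records are delivered via fanotify. */
enum ev_type { EV_ATTACH, EV_DETACH };

struct mnt_event {
	enum ev_type type;
	uint64_t mnt_id;	/* unique mount id carried by the notification */
};

/* The consumer's current view of the mount namespace. */
struct mount_set {
	uint64_t ids[MAX_MOUNTS];
	size_t n;
};

/* Return the index of id in the set, or -1 if it is not present. */
static int set_find(const struct mount_set *s, uint64_t id)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->ids[i] == id)
			return (int)i;
	return -1;
}

/* Apply one event to the view. Returns 0 on success, -1 if the set is full. */
static int apply_event(struct mount_set *s, const struct mnt_event *ev)
{
	int i = set_find(s, ev->mnt_id);

	switch (ev->type) {
	case EV_ATTACH:
		if (i >= 0)
			return 0;	/* already known */
		if (s->n == MAX_MOUNTS)
			return -1;
		s->ids[s->n++] = ev->mnt_id;
		return 0;
	case EV_DETACH:
		if (i < 0)
			return 0;	/* never seen, nothing to drop */
		s->ids[i] = s->ids[--s->n];	/* swap-remove */
		return 0;
	}
	return -1;
}
```

Each event costs one lookup in the set, whereas the diffing approach has
to parse the whole of /proc/<pid>/mountinfo on every change.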

/* Testing */

gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)

No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

This contains a merge conflict with the vfs-6.15.misc pull request:

diff --cc fs/internal.h
index 82127c69e641,db6094d5cb0b..000000000000
--- a/fs/internal.h
+++ b/fs/internal.h
@@@ -337,4 -338,4 +337,5 @@@ static inline bool path_mounted(const s
        return path->mnt->mnt_root == path->dentry;
  }
  void file_f_owner_release(struct file *file);
 +bool file_seek_cur_needs_f_lock(struct file *file);
+ int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map);

The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:

  Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount

for you to fetch changes up to e1ff7aa34dec7e650159fd7ca8ec6af7cc428d9f:

  umount: Allow superblock owners to force umount (2025-03-19 09:19:04 +0100)

Please consider pulling these changes from the signed vfs-6.15-rc1.mount tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.15-rc1.mount

----------------------------------------------------------------
Arnd Bergmann (1):
      samples/vfs: fix printf format string for size_t

Christian Brauner (18):
      Merge patch series "mount notification"
      fs: support O_PATH fds with FSCONFIG_SET_FD
      selftests/overlayfs: test specifying layers as O_PATH file descriptors
      Merge patch series "ovl: allow O_PATH file descriptor when specifying layers"
      fs: allow detached mounts in clone_private_mount()
      uidgid: add map_id_range_up()
      statmount: allow to retrieve idmappings
      samples/vfs: check whether flag was raised
      selftests: add tests for using detached mount with overlayfs
      samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP
      Merge patch series "fs: allow detached mounts in clone_private_mount()"
      fs: add vfs_open_tree() helper
      fs: add copy_mount_setattr() helper
      fs: add open_tree_attr()
      fs: add kflags member to struct mount_kattr
      fs: allow changing idmappings
      Merge patch series "statmount: allow to retrieve idmappings"
      Merge patch series "fs: allow changing idmappings"

Jeff Layton (1):
      statmount: add a new supported_mask field

Miklos Szeredi (5):
      fsnotify: add mount notification infrastructure
      fanotify: notify on mount attach and detach
      vfs: add notifications for mount attach and detach
      selinux: add FILE__WATCH_MOUNTNS
      selftests: add tests for mount notification

Trond Myklebust (1):
      umount: Allow superblock owners to force umount

 arch/alpha/kernel/syscalls/syscall.tbl             |   1 +
 arch/arm/tools/syscall.tbl                         |   1 +
 arch/arm64/tools/syscall_32.tbl                    |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl              |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl        |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl          |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl          |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl          |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl            |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl           |   1 +
 arch/s390/kernel/syscalls/syscall.tbl              |   1 +
 arch/sh/kernel/syscalls/syscall.tbl                |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl             |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl             |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl            |   1 +
 fs/autofs/autofs_i.h                               |   2 +
 fs/fsopen.c                                        |   2 +-
 fs/internal.h                                      |   1 +
 fs/mnt_idmapping.c                                 |  51 ++
 fs/mount.h                                         |  26 ++
 fs/namespace.c                                     | 485 ++++++++++++++-----
 fs/notify/fanotify/fanotify.c                      |  38 +-
 fs/notify/fanotify/fanotify.h                      |  18 +
 fs/notify/fanotify/fanotify_user.c                 |  89 +++-
 fs/notify/fdinfo.c                                 |   5 +
 fs/notify/fsnotify.c                               |  47 +-
 fs/notify/fsnotify.h                               |  11 +
 fs/notify/mark.c                                   |  14 +-
 fs/pnode.c                                         |   4 +-
 include/linux/fanotify.h                           |  12 +-
 include/linux/fsnotify.h                           |  20 +
 include/linux/fsnotify_backend.h                   |  42 ++
 include/linux/mnt_idmapping.h                      |   5 +
 include/linux/syscalls.h                           |   4 +
 include/linux/uidgid.h                             |   6 +
 include/uapi/asm-generic/unistd.h                  |   4 +-
 include/uapi/linux/fanotify.h                      |  10 +
 include/uapi/linux/mount.h                         |  10 +-
 kernel/user_namespace.c                            |  26 +-
 samples/vfs/samples-vfs.h                          |  14 +-
 samples/vfs/test-list-all-mounts.c                 |  35 +-
 scripts/syscall.tbl                                |   1 +
 security/selinux/hooks.c                           |   3 +
 security/selinux/include/classmap.h                |   2 +-
 tools/testing/selftests/Makefile                   |   1 +
 .../selftests/filesystems/mount-notify/.gitignore  |   2 +
 .../selftests/filesystems/mount-notify/Makefile    |   6 +
 .../filesystems/mount-notify/mount-notify_test.c   | 516 +++++++++++++++++++++
 .../filesystems/overlayfs/set_layers_via_fds.c     | 195 ++++++++
 .../selftests/filesystems/overlayfs/wrappers.h     |  17 +
 .../selftests/filesystems/statmount/statmount.h    |   2 +-
 52 files changed, 1567 insertions(+), 175 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/mount-notify/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/mount-notify/Makefile
 create mode 100644 tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c

* [GIT PULL] vfs mount
@ 2024-09-13 14:41 Christian Brauner
  2024-09-14  2:33 ` Stephen Rothwell
  2024-09-16 11:09 ` pr-tracker-bot
From: Christian Brauner @ 2024-09-13 14:41 UTC
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

/* Summary */

Recently, we added the ability to list mounts in other mount namespaces
and the ability to retrieve namespace file descriptors without having to
go through procfs by deriving them from pidfds.

This extends nsfs in two ways:

(1) Add the ability to retrieve information about a mount namespace via
    NS_MNT_GET_INFO. This will return the mount namespace id and the
    number of mounts currently in the mount namespace. The number of
    mounts can be used to size the buffer that needs to be used for
    listmount() and is in general useful without having to actually
    iterate through all the mounts.

    The structure is extensible.

(2) Add the ability to iterate through all mount namespaces over which
    the caller holds privilege returning the file descriptor for the
    next or previous mount namespace.

    To retrieve a mount namespace the caller must be privileged with
    respect to its owning user namespace. This means that PID 1 on the
    host can list all mounts in all mount namespaces or that a container
    can list all mounts of its nested containers.

    Optionally pass a structure for NS_MNT_GET_INFO with
    NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
    namespace in one go.

(1) and (2) can be implemented for other namespace types easily.

Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs. The
merge message contains example code showing how to do this.

/* Testing */

gcc version 14.2.0 (Debian 14.2.0-3)
Debian clang version 16.0.6 (27+b1)

All patches are based on v6.11-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

(1) linux-next: build failure after merge of the bpf-next tree
    https://lore.kernel.org/r/20240913133240.066ae790@canb.auug.org.au

    The reported merge conflict isn't really with bpf-next but with the
    series to convert to fd_file() accessors for the changed struct fd
    representation.

    The patch you need to fix this however is correct in that draft. But
    honestly, it's pretty easy for you to figure out on your own anyway.

The following changes since commit 8400291e289ee6b2bf9779ff1c83a291501f017b:

  Linux 6.11-rc1 (2024-07-28 14:19:55 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.12.mount

for you to fetch changes up to 49224a345c488a0e176f193a60a2a76e82349e3e:

  Merge patch series "nsfs: iterate through mount namespaces" (2024-08-09 12:47:05 +0200)

Please consider pulling these changes from the signed vfs-6.12.mount tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.12.mount

----------------------------------------------------------------
Christian Brauner (5):
      fs: allow mount namespace fd
      fs: add put_mnt_ns() cleanup helper
      file: add fput() cleanup helper
      nsfs: iterate through mount namespaces
      Merge patch series "nsfs: iterate through mount namespaces"

 fs/mount.h                    |  13 ++++++
 fs/namespace.c                |  74 +++++++++++++++++++++++++-----
 fs/nsfs.c                     | 102 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/file.h          |   2 +
 include/linux/mnt_namespace.h |   4 ++
 include/uapi/linux/nsfs.h     |  16 +++++++
 6 files changed, 198 insertions(+), 13 deletions(-)

* [GIT PULL] vfs mount
@ 2024-05-10 11:46 Christian Brauner
  2024-05-13 19:38 ` pr-tracker-bot
From: Christian Brauner @ 2024-05-10 11:46 UTC
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

/* Summary */
This converts qnx6, minix, debugfs, tracefs, freevxfs, and openpromfs to
the new mount api further reducing the number of filesystems relying on
the legacy mount api.

/* Testing */
clang: Debian clang version 16.0.6 (26)
gcc: (Debian 13.2.0-24)

All patches are based on v6.9-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.

/* Conflicts */

Merge conflicts with mainline
=============================

[1] linux-next: manual merge of the vfs-brauner tree with Linus' tree
    https://lore.kernel.org/linux-next/20240506095258.05b5deae@canb.auug.org.au

The following changes since commit 4cece764965020c22cff7665b18a012006359095:

  Linux 6.9-rc1 (2024-03-24 14:10:05 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.10.mount

for you to fetch changes up to 7cd7bfe59328741185ef6db3356489c22919e59b:

  minix: convert minix to use the new mount api (2024-03-26 09:04:55 +0100)

Please consider pulling these changes from the signed vfs-6.10.mount tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.10.mount

----------------------------------------------------------------
Bill O'Donnell (2):
      qnx6: convert qnx6 to use the new mount api
      minix: convert minix to use the new mount api

David Howells (2):
      vfs: Convert debugfs to use the new mount API
      vfs: Convert tracefs to use the new mount API

Eric Sandeen (2):
      freevxfs: Convert freevxfs to the new mount API.
      openpromfs: finish conversion to the new mount API

 fs/debugfs/inode.c       | 198 ++++++++++++++++++++++-------------------------
 fs/freevxfs/vxfs_super.c |  69 ++++++++++-------
 fs/minix/inode.c         |  48 +++++++-----
 fs/openpromfs/inode.c    |   8 +-
 fs/qnx6/inode.c          | 117 ++++++++++++++++------------
 fs/tracefs/inode.c       | 196 ++++++++++++++++++++++------------------------
 6 files changed, 327 insertions(+), 309 deletions(-)

* [GIT PULL] vfs: mount
@ 2023-06-23 11:03 Christian Brauner
  2023-06-26 17:34 ` pr-tracker-bot
From: Christian Brauner @ 2023-06-23 11:03 UTC
  To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel

Hey Linus,

/* Summary */
This contains the work to extend move_mount() to allow adding a mount
beneath the topmost mount of a mount stack.

There are two LWN articles about this. One covers the original patch
series in [1]. The other in [2] summarizes the session and roughly the
discussion between Al and me at LSFMM. The second article also goes into
some good questions from attendees.

Since all details are found in the relevant commit, with a technical dive
into semantics and locking at the end, I'm only adding the motivation and
core functionality from the commit message and leaving out the invasive
details. The code is also heavily commented and annotated, which was
explicitly requested.

TL;DR:

> mount -t ext4 /dev/sda /mnt
  |
  └─/mnt    /dev/sda    ext4

> mount --beneath -t xfs /dev/sdb /mnt
  |
  └─/mnt    /dev/sdb    xfs
    └─/mnt  /dev/sda    ext4

> umount /mnt
  |
  └─/mnt    /dev/sdb    xfs
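
The TL;DR sequence above can be modelled as a stack of mounts on a single
mountpoint. This is a toy userspace sketch with hypothetical names
(struct mount_stack and its helpers); it makes no system calls and only
mirrors the visible effect of mounting on top, mounting beneath, and
unmounting the top mount.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STACK 8

/* Toy model of the mounts stacked on one mountpoint.
 * Index n - 1 is the topmost mount, i.e. the one that is visible. */
struct mount_stack {
	const char *fs[MAX_STACK];
	size_t n;
};

/* Regular mount: the new mount covers whatever was there before. */
static int mount_on_top(struct mount_stack *s, const char *fs)
{
	if (s->n == MAX_STACK)
		return -1;
	s->fs[s->n++] = fs;
	return 0;
}

/* Model of "mount --beneath": slot the new mount directly under the
 * current top mount. There must be a top mount to go beneath. */
static int mount_beneath(struct mount_stack *s, const char *fs)
{
	if (s->n == 0 || s->n == MAX_STACK)
		return -1;
	s->fs[s->n] = s->fs[s->n - 1];	/* top mount moves up one slot */
	s->fs[s->n - 1] = fs;		/* new mount sits right below it */
	s->n++;
	return 0;
}

/* Unmount the topmost mount, revealing the one beneath it. */
static int umount_top(struct mount_stack *s)
{
	if (s->n == 0)
		return -1;
	s->n--;
	return 0;
}

/* Filesystem currently visible at the mountpoint, or NULL if none. */
static const char *visible(const struct mount_stack *s)
{
	return s->n ? s->fs[s->n - 1] : NULL;
}
```

After mount_on_top(ext4) and mount_beneath(xfs), ext4 is still the
visible filesystem; a single umount_top() then reveals xfs, and at no
point is the mountpoint left uncovered.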

The longer motivation is that various distributions are adding or are in
the process of adding support for system extensions and in the future
configuration extensions through various tools. A more detailed
explanation on system and configuration extensions can be found on the
manpage which is listed below at [3].

System extension images may, dynamically at runtime, extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

       TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /       /       ext4    shared:1     29      1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

       TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
       /       /       ext4    shared:24 master:1  71      47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1, which, as one can see, is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.
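
As a small aid for reading the PROPAGATION column, the following sketch
extracts the peer group ids from such a field. The struct and function
names are hypothetical, and the parser only understands the shared:N and
master:N tags used in the tables here; an id of 0 stands for "not
present".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Peer group ids parsed from a PROPAGATION field; 0 means "not set". */
struct propagation {
	int shared;	/* peer group this mount is a shared member of */
	int master;	/* peer group this mount is a slave to */
};

/* Return the id following "tag" in field, or 0 if the tag is absent. */
static int tag_id(const char *field, const char *tag)
{
	const char *p = strstr(field, tag);
	int id;

	if (p && sscanf(p + strlen(tag), "%d", &id) == 1)
		return id;
	return 0;
}

/* Parse a field such as "shared:24 master:1". */
static struct propagation parse_propagation(const char *field)
{
	struct propagation p;

	p.shared = tag_id(field, "shared:");
	p.master = tag_id(field, "master:");
	return p;
}

/* True if child receives propagation from parent's peer group. */
static int is_slave_of(struct propagation child, struct propagation parent)
{
	return child.master != 0 && child.master == parent.shared;
}
```

Applied to the two tables above, the service rootfs ("shared:24
master:1") is reported as a slave of the host rootfs ("shared:1"), while
the reverse relation does not hold.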

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slightly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

       TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

       TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows propagating mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [4]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers, of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

/* Testing */
clang: Ubuntu clang version 15.0.7
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0

All patches are based on v6.4-rc2 and have been sitting in linux-next.
No build failures or warnings were observed. All old and new tests in
fstests, selftests, and LTP pass without regressions.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:

  Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)

are available in the Git repository at:

  git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.5/vfs.mount

for you to fetch changes up to 6ac392815628f317fcfdca1a39df00b9cc4ebc8b:

  fs: allow to mount beneath top mount (2023-05-19 04:30:22 +0200)

Please consider pulling these changes from the signed v6.5/vfs.mount tag.

Thanks!
Christian

----------------------------------------------------------------
v6.5/vfs.mount

----------------------------------------------------------------
Christian Brauner (4):
      fs: add path_mounted()
      fs: properly document __lookup_mnt()
      fs: use a for loop when locking a mount
      fs: allow to mount beneath top mount

 fs/namespace.c             | 451 +++++++++++++++++++++++++++++++++++++--------
 fs/pnode.c                 |  42 ++++-
 fs/pnode.h                 |   3 +
 include/uapi/linux/mount.h |   3 +-
 4 files changed, 417 insertions(+), 82 deletions(-)


Thread overview: 34+ messages
2025-01-18 13:06 [GIT PULL] vfs mount Christian Brauner
2025-01-20  0:10 ` Sasha Levin
2025-01-20 12:21   ` Christian Brauner
2025-01-20 12:26     ` [GIT PULL] vfs mount v2 Christian Brauner
2025-01-20 19:00       ` pr-tracker-bot
2025-01-20 18:59 ` [GIT PULL] vfs mount pr-tracker-bot
  -- strict thread matches above, loose matches on Subject: below --
2025-03-22 10:13 Christian Brauner
2025-03-24 21:00 ` pr-tracker-bot
2025-04-01 17:07   ` Leon Romanovsky
2025-04-03  8:29     ` Christian Brauner
2025-04-03 15:15       ` Christian Brauner
2025-04-03 15:34         ` James Bottomley
2025-04-03 17:21           ` Mateusz Guzik
2025-04-03 18:09             ` Linus Torvalds
2025-04-03 19:17               ` Mateusz Guzik
2025-04-04  8:28               ` Christoph Hellwig
2025-04-04 14:19                 ` Linus Torvalds
2025-04-07  8:51                   ` Christoph Hellwig
2025-04-07 16:00                     ` Linus Torvalds
2025-04-08  5:06                       ` Christoph Hellwig
2025-04-07 11:22                   ` Christian Brauner
2025-04-03 18:24         ` Leon Romanovsky
2025-04-03 19:18           ` Linus Torvalds
2025-04-03 19:45             ` Christian Brauner
2025-04-03 19:55               ` Christian Brauner
2025-04-04  6:16             ` Leon Romanovsky
2025-04-03 19:38           ` James Bottomley
2024-09-13 14:41 Christian Brauner
2024-09-14  2:33 ` Stephen Rothwell
2024-09-16 11:09 ` pr-tracker-bot
2024-05-10 11:46 Christian Brauner
2024-05-13 19:38 ` pr-tracker-bot
2023-06-23 11:03 [GIT PULL] vfs: mount Christian Brauner
2023-06-26 17:34 ` pr-tracker-bot
