* [GIT PULL] vfs: mount
@ 2023-06-23 11:03 Christian Brauner
2023-06-26 17:34 ` pr-tracker-bot
0 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2023-06-23 11:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel
Hey Linus,
/* Summary */
This contains the work to extend move_mount() to allow adding a mount
beneath the topmost mount of a mount stack.
There are two LWN articles about this. One covers the original patch
series in [1]. The other in [2] summarizes the session and roughly the
discussion between Al and me at LSFMM. The second article also goes into
some good questions from attendees.
Since all details are found in the relevant commit, with a technical dive
into semantics and locking at the end, I'm only adding the motivation and
core functionality from the commit message here and leaving out the
invasive details. The code is also heavily commented and annotated,
which was explicitly requested.
TL;DR:
> mount -t ext4 /dev/sda /mnt
  |
  └─/mnt    /dev/sda    ext4
> mount --beneath -t xfs /dev/sdb /mnt
  |
  └─/mnt    /dev/sdb    xfs
    └─/mnt  /dev/sda    ext4
> umount /mnt
  |
  └─/mnt    /dev/sdb    xfs
The longer motivation is that various distributions are adding or are in
the process of adding support for system extensions and in the future
configuration extensions through various tools. A more detailed
explanation on system and configuration extensions can be found on the
manpage which is listed below at [3].
System extension images may, dynamically at runtime, extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.
When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.
System configuration images are similar but operate on directories
containing system or service configuration.
On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):
TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
/       /       ext4    shared:1     29      1
On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.
Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.
In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:
TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
/       /       ext4    shared:24 master:1  71      47
For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1, which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.
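As a small illustration of reading such a PROPAGATION value, here is a
minimal sketch that pulls the peer group ids out of a string like
"shared:24 master:1". The struct and function names are hypothetical and
only the shared:N and master:N tags used in this mail are handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Peer group ids taken from a PROPAGATION field; 0 means "not present". */
struct propagation {
	long shared_group;	/* id after "shared:" */
	long master_group;	/* id after "master:" */
};

/* Return the number that follows tag (e.g. "shared:") in field, or 0. */
static long group_after(const char *field, const char *tag)
{
	const char *hit = strstr(field, tag);

	return hit ? strtol(hit + strlen(tag), NULL, 10) : 0;
}

/* Hypothetical helper: parse the shared/master peer group ids. */
static struct propagation parse_propagation(const char *field)
{
	struct propagation p;

	assert(field != NULL);
	p.shared_group = group_after(field, "shared:");
	p.master_group = group_after(field, "master:");
	return p;
}
```

For the service rootfs above this yields shared_group 24 and
master_group 1, while the host rootfs ("shared:1") has no master.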
A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.
For containers things are just slightly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:
TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
/home/ubuntu/debian-tree  /       ext4    shared:99    61      60
So whereas services are isolated OS components, a container is treated
like a separate world and mount propagation into it is restricted to a
single well-known mount that is a slave to the peer group of the shared
mount /run on the host:
TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
/propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68
Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows mounts to be propagated into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader, the blog post in [4]
might be worth reading; there I explain the old and the new approach to
inserting mounts into mount namespaces.
Containers, of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.
The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.
The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.
The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.
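To make the userspace side concrete, here is a minimal sketch of
attaching an already created detached mount beneath the top mount of a
target. The helper name mount_beneath() is hypothetical, the fallback
constants are meant to match include/uapi/linux/mount.h, and the
fallback syscall number 429 is an assumption that holds for the generic
syscall table but not for every architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values as in include/uapi/linux/mount.h. */
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef MOVE_MOUNT_BENEATH
#define MOVE_MOUNT_BENEATH 0x00000200
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429	/* assumption: generic syscall table number */
#endif

/*
 * Attach the detached mount referred to by fd_mnt beneath the topmost
 * mount at target. Returns 0 on success, -1 with errno set on failure.
 */
static int mount_beneath(int fd_mnt, const char *target)
{
	assert(target != NULL);
	return (int)syscall(SYS_move_mount, fd_mnt, "", AT_FDCWD, target,
			    MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_BENEATH);
}
```

The detached mount would typically come from fsmount() or
open_tree(OPEN_TREE_CLONE). A later umount of the target then reveals
the mount that was placed beneath, as in the TL;DR above.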
The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility: updates to
a system may either be forced or can be delayed, and the umount of the
top mount can be left to a service if it is a cooperative one.
Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013
/* Testing */
clang: Ubuntu clang version 15.0.7
gcc: (Ubuntu 12.2.0-3ubuntu1) 12.2.0
All patches are based on v6.4-rc2 and have been sitting in linux-next.
No build failures or warnings were observed. All old and new tests in
fstests, selftests, and LTP pass without regressions.
/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.
The following changes since commit f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6:
Linux 6.4-rc2 (2023-05-14 12:51:40 -0700)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.5/vfs.mount
for you to fetch changes up to 6ac392815628f317fcfdca1a39df00b9cc4ebc8b:
fs: allow to mount beneath top mount (2023-05-19 04:30:22 +0200)
Please consider pulling these changes from the signed v6.5/vfs.mount tag.
Thanks!
Christian
----------------------------------------------------------------
v6.5/vfs.mount
----------------------------------------------------------------
Christian Brauner (4):
fs: add path_mounted()
fs: properly document __lookup_mnt()
fs: use a for loop when locking a mount
fs: allow to mount beneath top mount
fs/namespace.c | 451 +++++++++++++++++++++++++++++++++++++--------
fs/pnode.c | 42 ++++-
fs/pnode.h | 3 +
include/uapi/linux/mount.h | 3 +-
4 files changed, 417 insertions(+), 82 deletions(-)
* Re: [GIT PULL] vfs: mount
2023-06-23 11:03 [GIT PULL] vfs: mount Christian Brauner
@ 2023-06-26 17:34 ` pr-tracker-bot
0 siblings, 0 replies; 32+ messages in thread
From: pr-tracker-bot @ 2023-06-26 17:34 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel
The pull request you sent on Fri, 23 Jun 2023 13:03:58 +0200:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/v6.5/vfs.mount
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c0a572d9d32fe1e95672f24e860776dba0750a38
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* [GIT PULL] vfs mount
@ 2024-05-10 11:46 Christian Brauner
2024-05-13 19:38 ` pr-tracker-bot
0 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2024-05-10 11:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel
Hey Linus,
/* Summary */
This converts qnx6, minix, debugfs, tracefs, freevxfs, and openpromfs to
the new mount api further reducing the number of filesystems relying on
the legacy mount api.
/* Testing */
clang: Debian clang version 16.0.6 (26)
gcc: (Debian 13.2.0-24)
All patches are based on v6.9-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
[1] linux-next: manual merge of the vfs-brauner tree with Linus' tree
https://lore.kernel.org/linux-next/20240506095258.05b5deae@canb.auug.org.au
The following changes since commit 4cece764965020c22cff7665b18a012006359095:
Linux 6.9-rc1 (2024-03-24 14:10:05 -0700)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.10.mount
for you to fetch changes up to 7cd7bfe59328741185ef6db3356489c22919e59b:
minix: convert minix to use the new mount api (2024-03-26 09:04:55 +0100)
Please consider pulling these changes from the signed vfs-6.10.mount tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.10.mount
----------------------------------------------------------------
Bill O'Donnell (2):
qnx6: convert qnx6 to use the new mount api
minix: convert minix to use the new mount api
David Howells (2):
vfs: Convert debugfs to use the new mount API
vfs: Convert tracefs to use the new mount API
Eric Sandeen (2):
freevxfs: Convert freevxfs to the new mount API.
openpromfs: finish conversion to the new mount API
fs/debugfs/inode.c | 198 ++++++++++++++++++++++-------------------------
fs/freevxfs/vxfs_super.c | 69 ++++++++++-------
fs/minix/inode.c | 48 +++++++-----
fs/openpromfs/inode.c | 8 +-
fs/qnx6/inode.c | 117 ++++++++++++++++------------
fs/tracefs/inode.c | 196 ++++++++++++++++++++++------------------------
6 files changed, 327 insertions(+), 309 deletions(-)
* Re: [GIT PULL] vfs mount
2024-05-10 11:46 Christian Brauner
@ 2024-05-13 19:38 ` pr-tracker-bot
0 siblings, 0 replies; 32+ messages in thread
From: pr-tracker-bot @ 2024-05-13 19:38 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel
The pull request you sent on Fri, 10 May 2024 13:46:47 +0200:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.10.mount
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/103fb219cf57fc3641d92af2f4f438080cea3efc
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* [GIT PULL] vfs mount
@ 2024-09-13 14:41 Christian Brauner
2024-09-14 2:33 ` Stephen Rothwell
2024-09-16 11:09 ` pr-tracker-bot
0 siblings, 2 replies; 32+ messages in thread
From: Christian Brauner @ 2024-09-13 14:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel
Hey Linus,
/* Summary */
Recently, we added the ability to list mounts in other mount namespaces
and the ability to retrieve namespace file descriptors without having to
go through procfs by deriving them from pidfds.
This extends nsfs in two ways:
(1) Add the ability to retrieve information about a mount namespace via
NS_MNT_GET_INFO. This will return the mount namespace id and the
number of mounts currently in the mount namespace. The number of
mounts can be used to size the buffer that needs to be used for
listmount() and is in general useful without having to actually
iterate through all the mounts.
The structure is extensible.
(2) Add the ability to iterate through all mount namespaces over which
the caller holds privilege returning the file descriptor for the
next or previous mount namespace.
To retrieve a mount namespace the caller must be privileged wrt
its owning user namespace. This means that PID 1 on the host can
list all mounts in all mount namespaces or that a container can list
all mounts of its nested containers.
Optionally pass a structure for NS_MNT_GET_INFO with
NS_MNT_GET_{PREV,NEXT} to retrieve information about the mount
namespace in one go.
(1) and (2) can be implemented for other namespace types easily.
Together with recent api additions this means one can iterate through
all mounts in all mount namespaces without ever touching procfs. The
merge message contains example code showing how to do this.
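Independent of the example in the merge message, here is a rough sketch
of how (1) can be combined with listmount(): ask for the number of
mounts first, then size the id buffer from it. The struct and macro
names below are local stand-ins, not the uapi names. Their layout, the
nsfs ioctl number (type 0xb7, nr 10), the all-ones root mount id and the
fallback syscall number 458 are assumptions that should be checked
against include/uapi/linux/nsfs.h and include/uapi/linux/mount.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirrors of the uapi definitions (assumed layout and numbers). */
struct mntns_info {		/* stands in for struct mnt_ns_info */
	uint32_t size;
	uint32_t nr_mounts;
	uint64_t mnt_ns_id;
};

struct lsmt_req {		/* stands in for struct mnt_id_req */
	uint32_t size;
	uint32_t spare;
	uint64_t mnt_id;
	uint64_t param;
	uint64_t mnt_ns_id;
};

#define X_NS_MNT_GET_INFO _IOR(0xb7, 10, struct mntns_info)
#define X_LSMT_ROOT UINT64_MAX	/* start listing at the namespace root */
#ifndef SYS_listmount
#define SYS_listmount 458	/* assumption: generic syscall table number */
#endif

/*
 * Store the unique mount ids of the namespace behind nsfd in *ids_out.
 * Returns the number of ids, or -1 with errno set. Caller frees *ids_out.
 */
static ssize_t list_mounts_in_ns(int nsfd, uint64_t **ids_out)
{
	struct mntns_info info = { .size = sizeof(info) };
	struct lsmt_req req = { .size = sizeof(req), .mnt_id = X_LSMT_ROOT };
	uint64_t *ids;
	size_t cap;
	ssize_t n;

	assert(ids_out != NULL);
	*ids_out = NULL;
	if (ioctl(nsfd, X_NS_MNT_GET_INFO, &info) < 0)
		return -1;

	/* nr_mounts is only a sizing hint; mounts may come and go. */
	cap = info.nr_mounts ? info.nr_mounts : 1;
	ids = calloc(cap, sizeof(*ids));
	if (!ids)
		return -1;

	req.mnt_ns_id = info.mnt_ns_id;
	n = syscall(SYS_listmount, &req, ids, cap, 0UL);
	if (n < 0) {
		free(ids);
		return -1;
	}
	*ids_out = ids;
	return n;
}
```

A caller would pass a file descriptor for a mount namespace, for example
one opened from /proc/self/ns/mnt or obtained via the iteration ioctls
in (2), and should be prepared for the call to fail on kernels that lack
these interfaces.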
/* Testing */
gcc version 14.2.0 (Debian 14.2.0-3)
Debian clang version 16.0.6 (27+b1)
All patches are based on v6.11-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
(1) linux-next: build failure after merge of the bpf-next tree
https://lore.kernel.org/r/20240913133240.066ae790@canb.auug.org.au
The reported merge conflict isn't really with bpf-next but with the
series to convert to fd_file() accessors for the changed struct fd
representation.
The patch you need to fix this however is correct in that draft. But
honestly, it's pretty easy for you to figure out on your own anyway.
The following changes since commit 8400291e289ee6b2bf9779ff1c83a291501f017b:
Linux 6.11-rc1 (2024-07-28 14:19:55 -0700)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.12.mount
for you to fetch changes up to 49224a345c488a0e176f193a60a2a76e82349e3e:
Merge patch series "nsfs: iterate through mount namespaces" (2024-08-09 12:47:05 +0200)
Please consider pulling these changes from the signed vfs-6.12.mount tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.12.mount
----------------------------------------------------------------
Christian Brauner (5):
fs: allow mount namespace fd
fs: add put_mnt_ns() cleanup helper
file: add fput() cleanup helper
nsfs: iterate through mount namespaces
Merge patch series "nsfs: iterate through mount namespaces"
fs/mount.h | 13 ++++++
fs/namespace.c | 74 +++++++++++++++++++++++++-----
fs/nsfs.c | 102 +++++++++++++++++++++++++++++++++++++++++-
include/linux/file.h | 2 +
include/linux/mnt_namespace.h | 4 ++
include/uapi/linux/nsfs.h | 16 +++++++
6 files changed, 198 insertions(+), 13 deletions(-)
* Re: [GIT PULL] vfs mount
2024-09-13 14:41 Christian Brauner
@ 2024-09-14 2:33 ` Stephen Rothwell
2024-09-16 11:09 ` pr-tracker-bot
1 sibling, 0 replies; 32+ messages in thread
From: Stephen Rothwell @ 2024-09-14 2:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel, Al Viro
[-- Attachment #1: Type: text/plain, Size: 846 bytes --]
Hi Linus,
On Fri, 13 Sep 2024 16:41:58 +0200 Christian Brauner <brauner@kernel.org> wrote:
>
> (1) linux-next: build failure after merge of the bpf-next tree
> https://lore.kernel.org/r/20240913133240.066ae790@canb.auug.org.au
>
> The reported merge conflict isn't really with bpf-next but with the
> series to convert to fd_file() accessors for the changed struct fd
> representation.
>
> The patch you need to fix this however is correct in that draft. But
> honestly, it's pretty easy for you to figure out on your own anyway.
Except Al Viro told me an earlier time we had this conflict (the commit
that did the conversion to fd_file() was removed from linux-next for a
while) that !fd_file(f) should (could?) be replaced by fd_empty(f) - but
that may be done later.
--
Cheers,
Stephen Rothwell
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
* Re: [GIT PULL] vfs mount
2024-09-13 14:41 Christian Brauner
2024-09-14 2:33 ` Stephen Rothwell
@ 2024-09-16 11:09 ` pr-tracker-bot
1 sibling, 0 replies; 32+ messages in thread
From: pr-tracker-bot @ 2024-09-16 11:09 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel
The pull request you sent on Fri, 13 Sep 2024 16:41:58 +0200:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.12.mount
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9020d0d844ad58a051f90b1e5b82ba34123925b9
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* [GIT PULL] vfs mount
@ 2025-01-18 13:06 Christian Brauner
2025-01-20 0:10 ` Sasha Levin
2025-01-20 18:59 ` pr-tracker-bot
0 siblings, 2 replies; 32+ messages in thread
From: Christian Brauner @ 2025-01-18 13:06 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel
Hey Linus,
/* Summary */
This contains the mount updates for this cycle:
- Add a mountinfo program to demonstrate statmount()/listmount()
Add a new "mountinfo" sample userland program that demonstrates how to
use statmount() and listmount() to get at the same info that
/proc/pid/mountinfo provides.
- Remove pointless nospec.h include.
- Prepend statmount.mnt_opts string with security_sb_mnt_opts()
Currently these mount options aren't accessible via statmount().
- Add new mount namespaces to mount namespace rbtree outside of the
namespace semaphore.
- Lockless mount namespace lookup
Currently we take the read lock when looking for a mount namespace to
list mounts in. We can make this lockless. The simple search case can
just use a sequence counter to detect concurrent changes to the
rbtree.
For walking the list of mount namespaces sequentially via nsfs we keep
a separate rcu list as rb_prev() and rb_next() aren't usable safely
with rcu. Currently there is no primitive for retrieving the previous
list member. To do this we need a new deletion primitive that doesn't
poison the prev pointer and a corresponding retrieval helper.
Since creating mount namespaces is a relatively rare event compared
with querying mounts in a foreign mount namespace this is worth it.
Once libmount and systemd pick up this mechanism to list mounts in
foreign mount namespaces this will be used very frequently.
- Add extended selftests for lockless mount namespace iteration.
- Add a sample program to list all mounts on the system, i.e., in all
mount namespaces.
- Improve mount namespace iteration performance
Make finding the last or first mount to start iterating the mount
namespace from an O(1) operation and add selftests for iterating the
mount table starting from the first and last mount.
- Use an xarray for the old mount id
While the ida does use the xarray internally we can use it explicitly
which allows us to increment the unique mount id under the xa lock.
This allows us to remove the atomic as we're now allocating both ids
in one go.
/* Testing */
gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
This will have a merge conflict with vfs-6.14.mount pull request sent in
https://lore.kernel.org/r/20250118-vfs-pidfs-5921bfa5632a@brauner
and it can be resolved as follows:
diff --cc fs/namespace.c
index 64deda6f5b2c,371c860f49de..000000000000
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@@ -32,8 -32,6 +32,7 @@@
  #include <linux/fs_context.h>
  #include <linux/shmem_fs.h>
  #include <linux/mnt_idmapping.h>
 +#include <linux/pidfs.h>
- #include <linux/nospec.h>
  #include "pnode.h"
  #include "internal.h"
The following changes since commit 344bac8f0d73fe970cd9f5b2f132906317d29e8b:
fs: kill MNT_ONRB (2025-01-09 16:58:50 +0100)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.mount
for you to fetch changes up to f79e6eb84d4d2bff99e3ca6c1f140b2af827e904:
samples/vfs/mountinfo: Use __u64 instead of uint64_t (2025-01-10 12:08:27 +0100)
Please consider pulling these changes from the signed vfs-6.14-rc1.mount tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.14-rc1.mount
----------------------------------------------------------------
Christian Brauner (17):
mount: remove inlude/nospec.h include
fs: add mount namespace to rbtree late
Merge patch series "fs: listmount()/statmount() fix and sample program"
fs: lockless mntns rbtree lookup
rculist: add list_bidir_{del,prev}_rcu()
fs: lockless mntns lookup for nsfs
fs: simplify rwlock to spinlock
seltests: move nsfs into filesystems subfolder
selftests: add tests for mntns iteration
selftests: remove unneeded include
samples: add test-list-all-mounts
Merge patch series "fs: lockless mntns lookup"
fs: cache first and last mount
selftests: add listmount() iteration tests
Merge patch series "fs: tweak mntns iteration"
fs: use xarray for old mount id
fs: remove useless lockdep assertion
Geert Uytterhoeven (1):
samples/vfs/mountinfo: Use __u64 instead of uint64_t
Jeff Layton (2):
samples: add a mountinfo program to demonstrate statmount()/listmount()
fs: prepend statmount.mnt_opts string with security_sb_mnt_opts()
fs/mount.h | 31 ++-
fs/namespace.c | 200 +++++++++------
fs/nsfs.c | 5 +-
include/linux/rculist.h | 44 ++++
samples/vfs/.gitignore | 2 +
samples/vfs/Makefile | 2 +-
samples/vfs/mountinfo.c | 273 +++++++++++++++++++++
samples/vfs/test-list-all-mounts.c | 235 ++++++++++++++++++
.../selftests/{ => filesystems}/nsfs/.gitignore | 1 +
.../selftests/{ => filesystems}/nsfs/Makefile | 4 +-
.../selftests/{ => filesystems}/nsfs/config | 0
.../selftests/filesystems/nsfs/iterate_mntns.c | 149 +++++++++++
.../selftests/{ => filesystems}/nsfs/owner.c | 0
.../selftests/{ => filesystems}/nsfs/pidns.c | 0
.../selftests/filesystems/statmount/Makefile | 2 +-
.../filesystems/statmount/listmount_test.c | 66 +++++
tools/testing/selftests/pidfd/pidfd.h | 1 -
17 files changed, 918 insertions(+), 97 deletions(-)
create mode 100644 samples/vfs/mountinfo.c
create mode 100644 samples/vfs/test-list-all-mounts.c
rename tools/testing/selftests/{ => filesystems}/nsfs/.gitignore (78%)
rename tools/testing/selftests/{ => filesystems}/nsfs/Makefile (50%)
rename tools/testing/selftests/{ => filesystems}/nsfs/config (100%)
create mode 100644 tools/testing/selftests/filesystems/nsfs/iterate_mntns.c
rename tools/testing/selftests/{ => filesystems}/nsfs/owner.c (100%)
rename tools/testing/selftests/{ => filesystems}/nsfs/pidns.c (100%)
create mode 100644 tools/testing/selftests/filesystems/statmount/listmount_test.c
* Re: [GIT PULL] vfs mount
2025-01-18 13:06 Christian Brauner
@ 2025-01-20 0:10 ` Sasha Levin
2025-01-20 12:21 ` Christian Brauner
2025-01-20 18:59 ` pr-tracker-bot
1 sibling, 1 reply; 32+ messages in thread
From: Sasha Levin @ 2025-01-20 0:10 UTC (permalink / raw)
To: Christian Brauner; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel
On Sat, Jan 18, 2025 at 02:06:58PM +0100, Christian Brauner wrote:
> samples: add a mountinfo program to demonstrate statmount()/listmount()
Hi Jeff, Christian,
LKFT has caught a build error with the above commit:
/builds/linux/samples/vfs/mountinfo.c:235:18: error: 'SYS_pidfd_open' undeclared (first use in this function); did you mean 'SYS_mq_open'?
pidfd = syscall(SYS_pidfd_open, pid, 0);
^~~~~~~~~~~~~~
SYS_mq_open
The full log is here: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log
--
Thanks,
Sasha
* Re: [GIT PULL] vfs mount
2025-01-20 0:10 ` Sasha Levin
@ 2025-01-20 12:21 ` Christian Brauner
0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-01-20 12:21 UTC (permalink / raw)
To: Sasha Levin; +Cc: Linus Torvalds, linux-fsdevel, linux-kernel
On Sun, Jan 19, 2025 at 07:10:57PM -0500, Sasha Levin wrote:
> On Sat, Jan 18, 2025 at 02:06:58PM +0100, Christian Brauner wrote:
> > samples: add a mountinfo program to demonstrate statmount()/listmount()
>
> Hi Jeff, Christian,
>
> LKFT has caught a build error with the above commit:
>
> /builds/linux/samples/vfs/mountinfo.c:235:18: error: 'SYS_pidfd_open' undeclared (first use in this function); did you mean 'SYS_mq_open'?
> pidfd = syscall(SYS_pidfd_open, pid, 0);
> ^~~~~~~~~~~~~~
> SYS_mq_open
>
> The full log is here: https://qa-reports.linaro.org/lkft/sashal-linus-next/build/v6.13-rc7-511-g109a8e0fa9d6/testrun/26809210/suite/build/test/gcc-8-allyesconfig/log
Thanks for the report. This is a build failure in userspace sample code.
I have pushed a fix and generated a new tag which I will send out
shortly.
Christian
* Re: [GIT PULL] vfs mount
2025-01-18 13:06 Christian Brauner
2025-01-20 0:10 ` Sasha Levin
@ 2025-01-20 18:59 ` pr-tracker-bot
1 sibling, 0 replies; 32+ messages in thread
From: pr-tracker-bot @ 2025-01-20 18:59 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel
The pull request you sent on Sat, 18 Jan 2025 14:06:58 +0100:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.14-rc1.mount
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f9d94f78a8749e15de8aeb2e281898aa980e62d9
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* [GIT PULL] vfs mount
@ 2025-03-22 10:13 Christian Brauner
2025-03-24 21:00 ` pr-tracker-bot
0 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-03-22 10:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Christian Brauner, linux-fsdevel, linux-kernel
Hey Linus,
/* Summary */
This contains the first batch of mount updates for this cycle:
- Mount notifications
The day has come where we finally provide a new api to listen for
mount topology changes outside of /proc/<pid>/mountinfo. A mount
namespace file descriptor can be supplied and registered with fanotify
to listen for mount topology changes.
Currently notifications for mount, umount and moving mounts are
generated. The generated notification record contains the unique mount
id of the mount.
The listmount() and statmount() api can be used to query detailed
information about the mount using the received unique mount id.
This allows userspace to figure out exactly how the mount topology
changed without having to generate diffs of /proc/<pid>/mountinfo in
userspace.
- Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api.
- Support detached mounts in overlayfs.
Since last cycle we support specifying overlayfs layers via file
descriptors. However, we don't allow detached mounts which means
userspace cannot use file descriptors received via
open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach
them to a mount namespace via move_mount() first. This is cumbersome
and means they have to undo mounts via umount(). This allows them to
directly use detached mounts.
- Allow retrieving idmappings with statmount.
Currently it isn't possible to figure out what idmapping has been
attached to an idmapped mount. Add an extension to statmount() which
allows reading the idmapping from the mount.
- Allow creating idmapped mounts from mounts that are already idmapped.
So far it isn't possible to allow the creation of idmapped mounts from
already idmapped mounts as this has significant lifetime implications.
Make the creation of idmapped mounts atomic by allowing struct
mount_attr to be passed together with the open_tree_attr() system call,
which solves these issues without complicating VFS lookup in any way.
The system call has in general the benefit that creating a detached
mount and applying mount attributes to it becomes an atomic operation
for userspace.
- Add a way to query statmount() for supported options.
Allow userspace to query which mount information can be retrieved
through statmount().
- Allow superblock owners to force unmount.
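For the mount notification item above, here is a minimal sketch of
registering a mount namespace file descriptor with fanotify. The helper
name watch_mntns() is hypothetical, and the fallback values for
FAN_REPORT_MNT, FAN_MARK_MNTNS, FAN_MNT_ATTACH and FAN_MNT_DETACH are
assumptions for headers that predate this series; verify them against
include/uapi/linux/fanotify.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Fallbacks for headers that predate mount notifications (assumed values). */
#ifndef FAN_REPORT_MNT
#define FAN_REPORT_MNT 0x00004000
#endif
#ifndef FAN_MARK_MNTNS
#define FAN_MARK_MNTNS 0x00000110
#endif
#ifndef FAN_MNT_ATTACH
#define FAN_MNT_ATTACH 0x01000000
#endif
#ifndef FAN_MNT_DETACH
#define FAN_MNT_DETACH 0x02000000
#endif

/* Compile-time sanity check: the two event bits must not overlap. */
static_assert((FAN_MNT_ATTACH & FAN_MNT_DETACH) == 0,
	      "attach and detach event bits must be distinct");

/*
 * Create a fanotify group that reports mount attach/detach events for the
 * mount namespace behind nsfd. Returns the fanotify fd, or -1 on failure.
 */
static int watch_mntns(int nsfd)
{
	int fan = fanotify_init(FAN_REPORT_MNT | FAN_CLOEXEC, 0);

	if (fan < 0)
		return -1;
	if (fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_MNTNS,
			  FAN_MNT_ATTACH | FAN_MNT_DETACH, nsfd, NULL) < 0) {
		close(fan);
		return -1;
	}
	return fan;
}
```

Events read from the returned descriptor carry the unique mount id,
which can then be fed to statmount() or listmount() as described above.
Parsing the event records is left out of this sketch.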
/* Testing */
gcc version 14.2.0 (Debian 14.2.0-6)
Debian clang version 16.0.6 (27+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
This contains a merge conflict with the vfs-6.15.misc pull request:
diff --cc fs/internal.h
index 82127c69e641,db6094d5cb0b..000000000000
--- a/fs/internal.h
+++ b/fs/internal.h
@@@ -337,4 -338,4 +337,5 @@@ static inline bool path_mounted(const s
  	return path->mnt->mnt_root == path->dentry;
  }
  void file_f_owner_release(struct file *file);
 +bool file_seek_cur_needs_f_lock(struct file *file);
+ int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map);
The following changes since commit 2014c95afecee3e76ca4a56956a936e23283f05b:
Linux 6.14-rc1 (2025-02-02 15:39:26 -0800)
are available in the Git repository at:
git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
for you to fetch changes up to e1ff7aa34dec7e650159fd7ca8ec6af7cc428d9f:
umount: Allow superblock owners to force umount (2025-03-19 09:19:04 +0100)
Please consider pulling these changes from the signed vfs-6.15-rc1.mount tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.15-rc1.mount
----------------------------------------------------------------
Arnd Bergmann (1):
samples/vfs: fix printf format string for size_t
Christian Brauner (18):
Merge patch series "mount notification"
fs: support O_PATH fds with FSCONFIG_SET_FD
selftests/overlayfs: test specifying layers as O_PATH file descriptors
Merge patch series "ovl: allow O_PATH file descriptor when specifying layers"
fs: allow detached mounts in clone_private_mount()
uidgid: add map_id_range_up()
statmount: allow to retrieve idmappings
samples/vfs: check whether flag was raised
selftests: add tests for using detached mount with overlayfs
samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP
Merge patch series "fs: allow detached mounts in clone_private_mount()"
fs: add vfs_open_tree() helper
fs: add copy_mount_setattr() helper
fs: add open_tree_attr()
fs: add kflags member to struct mount_kattr
fs: allow changing idmappings
Merge patch series "statmount: allow to retrieve idmappings"
Merge patch series "fs: allow changing idmappings"
Jeff Layton (1):
statmount: add a new supported_mask field
Miklos Szeredi (5):
fsnotify: add mount notification infrastructure
fanotify: notify on mount attach and detach
vfs: add notifications for mount attach and detach
selinux: add FILE__WATCH_MOUNTNS
selftests: add tests for mount notification
Trond Myklebust (1):
umount: Allow superblock owners to force umount
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/tools/syscall_32.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/autofs/autofs_i.h | 2 +
fs/fsopen.c | 2 +-
fs/internal.h | 1 +
fs/mnt_idmapping.c | 51 ++
fs/mount.h | 26 ++
fs/namespace.c | 485 ++++++++++++++-----
fs/notify/fanotify/fanotify.c | 38 +-
fs/notify/fanotify/fanotify.h | 18 +
fs/notify/fanotify/fanotify_user.c | 89 +++-
fs/notify/fdinfo.c | 5 +
fs/notify/fsnotify.c | 47 +-
fs/notify/fsnotify.h | 11 +
fs/notify/mark.c | 14 +-
fs/pnode.c | 4 +-
include/linux/fanotify.h | 12 +-
include/linux/fsnotify.h | 20 +
include/linux/fsnotify_backend.h | 42 ++
include/linux/mnt_idmapping.h | 5 +
include/linux/syscalls.h | 4 +
include/linux/uidgid.h | 6 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fanotify.h | 10 +
include/uapi/linux/mount.h | 10 +-
kernel/user_namespace.c | 26 +-
samples/vfs/samples-vfs.h | 14 +-
samples/vfs/test-list-all-mounts.c | 35 +-
scripts/syscall.tbl | 1 +
security/selinux/hooks.c | 3 +
security/selinux/include/classmap.h | 2 +-
tools/testing/selftests/Makefile | 1 +
.../selftests/filesystems/mount-notify/.gitignore | 2 +
.../selftests/filesystems/mount-notify/Makefile | 6 +
.../filesystems/mount-notify/mount-notify_test.c | 516 +++++++++++++++++++++
.../filesystems/overlayfs/set_layers_via_fds.c | 195 ++++++++
.../selftests/filesystems/overlayfs/wrappers.h | 17 +
.../selftests/filesystems/statmount/statmount.h | 2 +-
52 files changed, 1567 insertions(+), 175 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/mount-notify/.gitignore
create mode 100644 tools/testing/selftests/filesystems/mount-notify/Makefile
create mode 100644 tools/testing/selftests/filesystems/mount-notify/mount-notify_test.c
* Re: [GIT PULL] vfs mount
2025-03-22 10:13 [GIT PULL] vfs mount Christian Brauner
@ 2025-03-24 21:00 ` pr-tracker-bot
2025-04-01 17:07 ` Leon Romanovsky
0 siblings, 1 reply; 32+ messages in thread
From: pr-tracker-bot @ 2025-03-24 21:00 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Christian Brauner, linux-fsdevel, linux-kernel
The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
* Re: [GIT PULL] vfs mount
2025-03-24 21:00 ` pr-tracker-bot
@ 2025-04-01 17:07 ` Leon Romanovsky
2025-04-03 8:29 ` Christian Brauner
0 siblings, 1 reply; 32+ messages in thread
From: Leon Romanovsky @ 2025-04-01 17:07 UTC (permalink / raw)
To: pr-tracker-bot, Christian Brauner
Cc: Linus Torvalds, linux-fsdevel, linux-kernel
On Mon, Mar 24, 2025 at 09:00:59PM +0000, pr-tracker-bot@kernel.org wrote:
> The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
>
> > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
>
> has been merged into torvalds/linux.git:
> https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
I didn't bisect, but this PR looks like the most likely candidate.
Linus's latest master generates the following slab-use-after-free:
[ 1845.404658] ==================================================================
[ 1845.405460] BUG: KASAN: slab-use-after-free in clone_private_mount+0x309/0x390
[ 1845.406205] Read of size 8 at addr ffff8881507b5ab0 by task dockerd/8697
[ 1845.406847]
[ 1845.407081] CPU: 5 UID: 0 PID: 8697 Comm: dockerd Not tainted 6.14.0master_fbece6d #1 NONE
[ 1845.407086] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1845.407097] Call Trace:
[ 1845.407102] <TASK>
[ 1845.407104] dump_stack_lvl+0x69/0xa0
[ 1845.407114] print_report+0x156/0x523
[ 1845.407120] ? __virt_addr_valid+0x1de/0x3c0
[ 1845.407124] ? clone_private_mount+0x309/0x390
[ 1845.407128] kasan_report+0xc1/0xf0
[ 1845.407134] ? clone_private_mount+0x309/0x390
[ 1845.407138] clone_private_mount+0x309/0x390
[ 1845.407144] ovl_fill_super+0x2965/0x59e0 [overlay]
[ 1845.407165] ? ovl_workdir_create+0x900/0x900 [overlay]
[ 1845.407177] ? wait_for_completion_io_timeout+0x20/0x20
[ 1845.407182] ? lockdep_init_map_type+0x58/0x220
[ 1845.407186] ? lockdep_init_map_type+0x58/0x220
[ 1845.407189] ? shrinker_register+0x177/0x200
[ 1845.407194] ? sget_fc+0x449/0xb30
[ 1845.407199] ? ovl_workdir_create+0x900/0x900 [overlay]
[ 1845.407211] ? get_tree_nodev+0xa5/0x130
[ 1845.407214] get_tree_nodev+0xa5/0x130
[ 1845.407218] ? cap_capable+0xd0/0x320
[ 1845.407223] vfs_get_tree+0x83/0x2e0
[ 1845.407227] ? ns_capable+0x55/0xb0
[ 1845.407232] path_mount+0x891/0x1aa0
[ 1845.407237] ? finish_automount+0x860/0x860
[ 1845.407240] ? kmem_cache_free+0x14c/0x4f0
[ 1845.407245] ? user_path_at+0x3d/0x50
[ 1845.407250] __x64_sys_mount+0x2d4/0x3a0
[ 1845.407254] ? path_mount+0x1aa0/0x1aa0
[ 1845.407259] do_syscall_64+0x6d/0x140
[ 1845.407263] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.407267] RIP: 0033:0x55e3487f1fea
[ 1845.407274] Code: e8 1b 96 fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
[ 1845.407278] RSP: 002b:000000c000b563b8 EFLAGS: 00000212 ORIG_RAX: 00000000000000a5
[ 1845.407282] RAX: ffffffffffffffda RBX: 000000c00006c000 RCX: 000055e3487f1fea
[ 1845.407285] RDX: 000000c0012cf7d8 RSI: 000000c0012616c0 RDI: 000000c0012cf7d0
[ 1845.407287] RBP: 000000c000b56458 R08: 000000c0004fa600 R09: 0000000000000000
[ 1845.407289] R10: 0000000000000000 R11: 0000000000000212 R12: 000000c0012cf7d0
[ 1845.407291] R13: 0000000000000000 R14: 000000c00098b6c0 R15: ffffffffffffffff
[ 1845.407296] </TASK>
[ 1845.407297]
[ 1845.431635] Allocated by task 17044:
[ 1845.432033] kasan_save_stack+0x1e/0x40
[ 1845.432463] kasan_save_track+0x10/0x30
[ 1845.432882] __kasan_slab_alloc+0x62/0x70
[ 1845.433308] kmem_cache_alloc_noprof+0x1a0/0x4a0
[ 1845.433781] alloc_vfsmnt+0x23/0x6c0
[ 1845.434195] vfs_create_mount+0x82/0x4a0
[ 1845.434623] path_mount+0x939/0x1aa0
[ 1845.435018] __x64_sys_mount+0x2d4/0x3a0
[ 1845.435440] do_syscall_64+0x6d/0x140
[ 1845.435842] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.436355]
[ 1845.436601] Freed by task 0:
[ 1845.436945] kasan_save_stack+0x1e/0x40
[ 1845.437354] kasan_save_track+0x10/0x30
[ 1845.437770] kasan_save_free_info+0x37/0x60
[ 1845.438217] __kasan_slab_free+0x33/0x40
[ 1845.438646] kmem_cache_free+0x14c/0x4f0
[ 1845.439068] rcu_core+0x605/0x1d50
[ 1845.439451] handle_softirqs+0x192/0x810
[ 1845.439880] irq_exit_rcu+0x106/0x190
[ 1845.440280] sysvec_apic_timer_interrupt+0x7c/0xb0
[ 1845.440785] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1845.441300]
[ 1845.441544] Last potentially related work creation:
[ 1845.442048] kasan_save_stack+0x1e/0x40
[ 1845.442465] kasan_record_aux_stack+0x97/0xa0
[ 1845.442921] __call_rcu_common.constprop.0+0x6d/0xb40
[ 1845.443437] task_work_run+0x111/0x1f0
[ 1845.443851] syscall_exit_to_user_mode+0x1df/0x1f0
[ 1845.444337] do_syscall_64+0x79/0x140
[ 1845.444758] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.445272]
[ 1845.445505] Second to last potentially related work creation:
[ 1845.446078] kasan_save_stack+0x1e/0x40
[ 1845.446494] kasan_record_aux_stack+0x97/0xa0
[ 1845.446947] task_work_add+0x178/0x250
[ 1845.447356] mntput_no_expire+0x4fc/0x9f0
[ 1845.447789] path_umount+0x4ed/0x10d0
[ 1845.448190] __x64_sys_umount+0xfb/0x120
[ 1845.448617] do_syscall_64+0x6d/0x140
[ 1845.449016] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.449529]
[ 1845.449766] The buggy address belongs to the object at ffff8881507b5a40
[ 1845.449766] which belongs to the cache mnt_cache of size 368
[ 1845.450898] The buggy address is located 112 bytes inside of
[ 1845.450898] freed 368-byte region [ffff8881507b5a40, ffff8881507b5bb0)
[ 1845.452009]
[ 1845.452250] The buggy address belongs to the physical page:
[ 1845.452808] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1507b4
[ 1845.453595] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 1845.454363] anon flags: 0x200000000000040(head|node=0|zone=2)
[ 1845.454936] page_type: f5(slab)
[ 1845.455300] raw: 0200000000000040 ffff8881009f5680 0000000000000000 dead000000000001
[ 1845.456077] raw: 0000000000000000 0000000080240024 00000000f5000000 0000000000000000
[ 1845.456857] head: 0200000000000040 ffff8881009f5680 0000000000000000 dead000000000001
[ 1845.457616] head: 0000000000000000 0000000080240024 00000000f5000000 0000000000000000
[ 1845.458399] head: 0200000000000002 ffffea000541ed01 ffffffffffffffff 0000000000000000
[ 1845.459169] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[ 1845.459945] page dumped because: kasan: bad access detected
[ 1845.460506]
[ 1845.460745] Memory state around the buggy address:
[ 1845.461228] ffff8881507b5980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
[ 1845.461963] ffff8881507b5a00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
[ 1845.462759] >ffff8881507b5a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1845.463480] ^
[ 1845.463968] ffff8881507b5b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1845.464704] ffff8881507b5b80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
[ 1845.465430] ==================================================================
[ 1845.466181] Disabling lock debugging due to kernel taint
[ 1845.466717] ==================================================================
[ 1845.467443] BUG: KASAN: slab-use-after-free in clone_private_mount+0x313/0x390
[ 1845.468192] Read of size 8 at addr ffff8881507b5a58 by task dockerd/8697
[ 1845.468837]
[ 1845.469072] CPU: 5 UID: 0 PID: 8697 Comm: dockerd Tainted: G B 6.14.0master_fbece6d #1 NONE
[ 1845.469078] Tainted: [B]=BAD_PAGE
[ 1845.469079] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1845.469082] Call Trace:
[ 1845.469084] <TASK>
[ 1845.469086] dump_stack_lvl+0x69/0xa0
[ 1845.469093] print_report+0x156/0x523
[ 1845.469098] ? __virt_addr_valid+0x1de/0x3c0
[ 1845.469103] ? clone_private_mount+0x313/0x390
[ 1845.469107] kasan_report+0xc1/0xf0
[ 1845.469112] ? clone_private_mount+0x313/0x390
[ 1845.469116] clone_private_mount+0x313/0x390
[ 1845.469121] ovl_fill_super+0x2965/0x59e0 [overlay]
[ 1845.469140] ? ovl_workdir_create+0x900/0x900 [overlay]
[ 1845.469152] ? wait_for_completion_io_timeout+0x20/0x20
[ 1845.469157] ? lockdep_init_map_type+0x58/0x220
[ 1845.469161] ? lockdep_init_map_type+0x58/0x220
[ 1845.469164] ? shrinker_register+0x177/0x200
[ 1845.469169] ? sget_fc+0x449/0xb30
[ 1845.469174] ? ovl_workdir_create+0x900/0x900 [overlay]
[ 1845.469185] ? get_tree_nodev+0xa5/0x130
[ 1845.469189] get_tree_nodev+0xa5/0x130
[ 1845.469192] ? cap_capable+0xd0/0x320
[ 1845.469198] vfs_get_tree+0x83/0x2e0
[ 1845.469202] ? ns_capable+0x55/0xb0
[ 1845.469206] path_mount+0x891/0x1aa0
[ 1845.469210] ? finish_automount+0x860/0x860
[ 1845.469217] ? kmem_cache_free+0x14c/0x4f0
[ 1845.469221] ? user_path_at+0x3d/0x50
[ 1845.469227] __x64_sys_mount+0x2d4/0x3a0
[ 1845.469231] ? path_mount+0x1aa0/0x1aa0
[ 1845.469235] do_syscall_64+0x6d/0x140
[ 1845.469239] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.469242] RIP: 0033:0x55e3487f1fea
[ 1845.469246] Code: e8 1b 96 fa ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
[ 1845.469249] RSP: 002b:000000c000b563b8 EFLAGS: 00000212 ORIG_RAX: 00000000000000a5
[ 1845.469253] RAX: ffffffffffffffda RBX: 000000c00006c000 RCX: 000055e3487f1fea
[ 1845.469256] RDX: 000000c0012cf7d8 RSI: 000000c0012616c0 RDI: 000000c0012cf7d0
[ 1845.469260] RBP: 000000c000b56458 R08: 000000c0004fa600 R09: 0000000000000000
[ 1845.469261] R10: 0000000000000000 R11: 0000000000000212 R12: 000000c0012cf7d0
[ 1845.469263] R13: 0000000000000000 R14: 000000c00098b6c0 R15: ffffffffffffffff
[ 1845.469268] </TASK>
[ 1845.469269]
[ 1845.494368] Allocated by task 17044:
[ 1845.494768] kasan_save_stack+0x1e/0x40
[ 1845.495185] kasan_save_track+0x10/0x30
[ 1845.495594] __kasan_slab_alloc+0x62/0x70
[ 1845.496024] kmem_cache_alloc_noprof+0x1a0/0x4a0
[ 1845.496518] alloc_vfsmnt+0x23/0x6c0
[ 1845.496911] vfs_create_mount+0x82/0x4a0
[ 1845.497333] path_mount+0x939/0x1aa0
[ 1845.497728] __x64_sys_mount+0x2d4/0x3a0
[ 1845.498167] do_syscall_64+0x6d/0x140
[ 1845.498563] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.499064]
[ 1845.499295] Freed by task 0:
[ 1845.499636] kasan_save_stack+0x1e/0x40
[ 1845.500052] kasan_save_track+0x10/0x30
[ 1845.500494] kasan_save_free_info+0x37/0x60
[ 1845.500934] __kasan_slab_free+0x33/0x40
[ 1845.501355] kmem_cache_free+0x14c/0x4f0
[ 1845.501774] rcu_core+0x605/0x1d50
[ 1845.502162] handle_softirqs+0x192/0x810
[ 1845.502587] irq_exit_rcu+0x106/0x190
[ 1845.502995] sysvec_apic_timer_interrupt+0x7c/0xb0
[ 1845.503487] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 1845.504002]
[ 1845.504236] Last potentially related work creation:
[ 1845.504748] kasan_save_stack+0x1e/0x40
[ 1845.505164] kasan_record_aux_stack+0x97/0xa0
[ 1845.505621] __call_rcu_common.constprop.0+0x6d/0xb40
[ 1845.506136] task_work_run+0x111/0x1f0
[ 1845.506545] syscall_exit_to_user_mode+0x1df/0x1f0
[ 1845.507038] do_syscall_64+0x79/0x140
[ 1845.507439] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.507949]
[ 1845.508187] Second to last potentially related work creation:
[ 1845.508760] kasan_save_stack+0x1e/0x40
[ 1845.509175] kasan_record_aux_stack+0x97/0xa0
[ 1845.509630] task_work_add+0x178/0x250
[ 1845.510040] mntput_no_expire+0x4fc/0x9f0
[ 1845.510468] path_umount+0x4ed/0x10d0
[ 1845.510870] __x64_sys_umount+0xfb/0x120
[ 1845.511298] do_syscall_64+0x6d/0x140
[ 1845.511700] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 1845.512210]
[ 1845.512442] The buggy address belongs to the object at ffff8881507b5a40
[ 1845.512442] which belongs to the cache mnt_cache of size 368
[ 1845.513553] The buggy address is located 24 bytes inside of
[ 1845.513553] freed 368-byte region [ffff8881507b5a40, ffff8881507b5bb0)
[ 1845.514650]
[ 1845.514883] The buggy address belongs to the physical page:
[ 1845.515436] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1507b4
[ 1845.516221] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 1845.516986] anon flags: 0x200000000000040(head|node=0|zone=2)
[ 1845.517549] page_type: f5(slab)
[ 1845.517912] raw: 0200000000000040 ffff8881009f5680 0000000000000000 dead000000000001
[ 1845.518684] raw: 0000000000000000 0000000080240024 00000000f5000000 0000000000000000
[ 1845.519445] head: 0200000000000040 ffff8881009f5680 0000000000000000 dead000000000001
[ 1845.520220] head: 0000000000000000 0000000080240024 00000000f5000000 0000000000000000
[ 1845.521006] head: 0200000000000002 ffffea000541ed01 ffffffffffffffff 0000000000000000
[ 1845.521812] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[ 1845.522581] page dumped because: kasan: bad access detected
[ 1845.523131]
[ 1845.523362] Memory state around the buggy address:
[ 1845.523851] ffff8881507b5900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1845.524588] ffff8881507b5980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
[ 1845.525321] >ffff8881507b5a00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
[ 1845.526059] ^
[ 1845.526651] ffff8881507b5a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1845.527378] ffff8881507b5b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1845.528095] ==================================================================
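For reading the two reports above: the "located N bytes inside of" figures are the faulting address minus the start of the freed mnt_cache object. A minimal sketch that redoes this arithmetic with the addresses copied from the reports; the helper names are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Bounds of the freed object, copied from both KASAN reports:
 * "freed 368-byte region [ffff8881507b5a40, ffff8881507b5bb0)". */
#define OBJ_START 0xffff8881507b5a40ULL
#define OBJ_END   0xffff8881507b5bb0ULL /* exclusive upper bound */

/* Hypothetical helper: offset of a faulting address inside the object. */
static uint64_t offset_in_object(uint64_t addr)
{
	return addr - OBJ_START;
}

/* Hypothetical helper: size of the freed region. */
static uint64_t object_size(void)
{
	return OBJ_END - OBJ_START;
}
```

Feeding in the two "Read of size 8 at addr" values gives offsets 112 and 24, and the region size comes out as 368, which matches the reports. Both reads hit the same freed object at different offsets.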
>
> Thank you!
>
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/prtracker.html
* Re: [GIT PULL] vfs mount
2025-04-01 17:07 ` Leon Romanovsky
@ 2025-04-03 8:29 ` Christian Brauner
2025-04-03 15:15 ` Christian Brauner
0 siblings, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-04-03 8:29 UTC (permalink / raw)
To: Leon Romanovsky
Cc: pr-tracker-bot, Linus Torvalds, linux-fsdevel, linux-kernel
On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> On Mon, Mar 24, 2025 at 09:00:59PM +0000, pr-tracker-bot@kernel.org wrote:
> > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> >
> > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
> >
> > has been merged into torvalds/linux.git:
> > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
>
> I didn't bisect, but this PR looks like the most relevant candidate.
> The latest Linus's master generates the following slab-use-after-free:
Sorry, I only saw this today. I'll take a look now.
* Re: [GIT PULL] vfs mount
2025-04-03 8:29 ` Christian Brauner
@ 2025-04-03 15:15 ` Christian Brauner
2025-04-03 15:34 ` James Bottomley
2025-04-03 18:24 ` Leon Romanovsky
0 siblings, 2 replies; 32+ messages in thread
From: Christian Brauner @ 2025-04-03 15:15 UTC (permalink / raw)
To: Leon Romanovsky, Linus Torvalds
Cc: pr-tracker-bot, linux-fsdevel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 886 bytes --]
On Thu, Apr 03, 2025 at 10:29:37AM +0200, Christian Brauner wrote:
> On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> > On Mon, Mar 24, 2025 at 09:00:59PM +0000, pr-tracker-bot@kernel.org wrote:
> > > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> > >
> > > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
> > >
> > > has been merged into torvalds/linux.git:
> > > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
> >
> > I didn't bisect, but this PR looks like the most relevant candidate.
> > The latest Linus's master generates the following slab-use-after-free:
>
> Sorry, did just see this today. I'll take a look now.
So in light of "Liberation Day" and the bug that caused this splat it's
time to quote Max Liebermann:
"I can't eat as much as I'd like to throw up."
[-- Attachment #2: 0001-fs-actually-hold-the-namespace-semaphore.patch --]
[-- Type: text/x-diff, Size: 912 bytes --]
From 8822177b7a8a7315446b4227c7eb7a36916a6d6d Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 3 Apr 2025 16:43:50 +0200
Subject: [PATCH] fs: actually hold the namespace semaphore
Don't use a scoped guard; use a regular guard to make sure that the
namespace semaphore is held across the whole function.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namespace.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 16292ff760c9..348008b9683b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const struct path *path)
struct mount *old_mnt = real_mount(path->mnt);
struct mount *new_mnt;
- scoped_guard(rwsem_read, &namespace_sem)
+ guard(rwsem_read, &namespace_sem);
+
if (IS_MNT_UNBINDABLE(old_mnt))
return ERR_PTR(-EINVAL);
--
2.47.2
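
To illustrate the failure mode the patch describes, here is a minimal
userspace sketch. It assumes GCC or Clang (for the cleanup attribute), and
every demo_* name is hypothetical: these macros only imitate the shape of
the kernel's guard()/scoped_guard(), they are not the real definitions from
include/linux/cleanup.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an rwsem: it only records whether it is held. */
struct demo_lock {
	bool held;
};

static struct demo_lock *demo_acquire(struct demo_lock *l)
{
	l->held = true;
	return l;
}

/* Cleanup handler: receives a pointer to the guard variable. */
static void demo_release(struct demo_lock **lp)
{
	(*lp)->held = false;
}

/* Function-scope guard: released when the enclosing block is left. */
#define demo_guard(l) \
	struct demo_lock *demo_guard_var \
		__attribute__((cleanup(demo_release))) = demo_acquire(l)

/* Scoped guard: a one-iteration for loop, so it covers exactly ONE statement. */
#define demo_scoped_guard(l) \
	for (struct demo_lock *demo_scope_var \
			__attribute__((cleanup(demo_release))) = demo_acquire(l), \
			*demo_once = demo_scope_var; \
	     demo_once; demo_once = NULL)

/* Buggy shape: without braces only the if statement runs under the lock. */
static int buggy_held_after_check(struct demo_lock *l, int flag)
{
	demo_scoped_guard(l)
	if (flag)
		return -1;

	return l->held;	/* lock was already dropped when the loop ended */
}

/* Fixed shape: the guard lives until the function returns. */
static int fixed_held_after_check(struct demo_lock *l, int flag)
{
	demo_guard(l);

	if (flag)
		return -1;

	return l->held;	/* evaluated while the lock is still held */
}
```

With a lock that starts out released, buggy_held_after_check(&l, 0) reports
that the lock is no longer held after the check, while
fixed_held_after_check(&l, 0) reports that it still is.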
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 15:15 ` Christian Brauner
@ 2025-04-03 15:34 ` James Bottomley
2025-04-03 17:21 ` Mateusz Guzik
2025-04-03 18:24 ` Leon Romanovsky
1 sibling, 1 reply; 32+ messages in thread
From: James Bottomley @ 2025-04-03 15:34 UTC (permalink / raw)
To: Christian Brauner, Leon Romanovsky, Linus Torvalds
Cc: pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, 2025-04-03 at 17:15 +0200, Christian Brauner wrote:
> On Thu, Apr 03, 2025 at 10:29:37AM +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> > > On Mon, Mar 24, 2025 at 09:00:59PM +0000,
> > > pr-tracker-bot@kernel.org wrote:
> > > > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> > > >
> > > > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
> > > > > tags/vfs-6.15-rc1.mount
> > > >
> > > > has been merged into torvalds/linux.git:
> > > > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
> > >
> > > I didn't bisect, but this PR looks like the most relevant
> > > candidate.
> > > The latest Linus's master generates the following slab-use-after-
> > > free:
> >
> > Sorry, did just see this today. I'll take a look now.
>
> So in light of "Liberation Day" and the bug that caused this splat
> it's
> time to quote Max Liebermann:
>
> "I can't eat as much as I'd like to throw up."
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const
> struct path *path)
> struct mount *old_mnt = real_mount(path->mnt);
> struct mount *new_mnt;
>
> - scoped_guard(rwsem_read, &namespace_sem)
> + guard(rwsem_read, &namespace_sem);
> +
> if (IS_MNT_UNBINDABLE(old_mnt))
> return ERR_PTR(-EINVAL);
>
Well that's a barfworthy oopsie, yes. However, it does strike me as an
easy one to make for a lot of these cleanup.h things since we have a
lot of scoped and unscoped variants. We should, at least, get
checkpatch to issue a warning about indentation expectations as it does
for our other scoped statements like for, while, if etc.
It looks quite simple if I got my Perl right (it's a bit rusty).
Regards,
James
---
diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 7b28ad331742..805b65098149 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -4347,7 +4347,7 @@ sub process {
}
# Check relative indent for conditionals and blocks.
- if ($line =~ /\b(?:(?:if|while|for|(?:[a-z_]+|)for_each[a-z_]+)\s*\(|(?:do|else)\b)/ && $line !~ /^.\s*#/ && $line !~ /\}\s*while\s*/) {
+ if ($line =~ /\b(?:(?:if|while|scoped_[a-z_]+|for|(?:[a-z_]+|)for_each[a-z_]+)\s*\(|(?:do|else)\b)/ && $line !~ /^.\s*#/ && $line !~ /\}\s*while\s*/) {
($stat, $cond, $line_nr_next, $remain_next, $off_next) =
ctx_statement_block($linenr, $realcnt, 0)
if (!defined $stat);
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 15:34 ` James Bottomley
@ 2025-04-03 17:21 ` Mateusz Guzik
2025-04-03 18:09 ` Linus Torvalds
0 siblings, 1 reply; 32+ messages in thread
From: Mateusz Guzik @ 2025-04-03 17:21 UTC (permalink / raw)
To: James Bottomley
Cc: Christian Brauner, Leon Romanovsky, Linus Torvalds,
pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 03, 2025 at 11:34:34AM -0400, James Bottomley wrote:
> On Thu, 2025-04-03 at 17:15 +0200, Christian Brauner wrote:
> > On Thu, Apr 03, 2025 at 10:29:37AM +0200, Christian Brauner wrote:
> > > On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> > > > On Mon, Mar 24, 2025 at 09:00:59PM +0000,
> > > > pr-tracker-bot@kernel.org wrote:
> > > > > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> > > > >
> > > > > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
> > > > > > tags/vfs-6.15-rc1.mount
> > > > >
> > > > > has been merged into torvalds/linux.git:
> > > > > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
> > > >
> > > > I didn't bisect, but this PR looks like the most relevant
> > > > candidate.
> > > > The latest Linus's master generates the following slab-use-after-
> > > > free:
> > >
> > > Sorry, did just see this today. I'll take a look now.
> >
> > So in light of "Liberation Day" and the bug that caused this splat
> > it's
> > time to quote Max Liebermann:
> >
> > "I can't eat as much as I'd like to throw up."
>
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const
> > struct path *path)
> > struct mount *old_mnt = real_mount(path->mnt);
> > struct mount *new_mnt;
> >
> > - scoped_guard(rwsem_read, &namespace_sem)
> > + guard(rwsem_read, &namespace_sem);
> > +
> > if (IS_MNT_UNBINDABLE(old_mnt))
> > return ERR_PTR(-EINVAL);
> >
>
> Well that's a barfworthy oopsie, yes. However, it does strike me as an
> easy one to make for a lot of these cleanup.h things since we have a
> lot of scoped and unscoped variants. We should, at least, get
> checkpatch to issue a warning about indentation expectations as it does
> for our other scoped statements like for, while, if etc.
>
I think this is too easy of a mistake to make to try to detect in
checkpatch.
I would argue it would be best if a language wizard came up with a way
to *demand* explicit use of { } and fail compilation if not present.
This would also provide a nice side effect of explicitly delineating
what's protected.
There are some legitimate { }-less users already, it should not pose
difficulty to patch them. I can do the churn, provided someone fixes the
problem.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 17:21 ` Mateusz Guzik
@ 2025-04-03 18:09 ` Linus Torvalds
2025-04-03 19:17 ` Mateusz Guzik
2025-04-04 8:28 ` Christoph Hellwig
0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2025-04-03 18:09 UTC (permalink / raw)
To: Mateusz Guzik
Cc: James Bottomley, Christian Brauner, Leon Romanovsky,
pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, 3 Apr 2025 at 10:21, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I would argue it would be best if a language wizard came up with a way
> to *demand* explicit use of { } and fail compilation if not present.
I tried to think of some sane model for it, but there isn't any good syntax.
The only way to enforce it would be to also have a "end" marker, ie do
something like
scoped_guard(x) {
...
} end_scoped_guard;
and that you could more-or-less enforce by having
#define scoped_guard(..) ... real guard stuff .. \
do {
#define end_scope } while (0)
where in addition we could add some dummy variable declaration inside
scoped_guard(), and have a dummy use of that variable in the
end_scope, just to further make sure the two pair up.
It does have the advantage of allowing more flexibility with fewer
tricks when you can define your scope in the macros. Right now
"scoped_guard()" plays some rather ugly games internally, just in
order to avoid this pattern.
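
A minimal userspace sketch of the begin/end pairing described above,
assuming GCC or Clang for the cleanup attribute. All demo_* names are
hypothetical and this is not how the kernel's scoped_guard() is defined;
it only shows how an unbalanced brace plus a dummy variable use can force
the two markers to appear together.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock: it only records whether it is held. */
struct demo_lock {
	bool held;
};

static struct demo_lock *demo_acquire(struct demo_lock *l)
{
	l->held = true;
	return l;
}

static void demo_release(struct demo_lock **lp)
{
	(*lp)->held = false;
}

/* Opens a block and takes the lock; deliberately leaves a brace unbalanced. */
#define demo_scope_begin(l) \
	do { \
		struct demo_lock *demo_scope_token \
			__attribute__((cleanup(demo_release))) = demo_acquire(l);

/*
 * Closes the block. The dummy use of demo_scope_token only compiles when a
 * matching demo_scope_begin() opened the enclosing block.
 */
#define demo_scope_end \
		(void)demo_scope_token; \
	} while (0)

static int sum_under_lock(struct demo_lock *l, const int *vals, int n)
{
	int total = 0;

	demo_scope_begin(l) {
		for (int i = 0; i < n; i++)
			total += vals[i];
	} demo_scope_end;

	/* demo_scope_end closed the block, so the lock is released here. */
	return l->held ? -1 : total;
}
```

Leaving out demo_scope_end makes the braces unbalanced and the build fails,
which is the enforcement this pattern buys at the cost of the extra marker.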
And that pattern isn't actually new. We *used* to have this pattern in
do_each_thread(g, t) {
...
} while_each_thread(g, t);
and honestly, people seemed to hate it.
(Also, sparse has that pattern as
FOR_EACH_PTR(filelist, file) {
...
} END_FOR_EACH_PTR(file);
and it actually works quite well and once you get used to it it's
nice, but I do think people tend to find it really really odd)
> This would also provide a nice side effect of explicitly delineating
> what's protected.
Sadly, I think we have too many uses for this to be worth it any more.
And I do suspect people would hate the odd "both beginning and end"
thing even if it adds some safety.
I dunno. I personally don't mind the "delineate both the beginning and
the end", but we don't have a great history of it.
Linus
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 15:15 ` Christian Brauner
2025-04-03 15:34 ` James Bottomley
@ 2025-04-03 18:24 ` Leon Romanovsky
2025-04-03 19:18 ` Linus Torvalds
2025-04-03 19:38 ` James Bottomley
1 sibling, 2 replies; 32+ messages in thread
From: Leon Romanovsky @ 2025-04-03 18:24 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 03, 2025 at 05:15:38PM +0200, Christian Brauner wrote:
> On Thu, Apr 03, 2025 at 10:29:37AM +0200, Christian Brauner wrote:
> > On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> > > On Mon, Mar 24, 2025 at 09:00:59PM +0000, pr-tracker-bot@kernel.org wrote:
> > > > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> > > >
> > > > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.15-rc1.mount
> > > >
> > > > has been merged into torvalds/linux.git:
> > > > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
> > >
> > > I didn't bisect, but this PR looks like the most relevant candidate.
> > > The latest Linus's master generates the following slab-use-after-free:
> >
> > Sorry, did just see this today. I'll take a look now.
>
> So in light of "Liberation Day" and the bug that caused this splat it's
> time to quote Max Liebermann:
>
> "I can't eat as much as I'd like to throw up."
> From 8822177b7a8a7315446b4227c7eb7a36916a6d6d Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Thu, 3 Apr 2025 16:43:50 +0200
> Subject: [PATCH] fs: actually hold the namespace semaphore
>
> Don't use a scoped guard; use a regular guard to make sure that the
> namespace semaphore is held across the whole function.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> fs/namespace.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 16292ff760c9..348008b9683b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const struct path *path)
> struct mount *old_mnt = real_mount(path->mnt);
> struct mount *new_mnt;
>
> - scoped_guard(rwsem_read, &namespace_sem)
> + guard(rwsem_read, &namespace_sem);
I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
and guard is declared as macro which gets only one argument: include/linux/cleanup.h
318 #define guard(_name) \
319 CLASS(_name, __UNIQUE_ID(guard))
20:52:24 fs/namespace.c: In function 'clone_private_mount':
20:52:24 fs/namespace.c:2481:41: error: macro "guard" passed 2 arguments, but takes just 1
20:52:24 2481 | guard(rwsem_read, &namespace_sem);
20:52:24 | ^
20:52:24 In file included from ./include/linux/preempt.h:11,
20:52:24 from ./include/linux/spinlock.h:56,
20:52:24 from ./include/linux/wait.h:9,
20:52:24 from ./include/linux/wait_bit.h:8,
20:52:24 from ./include/linux/fs.h:7,
20:52:24 from ./include/uapi/linux/aio_abi.h:31,
20:52:24 from ./include/linux/syscalls.h:83,
20:52:24 from fs/namespace.c:11:
20:52:24 ./include/linux/cleanup.h:318:9: note: macro "guard" defined here
20:52:24 318 | #define guard(_name) \
20:52:24 | ^~~~~
20:52:24 fs/namespace.c:2481:9: error: 'guard' undeclared (first use in this function)
20:52:24 2481 | guard(rwsem_read, &namespace_sem);
20:52:24 | ^~~~~
20:52:24 fs/namespace.c:2481:9: note: each undeclared identifier is reported only once for each function it appears in
Do I need to apply an extra patch?
Thanks
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 18:09 ` Linus Torvalds
@ 2025-04-03 19:17 ` Mateusz Guzik
2025-04-04 8:28 ` Christoph Hellwig
1 sibling, 0 replies; 32+ messages in thread
From: Mateusz Guzik @ 2025-04-03 19:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: James Bottomley, Christian Brauner, Leon Romanovsky,
pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 3, 2025 at 8:10 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 3 Apr 2025 at 10:21, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I would argue it would be best if a language wizard came up with a way
> > to *demand* explicit use of { } and fail compilation if not present.
>
> I tried to think of some sane model for it, but there isn't any good syntax.
>
> The only way to enforce it would be to also have a "end" marker, ie do
> something like
>
> scoped_guard(x) {
> ...
> } end_scoped_guard;
>
> and that you could more-or-less enforce by having
>
> #define scoped_guard(..) ... real guard stuff .. \
> do {
>
> #define end_scope } while (0)
>
Yeah, I was thinking about something like that but was thoroughly
dissatisfied with the idea.
Perhaps a tolerable fallback would be to rely on checkpatch after all,
but have it detect missing { } instead of relying on indentation
level?
--
Mateusz Guzik <mjguzik gmail.com>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 18:24 ` Leon Romanovsky
@ 2025-04-03 19:18 ` Linus Torvalds
2025-04-03 19:45 ` Christian Brauner
2025-04-04 6:16 ` Leon Romanovsky
2025-04-03 19:38 ` James Bottomley
1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2025-04-03 19:18 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christian Brauner, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, 3 Apr 2025 at 11:25, Leon Romanovsky <leon@kernel.org> wrote:
> >
> > - scoped_guard(rwsem_read, &namespace_sem)
> > + guard(rwsem_read, &namespace_sem);
>
> I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
> 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
> and guard is declared as macro which gets only one argument: include/linux/cleanup.h
> 318 #define guard(_name) \
> 319 CLASS(_name, __UNIQUE_ID(guard))
Christian didn't test his patch, obviously.
It should be
guard(rwsem_read)(&namespace_sem);
the guard() macro is kind of odd, but the oddity relates to how it
kind of takes a "class" thing as its argument, and that then expands
to the constructor that may or may not take arguments itself.
That made some of the macros simpler, although in retrospect the odd
syntax probably wasn't worth it.
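
A simplified userspace imitation of why the call needs two pairs of
parentheses, assuming GCC or Clang (cleanup attribute and __COUNTER__).
The demo_* and DEMO_* names are hypothetical; the real macros live in
include/linux/cleanup.h and differ in detail.

```c
#include <assert.h>

/* Hypothetical lock type standing in for an rwsem. */
struct demo_sem {
	int readers;
};

/*
 * A "class" is a type plus a constructor and a destructor whose names are
 * derived from the class name.
 */
typedef struct demo_sem *class_demo_read_t;

static class_demo_read_t class_demo_read_constructor(struct demo_sem *s)
{
	s->readers++;
	return s;
}

static void class_demo_read_destructor(class_demo_read_t *p)
{
	(*p)->readers--;
}

#define DEMO_CONCAT_(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_(a, b)

/*
 * Expands to a declaration that stops right before the constructor's
 * argument list; the caller supplies "(args)" as a second pair of parens.
 */
#define DEMO_CLASS(name, var) \
	class_##name##_t var \
		__attribute__((cleanup(class_##name##_destructor))) = \
		class_##name##_constructor

/* Takes only the class name; a unique variable name is generated. */
#define demo_class_guard(name) \
	DEMO_CLASS(name, DEMO_CONCAT(demo_guard_, __COUNTER__))

static int readers_inside(struct demo_sem *s)
{
	demo_class_guard(demo_read)(s);	/* class name first, then constructor args */

	return s->readers;
}
```

In this sketch demo_class_guard(demo_read, s) would fail with the same kind
of "passed 2 arguments, but takes just 1" error, because the macro itself
only accepts the class name.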
Linus
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 18:24 ` Leon Romanovsky
2025-04-03 19:18 ` Linus Torvalds
@ 2025-04-03 19:38 ` James Bottomley
1 sibling, 0 replies; 32+ messages in thread
From: James Bottomley @ 2025-04-03 19:38 UTC (permalink / raw)
To: Leon Romanovsky, Christian Brauner
Cc: Linus Torvalds, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, 2025-04-03 at 21:24 +0300, Leon Romanovsky wrote:
> On Thu, Apr 03, 2025 at 05:15:38PM +0200, Christian Brauner wrote:
> > On Thu, Apr 03, 2025 at 10:29:37AM +0200, Christian Brauner wrote:
> > > On Tue, Apr 01, 2025 at 08:07:15PM +0300, Leon Romanovsky wrote:
> > > > On Mon, Mar 24, 2025 at 09:00:59PM +0000,
> > > > pr-tracker-bot@kernel.org wrote:
> > > > > The pull request you sent on Sat, 22 Mar 2025 11:13:18 +0100:
> > > > >
> > > > > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
> > > > > > tags/vfs-6.15-rc1.mount
> > > > >
> > > > > has been merged into torvalds/linux.git:
> > > > > https://git.kernel.org/torvalds/c/fd101da676362aaa051b4f5d8a941bd308603041
> > > >
> > > > I didn't bisect, but this PR looks like the most relevant
> > > > candidate.
> > > > The latest Linus's master generates the following slab-use-
> > > > after-free:
> > >
> > > Sorry, did just see this today. I'll take a look now.
> >
> > So in light of "Liberation Day" and the bug that caused this splat
> > it's time to quote Max Liebermann:
> >
> > "I can't eat as much as I'd like to throw up."
>
> > From 8822177b7a8a7315446b4227c7eb7a36916a6d6d Mon Sep 17 00:00:00
> > 2001
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Thu, 3 Apr 2025 16:43:50 +0200
> > Subject: [PATCH] fs: actually hold the namespace semaphore
> >
> > Don't use a scoped guard; use a regular guard to make sure that the
> > namespace semaphore is held across the whole function.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > fs/namespace.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index 16292ff760c9..348008b9683b 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const
> > struct path *path)
> > struct mount *old_mnt = real_mount(path->mnt);
> > struct mount *new_mnt;
> >
> > - scoped_guard(rwsem_read, &namespace_sem)
> > + guard(rwsem_read, &namespace_sem);
>
> I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
> 'firewire-updates-6.15' of
> git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
> and guard is declared as macro which gets only one argument:
> include/linux/cleanup.h
> 318 #define guard(_name) \
> 319 CLASS(_name, __UNIQUE_ID(guard))
>
>
>
> 20:52:24 fs/namespace.c: In function 'clone_private_mount':
> 20:52:24 fs/namespace.c:2481:41: error: macro "guard" passed 2
> arguments, but takes just 1
> 20:52:24 2481 | guard(rwsem_read, &namespace_sem);
> 20:52:24 | ^
> 20:52:24 In file included from ./include/linux/preempt.h:11,
> 20:52:24 from ./include/linux/spinlock.h:56,
> 20:52:24 from ./include/linux/wait.h:9,
> 20:52:24 from ./include/linux/wait_bit.h:8,
> 20:52:24 from ./include/linux/fs.h:7,
> 20:52:24 from ./include/uapi/linux/aio_abi.h:31,
> 20:52:24 from ./include/linux/syscalls.h:83,
> 20:52:24 from fs/namespace.c:11:
> 20:52:24 ./include/linux/cleanup.h:318:9: note: macro "guard"
> defined here
> 20:52:24 318 | #define guard(_name) \
> 20:52:24 | ^~~~~
> 20:52:24 fs/namespace.c:2481:9: error: 'guard' undeclared (first use
> in this function)
> 20:52:24 2481 | guard(rwsem_read, &namespace_sem);
> 20:52:24 | ^~~~~
> 20:52:24 fs/namespace.c:2481:9: note: each undeclared identifier is
> reported only once for each function it appears in
>
> Do I need to apply extra patch?
I think the statement should be
guard(rwsem_read)(&namespace_sem);
Regards,
James
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 19:18 ` Linus Torvalds
@ 2025-04-03 19:45 ` Christian Brauner
2025-04-03 19:55 ` Christian Brauner
2025-04-04 6:16 ` Leon Romanovsky
1 sibling, 1 reply; 32+ messages in thread
From: Christian Brauner @ 2025-04-03 19:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 03, 2025 at 12:18:45PM -0700, Linus Torvalds wrote:
> On Thu, 3 Apr 2025 at 11:25, Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > - scoped_guard(rwsem_read, &namespace_sem)
> > > + guard(rwsem_read, &namespace_sem);
> >
> > I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
> > 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
> > and guard is declared as macro which gets only one argument: include/linux/cleanup.h
> > 318 #define guard(_name) \
> > 319 CLASS(_name, __UNIQUE_ID(guard))
>
> Christian didn't test his patch, obviously.
Yes, I just sent this out as "I get why this happens." after my
screaming "dammit" moment. Sorry that I didn't make this clear. I had a
pretty strong "ffs" 10 minutes after I had waded through the overlayfs
code I added without being able to figure out how the fsck this could've
happened. In any case, there's the obviously correct version now sitting
in the tree and it's seen testing obviously.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 19:45 ` Christian Brauner
@ 2025-04-03 19:55 ` Christian Brauner
0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-04-03 19:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]
On Thu, Apr 03, 2025 at 09:45:59PM +0200, Christian Brauner wrote:
> On Thu, Apr 03, 2025 at 12:18:45PM -0700, Linus Torvalds wrote:
> > On Thu, 3 Apr 2025 at 11:25, Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > - scoped_guard(rwsem_read, &namespace_sem)
> > > > + guard(rwsem_read, &namespace_sem);
> > >
> > > I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
> > > 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
> > > and guard is declared as macro which gets only one argument: include/linux/cleanup.h
> > > 318 #define guard(_name) \
> > > 319 CLASS(_name, __UNIQUE_ID(guard))
> >
> > Christian didn't test his patch, obviously.
>
> Yes, I just sent this out as "I get why this happens." after my
> screaming "dammit" moment. Sorry that I didn't make this clear. I had a
> pretty strong "ffs" 10 minutes after I had waded through the overlayfs
> code I added without being able to figure out how the fsck this could've
> happened. In any case, there's the obviously correct version now sitting
> in the tree and it's seen testing obviously.
I'll also append it here just in case you want to apply it right now.
[-- Attachment #2: v2-0001-fs-actually-hold-the-namespace-semaphore.patch --]
[-- Type: text/x-diff, Size: 915 bytes --]
From f5ff87a84a8803eeb4b344b9a496e7060787b42a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Thu, 3 Apr 2025 16:43:50 +0200
Subject: [PATCH v2] fs: actually hold the namespace semaphore
Don't use a scoped guard; use a regular guard to make sure that the
namespace semaphore is held across the whole function.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namespace.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 16292ff760c9..14935a0500a2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2478,7 +2478,8 @@ struct vfsmount *clone_private_mount(const struct path *path)
struct mount *old_mnt = real_mount(path->mnt);
struct mount *new_mnt;
- scoped_guard(rwsem_read, &namespace_sem)
+ guard(rwsem_read)(&namespace_sem);
+
if (IS_MNT_UNBINDABLE(old_mnt))
return ERR_PTR(-EINVAL);
--
2.47.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 19:18 ` Linus Torvalds
2025-04-03 19:45 ` Christian Brauner
@ 2025-04-04 6:16 ` Leon Romanovsky
1 sibling, 0 replies; 32+ messages in thread
From: Leon Romanovsky @ 2025-04-04 6:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 03, 2025 at 12:18:45PM -0700, Linus Torvalds wrote:
> On Thu, 3 Apr 2025 at 11:25, Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > - scoped_guard(rwsem_read, &namespace_sem)
> > > + guard(rwsem_read, &namespace_sem);
> >
> > I'm looking at Linus's master commit a2cc6ff5ec8f ("Merge tag
> > 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394")
> > and guard is declared as macro which gets only one argument: include/linux/cleanup.h
> > 318 #define guard(_name) \
> > 319 CLASS(_name, __UNIQUE_ID(guard))
>
> Christian didn't test his patch, obviously.
>
> It should be
>
> guard(rwsem_read)(&namespace_sem);
>
> the guard() macro is kind of odd, but the oddity relates to how it
> kind of takes a "class" thing as its argument, and that then expands
> to the constructor that may or may not take arguments itself.
Thanks, fixed.
Regarding syntax, in my opinion it is too odd and not intuitive.
>
> That made some of the macros simpler, although in retrospect the odd
> syntax probably wasn't worth it.
>
> Linus
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-03 18:09 ` Linus Torvalds
2025-04-03 19:17 ` Mateusz Guzik
@ 2025-04-04 8:28 ` Christoph Hellwig
2025-04-04 14:19 ` Linus Torvalds
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-04-04 8:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mateusz Guzik, James Bottomley, Christian Brauner,
Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
On Thu, Apr 03, 2025 at 11:09:41AM -0700, Linus Torvalds wrote:
> On Thu, 3 Apr 2025 at 10:21, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > I would argue it would be best if a language wizard came up with a way
> > to *demand* explicit use of { } and fail compilation if not present.
>
> I tried to think of some sane model for it, but there isn't any good syntax.
>
> The only way to enforce it would be to also have a "end" marker, ie do
> something like
Or just kill the non-scoped guard because it simply is an insane API.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-04 8:28 ` Christoph Hellwig
@ 2025-04-04 14:19 ` Linus Torvalds
2025-04-07 8:51 ` Christoph Hellwig
2025-04-07 11:22 ` Christian Brauner
0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2025-04-04 14:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mateusz Guzik, James Bottomley, Christian Brauner,
Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
On Fri, 4 Apr 2025 at 01:28, Christoph Hellwig <hch@infradead.org> wrote:
>
> Or just kill the non-scoped guard because it simply is an insane API.
The scoped guard may be odd, but it's actually rather a common
situation. And when used with the proper indentation, it also ends up
being pretty visually clear about what part of a function is under the
lock.
But yeah, if you don't end up using it right, it ends up very very wrong.
Not that that is any different from "if ()" or any other similar
construct, but obviously people are much more *used* to 'if ()' and
friends.
An 'if ()' without the nested statement looks very wrong - although
it's certainly not unheard of - while a 'scoped_guard()' without the
nested statement might visually pass just because it doesn't trigger
the same visceral "that's not right" reaction.
So I don't think it's an insane API, I think it's mostly that it's a
_newish_ API.
Linus
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-04 14:19 ` Linus Torvalds
@ 2025-04-07 8:51 ` Christoph Hellwig
2025-04-07 16:00 ` Linus Torvalds
2025-04-07 11:22 ` Christian Brauner
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2025-04-07 8:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Mateusz Guzik, James Bottomley,
Christian Brauner, Leon Romanovsky, pr-tracker-bot, linux-fsdevel,
linux-kernel
On Fri, Apr 04, 2025 at 07:19:27AM -0700, Linus Torvalds wrote:
> On Fri, 4 Apr 2025 at 01:28, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Or just kill the non-scoped guard because it simply is an insane API.
>
> The scoped guard may be odd, but it's actually rather a common
> situation. And when used with the proper indentation, it also ends up
> being pretty visually clear about what part of a function is under the
> lock.
The scoped one with proper indentation is fine. The non-scoped one is
the one that is really confusing and odd.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-04 14:19 ` Linus Torvalds
2025-04-07 8:51 ` Christoph Hellwig
@ 2025-04-07 11:22 ` Christian Brauner
1 sibling, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-04-07 11:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Mateusz Guzik, James Bottomley,
Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
On Fri, Apr 04, 2025 at 07:19:27AM -0700, Linus Torvalds wrote:
> On Fri, 4 Apr 2025 at 01:28, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > Or just kill the non-scoped guard because it simply is an insane API.
>
> The scoped guard may be odd, but it's actually rather a common
> situation. And when used with the proper indentation, it also ends up
> being pretty visually clear about what part of a function is under the
> lock.
>
> But yeah, if you don't end up using it right, it ends up very very wrong.
>
> Not that that is any different from "if ()" or any other similar
> construct, but obviously people are much more *used* to 'if ()' and
> friends.
>
> An 'if ()' without the nested statement looks very wrong - although
> it's certainly not unheard of - while a 'scoped_guard()' without the
> nested statement might visually pass just because it doesn't trigger
> the same visceral "that's not right" reaction.
>
> So I don't think it's an insane API, I think it's mostly that it's a
> _newish_ API.
Both the scoped and non-scoped guards are very useful. I initially used
a scoped variant but then reworked the code to use a non-scoped one and
fscked it up.
I agree with Linus here: it was just me not having the same "Oh right,
that's odd" reaction.
I love the guard infrastructure. It's a massive improvement. Thanks to
Peter for finally bringing this into the kernel after I've worked with
this for years in userspace already. It literally helped obliterate
nearly all memory safety bugs in systemd and I'm confident it will have
positive effects in the kernel long-term as well.
And please, can we (collective we) for once all decide to not turn yet
another issue into a two week thread of New York Times Opinion pieces on
how Things Really Are and Should Have Been Done. :)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] vfs mount
2025-04-07 8:51 ` Christoph Hellwig
@ 2025-04-07 16:00 ` Linus Torvalds
2025-04-08 5:06 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2025-04-07 16:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mateusz Guzik, James Bottomley, Christian Brauner,
Leon Romanovsky, pr-tracker-bot, linux-fsdevel, linux-kernel
On Mon, 7 Apr 2025 at 01:51, Christoph Hellwig <hch@infradead.org> wrote:
>
> The scoped one with proper indentation is fine. The non-scoped one is
> the one that is really confusing and odd.
Ahh, I misunderstood you.
You're obviously right in a "visually obvious" way - even if it was
the scoped one that caused problems.
But the non-scoped one is *so* convenient when you have a helper
function that just wants to run with some lock (or RCU) held.
There's a reason we have more than two _thousand_ uses of it by now in
the kernel (~4x as many as the scoped version). It just makes code
look so much simpler.
I was nervous about the lock guard macros initially, but it really
does get rid of pointless boilerplate code. Even without error
handling complications it just makes code simpler.
Linus
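For readers who have not seen the pattern, here is a minimal userspace sketch of a non-scoped guard. It assumes GCC or Clang (for `__attribute__((cleanup))`) and POSIX threads. The names `mutex_guard()`, `bump_counter()` and `counter_lock_is_free()` are hypothetical and only illustrate the idea; this is not the kernel's `guard()` implementation.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter;

/* Cleanup handler: called with a pointer to the guard variable
 * whenever that variable goes out of scope. */
static void mutex_guard_release(pthread_mutex_t **m)
{
	int ret = pthread_mutex_unlock(*m);

	assert(ret == 0);
	(void)ret;
}

#define MUTEX_GUARD_CAT_(a, b) a##b
#define MUTEX_GUARD_CAT(a, b) MUTEX_GUARD_CAT_(a, b)

/* Hypothetical stand-in for a non-scoped guard: takes the lock now and
 * drops it when the enclosing block is left, on any path. */
#define mutex_guard(m)							\
	pthread_mutex_t *MUTEX_GUARD_CAT(guard_, __LINE__)		\
		__attribute__((cleanup(mutex_guard_release), unused)) =	\
		(pthread_mutex_lock(m), (m))

/* Returns the new counter value, or -1 once the limit is reached. */
static int bump_counter(int limit)
{
	mutex_guard(&counter_lock);

	if (counter >= limit)
		return -1;	/* lock dropped here ... */
	return ++counter;	/* ... and here, with no unlock label */
}

/* Returns 1 if the lock is currently free, 0 otherwise. */
static int counter_lock_is_free(void)
{
	if (pthread_mutex_trylock(&counter_lock) != 0)
		return 0;
	pthread_mutex_unlock(&counter_lock);
	return 1;
}
```

Both return paths in `bump_counter()` release the lock without an explicit unlock call, which is the boilerplate reduction described above. The cost is that the lock is held until the end of the enclosing block, wherever the guard line happens to sit in it.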
* Re: [GIT PULL] vfs mount
2025-04-07 16:00 ` Linus Torvalds
@ 2025-04-08 5:06 ` Christoph Hellwig
0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-04-08 5:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Hellwig, Mateusz Guzik, James Bottomley,
Christian Brauner, Leon Romanovsky, pr-tracker-bot, linux-fsdevel,
linux-kernel
On Mon, Apr 07, 2025 at 09:00:10AM -0700, Linus Torvalds wrote:
> On Mon, 7 Apr 2025 at 01:51, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > The scoped one with proper indentation is fine. The non-scoped one is
> > the one that is really confusing and odd.
>
> Ahh, I misunderstood you.
>
> You're obviously right in a "visually obvious" way - even if it was
> the scoped one that caused problems.
>
> But the non-scoped one is *so* convenient when you have a helper
> function that just wants to run with some lock (or RCU) held.
I wish we'd just have a way to run an existing scope, especially a
function run with a lock, e.g.
int some_helper(....)
	scoped_lock(&some_mutex)
{
	...
}
which would give you that with a much more obvious and readable
syntax. Not taking the resource in the middle of the block and
releasing it at the end will also fix tons of bugs from non-obvious
behavior.
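The function-level syntax above is a proposal and does not exist. The closest existing shape is a scoped guard, where the lock covers exactly one braced block. A minimal userspace sketch follows, assuming GCC or Clang (for `__attribute__((cleanup))`) and POSIX threads. The names `scoped_mutex()`, `list_add_one()` and `list_lock_is_free()` are hypothetical, and the one-iteration `for` loop is only an illustration of the technique, not the kernel's `scoped_guard()` definition.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static int list_len;

/* Cleanup handler: runs when the loop variable below leaves scope. */
static void scoped_unlock(pthread_mutex_t **m)
{
	int ret = pthread_mutex_unlock(*m);

	assert(ret == 0);
	(void)ret;
}

/*
 * Hypothetical scoped_mutex(): a for loop that runs its body once.
 * The loop variable carries the cleanup handler, so the lock is held
 * for exactly the statement or block that follows the macro.
 */
#define scoped_mutex(m)							\
	for (pthread_mutex_t *scope_					\
		__attribute__((cleanup(scoped_unlock))) =		\
		(pthread_mutex_lock(m), (m)), *done_ = NULL;		\
	     !done_; done_ = scope_)

/* Returns the new length, or -1 if the list is already at max. */
static int list_add_one(int max)
{
	int len = -1;

	scoped_mutex(&list_lock) {
		if (list_len >= max)
			return -1;	/* lock released on the way out */
		len = ++list_len;
	}
	/* The lock is no longer held here. */
	return len;
}

/* Returns 1 if the lock is currently free, 0 otherwise. */
static int list_lock_is_free(void)
{
	if (pthread_mutex_trylock(&list_lock) != 0)
		return 0;
	pthread_mutex_unlock(&list_lock);
	return 1;
}
```

With this shape the indentation shows where the lock starts and ends. The hazard mentioned earlier in the thread still applies: `scoped_mutex(&list_lock);` with a stray semicolon is still valid C, and it takes and drops the lock around an empty statement.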
end of thread, other threads:[~2025-04-08 5:06 UTC | newest]
Thread overview: 32+ messages
2025-03-22 10:13 [GIT PULL] vfs mount Christian Brauner
2025-03-24 21:00 ` pr-tracker-bot
2025-04-01 17:07 ` Leon Romanovsky
2025-04-03 8:29 ` Christian Brauner
2025-04-03 15:15 ` Christian Brauner
2025-04-03 15:34 ` James Bottomley
2025-04-03 17:21 ` Mateusz Guzik
2025-04-03 18:09 ` Linus Torvalds
2025-04-03 19:17 ` Mateusz Guzik
2025-04-04 8:28 ` Christoph Hellwig
2025-04-04 14:19 ` Linus Torvalds
2025-04-07 8:51 ` Christoph Hellwig
2025-04-07 16:00 ` Linus Torvalds
2025-04-08 5:06 ` Christoph Hellwig
2025-04-07 11:22 ` Christian Brauner
2025-04-03 18:24 ` Leon Romanovsky
2025-04-03 19:18 ` Linus Torvalds
2025-04-03 19:45 ` Christian Brauner
2025-04-03 19:55 ` Christian Brauner
2025-04-04 6:16 ` Leon Romanovsky
2025-04-03 19:38 ` James Bottomley
-- strict thread matches above, loose matches on Subject: below --
2025-01-18 13:06 Christian Brauner
2025-01-20 0:10 ` Sasha Levin
2025-01-20 12:21 ` Christian Brauner
2025-01-20 18:59 ` pr-tracker-bot
2024-09-13 14:41 Christian Brauner
2024-09-14 2:33 ` Stephen Rothwell
2024-09-16 11:09 ` pr-tracker-bot
2024-05-10 11:46 Christian Brauner
2024-05-13 19:38 ` pr-tracker-bot
2023-06-23 11:03 [GIT PULL] vfs: mount Christian Brauner
2023-06-26 17:34 ` pr-tracker-bot